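I'm getting my BeautifulSoup and Python bearings by scraping a friend's (structured, if clunky) website, with the long-term goal of migrating the whole thing into a content management system.
If I pull out a single cell by itself in the console (with soup = BeautifulSoup(urllib2.urlopen("http://www.bicyclepaintings.com/archive/index.html"))):
cell = soup.find_all('td',{'valign':'bottom'})[3]
I can play around with pulling out substrings. Both of these work fine: cell.br.next_sibling and cell.find('b').text. But when I try to use a for loop to go through all the cells: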
def parse_archive(url):
    soup = get_soup(url)
    paintings = []
    for cell in soup.find_all('td',{'valign':'bottom'}):
        painting_title = cell.find('b').text
        painting_media = cell.br.next_sibling
        record = painting_title, painting_media
        paintings.append(record)
    return paintings
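I get an attribute error (AttributeError: 'NoneType' object has no attribute 'text'). I can get at some of the same information by looping again:
for item in cell.find_all('b'):
    painting_title = item.text
But then I have no way of getting at the sibling of the <br/>, and (more importantly) I don't understand why it works when I pull out a single item but fails when I try to access the items through a for loop. What am I missing here?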
Answer 0 (score: 0)
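Your problem is that the site you are trying to scrape has a bunch of <td> tags at the end that do not contain a <b> tag. For those cells, cell.find('b') returns None, so accessing .text on it raises the AttributeError. The cell at index 3 happens to contain a <b>, which is why the single-cell test in the console worked. The empty cells look like this: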
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use " for quotes --></font></p></td>
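To confirm which cells are affected, a small check like the following can list the indexes of the matching cells that have no <b>. This is only a sketch: the sample markup and the function name cells_without_title are made up for illustration, and the built-in "html.parser" backend is an assumption rather than what the original code used.

```python
from bs4 import BeautifulSoup

# Made-up sample: one populated cell, then one empty placeholder cell
# shaped like the ones at the end of the archive page.
SAMPLE_HTML = """
<table><tr>
<td nowrap valign="bottom"><p><font><b>Some Title</b><br>Some media</font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data --></font></p></td>
</tr></table>
"""

def cells_without_title(html):
    """Return the indexes of valign="bottom" cells that contain no <b> tag."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", {"valign": "bottom"})
    return [i for i, cell in enumerate(cells) if cell.find("b") is None]

print(cells_without_title(SAMPLE_HTML))
```

Any index it reports is a cell where cell.find('b').text would fail inside the loop.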
You just need to modify your code to skip these cells:
for cell in soup.find_all('td',{'valign':'bottom'}):
    title = cell.find('b')
    if title is None:
        continue
    painting_title = title.text
    painting_media = cell.br.next_sibling
    record = painting_title, painting_media
    paintings.append(record)
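For reference, here is the same guard as a self-contained function that can be run without fetching the site. It is a minimal sketch: the painting titles and media in the sample markup are invented, parse_cells is a hypothetical name, and it parses a string with "html.parser" instead of going through the get_soup helper from the question.

```python
from bs4 import BeautifulSoup

# Invented sample data: two populated cells followed by an empty placeholder.
ARCHIVE_HTML = """
<table><tr>
<td nowrap valign="bottom"><p><font><b>Red Bike</b><br>Oil on canvas</font></p></td>
<td nowrap valign="bottom"><p><font><b>Blue Bike</b><br>Acrylic on board</font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data --></font></p></td>
</tr></table>
"""

def parse_cells(html):
    """Collect (title, media) tuples, skipping cells that have no <b> tag."""
    soup = BeautifulSoup(html, "html.parser")
    paintings = []
    for cell in soup.find_all("td", {"valign": "bottom"}):
        title = cell.find("b")
        if title is None:
            # Empty placeholder cell: nothing to record.
            continue
        # Guard against a titled cell that has no <br> at all.
        media = cell.br.next_sibling if cell.br is not None else None
        media_text = str(media).strip() if media is not None else ""
        paintings.append((title.text, media_text))
    return paintings

for record in parse_cells(ARCHIVE_HTML):
    print(record)
```

Checking cell.br before using it is an extra precaution beyond the fix above, in case a cell has a title but no line break.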
As for extracting painting_media, you can use:
painting_media = list(cell.br.children)
painting_media = painting_media[0].strip() if painting_media else ''
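Whether the text ends up among the children of the <br> or as its next sibling can depend on the markup and the parser. With a parser that treats <br> as an empty element, the children list would be empty and the text would sit in next_sibling. A hedged alternative is sketched below; media_after_br is a hypothetical helper, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

def media_after_br(cell):
    """Return the stripped text that directly follows the first <br>, or ''."""
    br = cell.find("br")
    if br is None:
        return ""
    sibling = br.next_sibling
    # Comment is a NavigableString subclass, so exclude it explicitly.
    if isinstance(sibling, NavigableString) and not isinstance(sibling, Comment):
        return str(sibling).strip()
    return ""

# Made-up cell for demonstration.
demo = BeautifulSoup("<td><b>Some Title</b><br> Some media </td>", "html.parser")
print(media_after_br(demo.td))
```

Returning an empty string for missing or non-text siblings keeps the record shape consistent across all cells.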