Scraping a table

Date: 2012-10-26 04:09:52

Tags: python beautifulsoup urllib2

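I'm getting my BeautifulSoup and Python bearings by scraping a friend's (structured, if clunky) website, with the long-term goal of migrating the whole thing into a content management system.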

If I pull out a single cell in the console (with soup = BeautifulSoup(urllib2.urlopen("http://www.bicyclepaintings.com/archive/index.html"))):

cell = soup.find_all('td',{'valign':'bottom'})[3]

I can play around with pulling out substrings. Both of these work fine: cell.br.next_sibling and cell.find('b').text. But when I try to iterate over all the cells with a for loop:

def parse_archive(url):
    soup = get_soup(url)
    paintings = []
    for cell in soup.find_all('td',{'valign':'bottom'}):
        painting_title = cell.find('b').text
        painting_media = cell.br.next_sibling 
        record = painting_title, painting_media
        paintings.append(record)
    return paintings

I get an attribute error (AttributeError: 'NoneType' object has no attribute 'text'). I can get some of the same information by looping again inside each cell:

    for item in cell.find_all('b'):
        painting_title = item.text

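But that gives me no way to get the sibling of the <br/>, and (more importantly) I don't understand why this works when I pull out a single item but fails when I access the cells through a for loop. What am I missing here?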

1 answer:

Answer 0 (score: 0)

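Your problem is that the page you are scraping ends with a bunch of <td> tags that match your search but do not contain a <b> tag. For those cells, cell.find('b') returns None, and calling .text on None raises the AttributeError: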

<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>

You just need to modify your code to skip those cells:

for cell in soup.find_all('td',{'valign':'bottom'}):
    title = cell.find('b')
    if title is None:
        continue
    painting_title = title.text
    painting_media = cell.br.next_sibling 
    record = painting_title, painting_media
    paintings.append(record)
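
To try the fix without hitting the live site, here is a minimal self-contained sketch. The HTML snippet is invented to mimic the page's structure (one filled cell followed by one placeholder cell), so the title, media string, image name, and the parse_cells helper are all hypothetical. It assumes bs4 with Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the archive page:
# one real painting cell and one empty placeholder cell.
SAMPLE_HTML = """
<table><tr>
<td nowrap valign="bottom"><img src="a.jpg">
<p><font><b>Red Bicycle</b><br>oil on canvas</font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data --></font></p></td>
</tr></table>
"""

def parse_cells(html):
    """Return (title, media) tuples, skipping cells that have no <b> tag."""
    soup = BeautifulSoup(html, "html.parser")
    paintings = []
    for cell in soup.find_all("td", {"valign": "bottom"}):
        title = cell.find("b")
        if title is None:
            # Placeholder cell: find() returned None, so skip it.
            continue
        # Guard against a missing <br> as well as a missing sibling.
        media = cell.br.next_sibling if cell.br is not None else None
        media_text = str(media).strip() if media is not None else ""
        paintings.append((title.text, media_text))
    return paintings

paintings = parse_cells(SAMPLE_HTML)
print(paintings)
```

Both cells match the valign="bottom" filter, but only the first one yields a record, which is why a single hand-picked cell worked in the console while the loop failed.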

As for extracting painting_media, you could use:

painting_media = list(cell.br.children)
painting_media = painting_media[0].strip() if painting_media else ''
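
Depending on the parser, <br> may be treated as an empty element with no children, in which case the snippet above would always produce an empty string. A hedged alternative, assuming the media text is the node immediately after the first <br> in the cell, is to check the type of next_sibling before using it. The helper name media_after_br is hypothetical.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

def media_after_br(cell):
    """Return the stripped text node right after the cell's first <br>, or ''."""
    br = cell.find("br")
    if br is None:
        return ""
    sibling = br.next_sibling
    # Accept plain text only: skip tags, HTML comments, and a missing sibling.
    if isinstance(sibling, NavigableString) and not isinstance(sibling, Comment):
        return sibling.strip()
    return ""

# Usage on a made-up cell (parsed with the built-in html.parser).
demo_cell = BeautifulSoup("<td><b>T</b><br> oil on canvas </td>", "html.parser").td
print(media_after_br(demo_cell))
```

Returning an empty string for every unexpected shape keeps the loop from raising on placeholder cells, at the cost of silently dropping media text that is wrapped in another tag.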