Question

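I'm writing a small text-scraping script in Python. It's my first larger project, so I'm running into some problems. I'm using urllib2 and BeautifulSoup. I want to scrape the song names from a playlist. I can get either a single song name, or all the song names plus other strings I don't need, but I can't get just the song names. This is the code that gets all the song names plus the strings I don't need: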

import urllib2
from bs4 import BeautifulSoup
import re

response = urllib2.urlopen('http://guardsmanbob.com/media/playlist.php?char=a').read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr')[0]:
    for td in soup.findAll('a'):
        print td.contents[0]

The code that gives me one song:

print soup.findAll('tr')[1].findAll('a')[0].contents[0]

That isn't actually a loop, so I only get one name, but when I try to turn it into a loop I get the same song name 10 times. That code:

for tr in soup.findAll('tr')[1]:
    for td in soup.findAll('td')[0]:
        print td.contents[0]

I've been stuck on this for a day now and I can't get it to work. I don't understand how these things work.

Answer 1

for tr in soup.findAll('tr'):  # 1
    if not tr.find('td'): continue  # 2
    for td in tr.find('td').findAll('a'):  # 3
        print td.contents[0]

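1. You want to iterate over all the tr elements, so use findAll('tr') instead of findAll('tr')[0].
2. Some rows don't contain a td, so we need to skip them to avoid an AttributeError (try removing this line to see it).
3. As in 1, you want all the a elements in the first td. Note also that it is "for td in tr.find", not "for td in soup.find", because you want to search inside the current tr, not the whole document (soup).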
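To make point 3 concrete, here is a minimal sketch that contrasts a search scoped to each row with a search over the whole document. It is written in Python 3 syntax (the code in this thread is Python 2), and the markup in SAMPLE_HTML is a hypothetical stand-in for the playlist page, whose real structure may differ. The function names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the playlist page (an assumption,
# not the real page source).
SAMPLE_HTML = """
<table>
  <tr><th>Song</th><th>Artist</th></tr>
  <tr><td><a href="/s/1">Song A</a></td><td><a href="/a/1">Artist A</a></td></tr>
  <tr><td><a href="/s/2">Song B</a></td><td><a href="/a/2">Artist B</a></td></tr>
</table>
"""


def song_names(html):
    """Collect link text from the first td of every row that has one."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.find_all("tr"):
        first_td = tr.find("td")
        if first_td is None:
            # Header rows contain only th cells, so skip them.
            continue
        # Search inside this row's first cell, not the whole document.
        for a in first_td.find_all("a"):
            names.append(a.get_text())
    return names


def all_link_texts(html):
    """Search the whole document: this also picks up unwanted links."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.find_all("a")]


if __name__ == "__main__":
    print(song_names(SAMPLE_HTML))
    print(all_link_texts(SAMPLE_HTML))
```

On this sample, the row-scoped version returns only the two song names, while the document-wide version also returns the artist links, which matches the "other strings I don't need" symptom in the question.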

Answer 2

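You should be a little more specific in your search, then loop over the table rows: grab the specific table by its CSS class, loop over the tr elements except the first one using slicing, and take all the text from the first td: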

table = soup.find('table', class_='data-table')
for row in table.find_all('tr')[1:]:
    print ''.join(row.find('td').stripped_strings)

Instead of slicing off the first row, you could skip the thead by testing for it:

for row in table.find_all('tr'):
    if row.parent.name == 'thead':
        continue
    print ''.join(row.find('td').stripped_strings)

It would have been better if the page had used proper <tbody> tags. :-)
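For comparison, here is a rough sketch of what the loop could look like if the page did wrap its data rows in <tbody>. It uses Python 3 syntax, and the markup is hypothetical: the class name data-table comes from the code above, but the <thead>/<tbody> split is an assumption about a better-structured page, not what the real page serves. The function name is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-structured version of the playlist table (an assumption).
TABLE_HTML = """
<table class="data-table">
  <thead><tr><th>Song</th></tr></thead>
  <tbody>
    <tr><td><a href="/s/1">Song A</a></td></tr>
    <tr><td><a href="/s/2">Song B</a></td></tr>
  </tbody>
</table>
"""


def titles_from_tbody(html):
    """Return the text of the first td of each row inside tbody."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data-table")
    body = table.find("tbody")
    # Only data rows live in tbody, so no slicing or thead test is needed.
    return ["".join(row.find("td").stripped_strings)
            for row in body.find_all("tr")]


if __name__ == "__main__":
    print(titles_from_tbody(TABLE_HTML))
```

With this structure the header row never enters the loop, so neither the [1:] slice nor the row.parent.name check is required.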
