Getting text from a link inside a td with BeautifulSoup in Python 2.7

Time: 2016-05-13 09:00:51

Tags: python python-2.7 beautifulsoup

I am trying to get a list of all location names by scraping with BeautifulSoup. I used to use the following:

locs = LOOPED.findAll("td", {"class": "max use"})

This worked with the old HTML:

<td class="max use" style="">London</td>

However, the HTML has changed to the following, and the code no longer returns London:

<td class="max use" style=""> <div class="notranslate"> <span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span> </div> </td>

Edit: if I print locs, I get this list:

[<td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/liverpool/">Liverpool</a></span> <span class="extra hidden">(NA)</span>\n</div>\n</td>]

You can see that there are 3 different locations in there. From the above, I would expect to see this list:

[London, Manchester, Liverpool]

I think I should be using something like:

locs = LOOPED.findAll("td", {"class": "max use"})
locs = locs.findAll('a')[1]
print locs.text

But this just returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

I can't work out how to get Python to search again for the hyperlink text...

2 answers:

Answer 0: (score: 2)

Try this:

tag = LOOPED.findAll('td')  # every <td> tag, returned as a list-like ResultSet
tag_a = tag[0].find('a')    # first <a> inside the first <td>
print tag_a.text            # link text of that first cell only
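
The snippet above prints only the first location. To build the full list the question asks for, each matching cell can be visited in turn. The following is a minimal sketch, assuming the bs4 package is installed; the `html` string is a hypothetical stand-in modelled on the cells shown in the question, and `names` is an illustrative variable name, not something from the original post.

```python
from __future__ import print_function  # lets print() work on Python 2.7 as well

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the cells shown in the question.
html = """
<table><tr>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span>
<span class="extra hidden">(DEFAULT)</span></div></td>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span>
<span class="extra hidden">(DEFAULT)</span></div></td>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/liverpool/">Liverpool</a></span>
<span class="extra hidden">(NA)</span></div></td>
</tr></table>
"""

LOOPED = BeautifulSoup(html, "html.parser")

# findAll returns a ResultSet (a list of tags). find/findAll must be called
# on each tag in it, not on the ResultSet itself, which is what raised the
# AttributeError in the question.
cells = LOOPED.findAll("td", {"class": "max use"})

# Take the text of the first <a> in each cell, skipping cells without a link.
names = [cell.find("a").text for cell in cells if cell.find("a") is not None]

print(names)
```

The `is not None` guard is a design choice for pages where some cells may lack a link; without it, `.text` would fail on such a cell.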

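Answer 1: (score: 1)

An approach that is more robust to future changes in the HTML structure is to get all of the text within each td element, as described in this answer: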

locs = LOOPED.findAll("td", {"class": "max use"})
for loc in locs:
    print ''.join(loc.findAll(text=True))
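
Note that joining every text node also keeps the surrounding whitespace and the contents of the second span, such as (DEFAULT). If that is not wanted, the `stripped_strings` generator in bs4 is one way to tidy the result. This is a sketch under the same assumptions as the question's markup; the `html` string and the names `all_text` and `first_strings` are hypothetical.

```python
from __future__ import print_function  # lets print() work on Python 2.7 as well

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the cells shown in the question.
html = """
<table><tr>
<td class="max use" style=""><div class="notranslate">
<span><a href="/location/uk/gb/london/">London</a></span>
<span class="extra hidden">(DEFAULT)</span></div></td>
<td class="max use" style=""><div class="notranslate">
<span><a href="/location/uk/gb/manchester/">Manchester</a></span>
<span class="extra hidden">(DEFAULT)</span></div></td>
<td class="max use" style=""><div class="notranslate">
<span><a href="/location/uk/gb/liverpool/">Liverpool</a></span>
<span class="extra hidden">(NA)</span></div></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
locs = soup.findAll("td", {"class": "max use"})

all_text = []       # every piece of text in the cell, whitespace-trimmed
first_strings = []  # only the first non-empty piece of text in the cell

for loc in locs:
    # stripped_strings yields each text node with leading/trailing whitespace
    # removed and skips nodes that are whitespace only.
    parts = list(loc.stripped_strings)
    all_text.append(" ".join(parts))
    if parts:
        first_strings.append(parts[0])

print(all_text)
print(first_strings)
```

Taking the first non-empty string relies on the location name coming first in each cell, which holds for the markup in the question but is an assumption about the page.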