Question

I want to get the string out of this HTML, but whatever I try does not work (string = None):

      <a href="/analyze/default/index/49398962/1/34925733" target="_blank">
       <img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
       Jue VioIe Grace
      </a>

There are several of these on the page, and I tried this:

print([a.string for a in soup.findAll('td', class_='tou')])

The output is just None.

Edit: here is the HTML of the whole page, I hope it helps. To clarify, I need to find all instances like the one above and extract their strings.

http://pastebin.com/4mvcMsJu

Answer 1

You need to select the a from the parent td and call .text; the text lives inside the anchor, which is a child of the td:

print([td.a.text for td in soup.find_all('td', class_='tou')])

There is obviously a td with class tou, otherwise you would not be getting a list containing None:

In [9]: from bs4 import BeautifulSoup

In [10]: html = """<td class='tou'>
          <a href="/analyze/default/index/49398962/1/34925733" target="_blank">
       <img alt="" class="ajax-tooltip shadow radius lazy" data-id="acctInfo:34925733_1" data-original="/upload/profileIconId/default.jpg" src="/images/common/transbg.png"/>
       Jue VioIe Grace
      </a>
      </td>"""

In [11]: soup = BeautifulSoup(html,"html.parser")

In [12]: [a.string for a in soup.find_all('td', class_='tou')]
Out[12]: [None]

In [13]: [td.a.text for td in soup.find_all('td', class_='tou')]
Out[13]: [u'\n\n       Jue VioIe Grace\n      ']

You can also call .text on the td:

In [14]: [td.text for td in soup.find_all('td', class_='tou')]
Out[14]: [u'\n\n\n       Jue VioIe Grace\n      \n']

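But that may give you more than you want, because td.text includes the text of every descendant of the cell, not just the anchor.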
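To make the "more than you want" case concrete, here is a minimal sketch. The extra `<span>` in the cell is invented purely for illustration (it is not in the pastebin page), and `get_text(" ", strip=True)` is used instead of `.text` only to keep the output tidy.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: the <span> is an assumption added for illustration.
html = (
    '<td class="tou">'
    '<a href="/x"><img src="/y.png"/> Jue VioIe Grace </a>'
    '<span>Rank 1</span>'
    '</td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="tou")

# Text of the anchor only.
anchor_text = td.a.get_text(strip=True)

# Text of the whole cell: every descendant string, joined with a space.
cell_text = td.get_text(" ", strip=True)

print(anchor_text)
print(cell_text)
```

Here the anchor text is just the name, while the cell text also picks up the contents of the sibling span.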

Using the full HTML from pastebin:

In [18]: import requests

In [19]: soup = BeautifulSoup(requests.get("http://pastebin.com/raw/4mvcMsJu").content,"html.parser")

In [20]: [td.a.text.strip() for td in soup.find_all('td', class_='tou')]
Out[20]: 
 [u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']
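Note that td.a is None for any matching cell without an anchor, so td.a.text would raise AttributeError there. A rough sketch of two ways to guard against that follows; the table markup is made up for illustration, and `select` relies on the CSS selector support that ships with current Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Made-up markup: one normal cell, one cell without a link, one unrelated cell.
html = """
<table><tr>
  <td class="tou"><a href="/a"><img src="/i.png"/> Cris </a></td>
  <td class="tou">no link here</td>
  <td class="other"><a href="/b">ignored</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: a CSS selector that only matches anchors inside td.tou cells.
names_css = [a.get_text(strip=True) for a in soup.select("td.tou a")]

# Option 2: keep find_all, but skip cells where td.a is None.
names_guarded = [
    td.a.get_text(strip=True)
    for td in soup.find_all("td", class_="tou")
    if td.a is not None
]

print(names_css)
print(names_guarded)
```

Both variants skip the cell without a link instead of failing on it.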

With the pastebin HTML, td.text.strip() gives you the same output:

In [23]: [td.text.strip() for td in soup.find_all('td', class_='tou')]
Out[23]: 
[u'KElTHMCBRlEF',
 u'game 5 loser',
 u'Cris',
 u'interestingstare',
 u'ApoIlo Price',
 u'Zary',
 u'Adrian Ma',
 u'Liquid Inori',
 u'focus plz',
 u'Shiphtur',
 u'Cody Sun',
 u'ApoIIo Price',
 u'Pobelter',
 u'Jue VioIe Grace',
 u'Valkrin',
 u'Piggy Kitten',
 u'1 and 17',
 u'BLOCK IT',
 u'JiaQQ1035716423',
 u'Twitchtv Flaresz']

But you should understand that the two are different. You should also understand the difference between .string and .text.
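As a small illustration of that difference: .string only returns something when a tag has exactly one child (following single-child tags downward), whereas .text and get_text() concatenate every descendant string. The two cells below are simplified stand-ins for the real page.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    # First cell: the anchor has two children (an <img> and a text node).
    '<td class="tou"><a href="/x"><img src="/y.png"/>\n Jue VioIe Grace\n</a></td>'
    # Second cell: the anchor has a single text child.
    '<td class="tou"><a href="/z">Cris</a></td>',
    "html.parser",
)
first, second = soup.find_all("td", class_="tou")

print(first.string)   # None, because <a> has more than one child
print(second.string)  # Cris, because td -> a -> text is a single-child chain
print(first.get_text(strip=True))  # Jue VioIe Grace
```

This is why the original a.string call returned None: the img inside the anchor makes .string ambiguous, so Beautiful Soup gives up rather than guessing.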

Answer 2

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('input.html'), 'lxml')
>>> [tag.text.strip() for tag in soup]
[u'Jue VioIe Grace']

If we want to limit the search to text inside anchor tags:

>>> [tag.text.strip() for tag in soup.findAll('a')]
[u'Jue VioIe Grace']

Note that the sample input contains no td tags, and no tag in it carries class='tou', so a search with class_='tou' matches nothing there.

Answer 3

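Well, if your soup variable was built from that piece of HTML, then the output you get is None because there is no td element in it, and certainly no td element with class=tou.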

Now, if you want to get that text, you can call soup.findAll(text=True), which outputs something like ['\n', '\n Jue VioIe Grace\n ']
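A short sketch of the same idea, assuming a reasonably recent Beautiful Soup in which `string=True` is the current spelling of `text=True`. The snippet is a trimmed copy of the anchor from the question, and `stripped_strings` is shown as one way to drop the whitespace-only entries.

```python
from bs4 import BeautifulSoup

# Trimmed version of the anchor from the question.
html = '<a href="/x" target="_blank">\n <img src="/y.png"/>\n Jue VioIe Grace\n</a>'
soup = BeautifulSoup(html, "html.parser")

# Every text node, whitespace included.
raw = soup.find_all(string=True)

# Same nodes, stripped, with whitespace-only ones skipped.
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```

The raw list still contains the whitespace around the img tag, while the cleaned list holds only the name.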

Can't find the string after a tag using BeautifulSoup in Python?
