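Finding specific link text with bs4

Time: 2014-05-20 14:23:52

Tags: python html web-scraping beautifulsoup

I am trying to scrape a website and find the titles of all the feeds. I cannot get the text of the <a> tags I need. Here is a sample of the HTML.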

<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a>

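I am trying to get the text of every <a> tag whose id is c followed by a number (c1, c2, and so on), and print each one on a new line.

My output should look like this.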

TF4 - Oreos
Awesome Game Boy Facts

So far, I have tried this.

soup = bs4.BeautifulSoup(html)
links = soup.find_all('a',{'id' : 'c'})
for link in links:
    print(link.text)

But it does not find or print anything.

3 Answers:

Answer 0 (score: 3)

You can pass a regular expression instead of the attribute value:

links = soup.find_all('a', {'id': re.compile(r'^c\d+')})

^ matches the start of the string, and \d+ matches one or more digits.

Demo:
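To see what this pattern accepts on its own, here is a small sketch that uses only the standard `re` module. The `sample_ids` list is an assumption made for illustration: c1, c2, x1 and b1 come from the question's HTML, the rest are invented. The sketch uses `search()`, which is how the Beautiful Soup documentation describes regular expression filters being applied.

```python
import re

# c1, c2, x1, b1 appear in the question's HTML; the other values are
# invented here to show what the pattern rejects.
sample_ids = ["c1", "c2", "x1", "b1", "c", "content", "abc3"]

pattern = re.compile(r"^c\d+")

# With the ^ anchor, search() only succeeds when the match starts at index 0.
matching = [value for value in sample_ids if pattern.search(value)]

print(matching)  # ['c1', 'c2']
```

A bare "c" is rejected because \d+ needs at least one digit, and "abc3" is rejected because the c is not at the start.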

>>> import re
>>> from bs4 import BeautifulSoup
>>> 
>>> html = """
... <tr>
...     <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a></td>
...     <td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a></td>
... </tr>
... """
>>> soup = BeautifulSoup(html, 'html.parser')
>>> links = soup.find_all('a', {'id': re.compile(r'^c\d+')})
>>> for link in links:
...     print(link.text)
... 
TF4 - Oreos
Awesome Game Boy Facts

Answer 1 (score: 2)

There is no <a> tag whose id attribute is c; the ids are c1 and c2:

links = soup.find_all('a',{'id' : 'c1'})
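The difference between the two lookups can be checked with a short sketch. It assumes a trimmed-down copy of the question's markup (the onClick attributes are omitted and a wrapping table is added so the fragment is well formed) and the standard library's `html.parser` backend; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Simplified version of the question's HTML: same ids and link text,
# onClick handlers dropped for brevity.
html = """
<table><tr>
<td class="m" id="b1"><a href="/QSYcfT" id="c1">TF4 - Oreos</a> <a href="#" id="x1">(0)</a></td>
<td class="m" id="b2"><a href="/zXHNvp" id="c2">Awesome Game Boy Facts</a> <a href="#" id="x2">(0)</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string filter is compared against the whole attribute value.
exact_c = soup.find_all("a", {"id": "c"})
exact_c1 = soup.find_all("a", {"id": "c1"})

print(len(exact_c))                # 0
print([a.text for a in exact_c1])  # ['TF4 - Oreos']
```

Because no tag has an id that is exactly "c", the first call returns an empty result, which is why the original loop printed nothing.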

If you want to find all <a> tags whose id starts with c, you need to pass a regular expression:

import re

# find_all is the current name; findAll is a legacy alias for it
links = soup.find_all('a', {'id': re.compile('^c')})
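Note that `^c` is looser than a pattern that also requires digits: it accepts any id beginning with c. The sketch below compares the two with the standard `re` module. The id "comments" is invented for illustration, and whether the difference matters depends on which other ids the real page contains.

```python
import re

# Only c1, c2 and x1 come from the question; "comments" is a made-up id.
sample_ids = ["c1", "c2", "comments", "x1"]

loose = re.compile(r"^c")       # any id starting with "c"
strict = re.compile(r"^c\d+$")  # "c" followed only by digits

loose_hits = [v for v in sample_ids if loose.search(v)]
strict_hits = [v for v in sample_ids if strict.search(v)]

print(loose_hits)   # ['c1', 'c2', 'comments']
print(strict_hits)  # ['c1', 'c2']
```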

Answer 2 (score: 2)

You can pass a regular expression object in the call to find_all():

import re
import bs4

html = '''
<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', {'id': re.compile('^c')}):
    # join every text node inside the tag into one string
    print(''.join(link.find_all(text=True)))

Output:

TF4 - Oreos
Awesome Game Boy Facts