I expected this code:
html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
td_node = th_node.find('td')
fruits = td_node.find_all('a')
for f in fruits:
    print f['color'], " ", f.text
to print:
yellow banana
green kiwi
orange Persimmon
What am I doing wrong?
Answer 0 (score: 2)
It goes wrong because of this part:
th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
From this answer:
You need to use a hybrid approach, because text= fails when an element has both child elements and text.
For example:
>>> a = '<th scope="row">foo</th>'
>>> b = '<th scope="row">foo<td>bar</td></th>'
>>> BeautifulSoup(a, "html.parser").find('th', {'scope': 'row'}, text='foo')
<th scope="row">foo</th>
>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foo')
>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foobar')
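As you can see, the text= match fails (find() returns None) as soon as the th tag contains a td tag. So we need the following (the idea also comes from that answer):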
html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
'th', {'scope': 'row'}) if reg.search(e.text)][0]
print th_node
Output:
<th scope="row">Fruits<br/>
<i><a href="#Fruits">Buy</a></i></th>
Right, this is not yet what you want, because the td tag is not inside the th tag. So now we can use the tag.find_next() method like this:
html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
'th', {'scope': 'row'}) if reg.search(e.text)][0]
td_node = th_node.find_next('td')
fruits = td_node.find_all('a')
for f in fruits:
    print f['color'], " ", f.text
Output:
yellow Banana
green Kiwi
orange Persimmon
And then we're done!
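For readers on Python 3, the steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name fruit_links is made up here, "html.parser" is passed explicitly as the parser, and the markup is the snippet from the question.

```python
import re
from bs4 import BeautifulSoup

# Markup copied from the question (note the th and td are siblings).
HTML = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""


def fruit_links(html):
    """Return (color, name) pairs for the links in the td after the 'Fruits' th.

    Hypothetical helper, named only for this example.
    """
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"^Fruits")
    # Filter on the full text of each th instead of relying on text=.
    headers = [th for th in soup.find_all("th", {"scope": "row"})
               if pattern.search(th.get_text())]
    if not headers:
        return []
    # The td is not a child of the th, so look forward in the document.
    td = headers[0].find_next("td")
    if td is None:
        return []
    return [(a["color"], a.get_text()) for a in td.find_all("a")]


for color, name in fruit_links(HTML):
    print(color, name)
```

Returning an empty list when no matching th or td exists avoids the IndexError that the [0] lookup would raise on other pages.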
Answer 1 (score: 0)
If you need to check both the attrs and the node's text, you can either use just a lambda (simple) or combine attrs with text:
html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
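th_node = soup.find('th', { 'scope' : 'row' })
# OR, to check the text as well, pass the lambda as the only filter
# (the third positional argument of find() is recursive, not a text filter):
# th_node = soup.find(lambda x: x.name == 'th' and x.get('scope') == 'row' and x.text.startswith('Fruits'))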
td_node = th_node.findNext('td')
fruits = td_node.find_all('a')
for f in fruits:
    print f['color'], " ", f.text
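The lambda-only variant can be sketched in Python 3 as below. The function name is_fruits_header is invented for this example, and "html.parser" is an assumed parser choice; find() accepts such a function and calls it with each tag.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
HTML = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""


def is_fruits_header(tag):
    # Hypothetical filter: checks tag name, attribute and nested text at once.
    return (tag.name == "th"
            and tag.get("scope") == "row"
            and tag.get_text().startswith("Fruits"))


soup = BeautifulSoup(HTML, "html.parser")
th_node = soup.find(is_fruits_header)
td_node = th_node.find_next("td")
colors = [a["color"] for a in td_node.find_all("a")]
print(colors)
```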
Answer 2 (score: 0)
You need to add a class to the a (href) elements. The corrected source code is as follows:
from bs4 import BeautifulSoup
html = ""
html += "<table><th scope='row'>Fruits<br /><i><a href='#Fruits'>Buy</a></i></th>"
html += "<tr><td><a class='fruits' href='banana.html' color='yellow'>Banana</a><br/>"
html += "<a class='fruits' href='kiwi.html' color='green'>Kiwi</a><br/>"
html += "<a class='fruits' href='Persimmon' color='orange'>Persimmon</a><br/>"
html += "</tr></table>"
soup = BeautifulSoup(html,"html.parser")
for link in soup.findAll('a',{'class':'fruits'}):
    col = link.get('color')
    name = link.string
    print(col + " " + name)
Answer 3 (score: 0)
The reason it doesn't work is what BeautifulSoup compares your regular expression against:
>>> def f(s):
...     print "comparing", s
...
>>> soup.find("th", text=f)
comparing None
None
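The same effect can be seen without a filter function. The Python 3 sketch below uses two made-up one-line snippets and assumes the "html.parser" parser: a tag's .string is only set when the tag has a single child, whereas get_text() joins the text of all descendants.

```python
from bs4 import BeautifulSoup

# Two illustrative snippets (not from the question's page).
simple = BeautifulSoup('<th scope="row">foo</th>', "html.parser").th
nested = BeautifulSoup('<th scope="row">foo<i>bar</i></th>', "html.parser").th

print(simple.string)      # the single text child
print(nested.string)      # None, because th has more than one child
print(nested.get_text())  # text of all descendants, joined
```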