Is there a way to extract specific <li> elements from an HTML page based on the words they contain?
For example, take this page: https://en.wikipedia.org/wiki/1916
I fetch the page's HTML in Python like this:
import urllib2  # Python 2 standard library

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('https://en.wikipedia.org/wiki/1916')
html = infile.read()
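What I want is to get every <li> that contains a given word. If I search for 'Verdun', I want every <li> that contains it, together with its contents, for example: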
<li><a href="/wiki/February_21" title="February 21">February 21</a> – WWI: The <a href="/wiki/Battle_of_Verdun" title="Battle of Verdun">Battle of Verdun</a> begins in <a href="/wiki/French_Third_Republic" title="French Third Republic">France</a>.</li>
Answer 0 (score: 1)
You can do it like this:
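from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Compare against the element's text, lowercased; `'verdun' in i` on the Tag itself only tests its direct children
print([i for i in soup.select('li') if 'verdun' in i.get_text().lower()])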
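
To try the idea without downloading the page, here is a minimal sketch that runs against a small inline sample. The `SAMPLE_HTML` string (a simplified stand-in for the Wikipedia markup) and the helper name `li_containing` are assumptions made for illustration; only `BeautifulSoup`, `find_all`, and `get_text` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified sample standing in for the downloaded page.
SAMPLE_HTML = """
<ul>
<li><a href="/wiki/February_21">February 21</a> - WWI: The
<a href="/wiki/Battle_of_Verdun">Battle of Verdun</a> begins in France.</li>
<li><a href="/wiki/April_24">April 24</a> - The Easter Rising begins in Dublin.</li>
</ul>
"""


def li_containing(html, word):
    """Return every <li> Tag whose text contains `word`, ignoring case."""
    soup = BeautifulSoup(html, "html.parser")
    needle = word.lower()
    # get_text() joins the text of all descendants, so words inside
    # nested <a> tags are searched as well.
    return [li for li in soup.find_all("li") if needle in li.get_text().lower()]


matches = li_containing(SAMPLE_HTML, "Verdun")
for li in matches:
    print(li)  # prints the whole <li> element, markup included
```

Printing the Tag (rather than its text) gives back the full `<li>...</li>` markup, which is what the question asks for.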
Answer 1 (score: 1)
BeautifulSoup lets you search by partial text. Just do the following:
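import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('verdun', re.IGNORECASE)  # the page spells it 'Verdun'
# A text= filter only checks a tag's .string, which is None for an <li> with nested tags,
# so match against the full get_text() of each <li> instead
lis = soup.find_all(lambda tag: tag.name == 'li' and pattern.search(tag.get_text()))
# lis is now a ResultSet (list-like) of every <li> whose text contains 'verdun'
for li in lis:
    print(li.get_text())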
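
One caveat worth checking when filtering <li> elements by text: as far as I know, a text filter combined with a tag name is compared against each tag's `.string`, and `.string` is `None` when the tag has more than one child. The sketch below compares the two approaches on a made-up sample (`SAMPLE_HTML` is an assumption, not the real page), and uses the `string=` keyword, which newer BeautifulSoup versions use as the name for the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: one plain-text <li> and one <li> with nested links.
SAMPLE_HTML = (
    "<ul>"
    "<li>Battle of Verdun begins.</li>"
    '<li><a href="/wiki/February_21">February 21</a> - WWI: The '
    '<a href="/wiki/Battle_of_Verdun">Battle of Verdun</a> begins in France.</li>'
    "</ul>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile("verdun", re.IGNORECASE)

# Text filter: tested against tag.string, so an <li> with several
# children (whose .string is None) is skipped.
by_string = soup.find_all("li", string=pattern)

# Function filter: tested against the concatenated text of the whole <li>.
by_text = soup.find_all(
    lambda tag: tag.name == "li" and pattern.search(tag.get_text()) is not None
)

print(len(by_string))
print(len(by_text))
```

On this sample the text filter should find only the plain-text item, while the function filter also picks up the item whose text is split across links, which is the shape of the entries on the Wikipedia page.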