Question

I want to use BeautifulSoup to extract strings from HTML that match my text exactly. Instead, I also get strings that only partially match it. My code is:

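from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2
import re

url = "http://www.somesite.com"
page = urlopen(url)
soup = BeautifulSoup(page, "lxml")
# string= was named text= before Beautiful Soup 4.4.0
for elem in soup(string=re.compile("exact text")):
    print(elem)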

The code above produces the following output:

1.exact text
2.almost exact text

How can I get only the exact match with BeautifulSoup? Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.

Answer 1

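You can search the soup for the elements you want by tag name and any attribute values.

For example, the code below searches for all a elements whose id equals some_id_value.

It then loops over each element found and tests whether its .text value equals "exact text".

If it does, it prints the whole element.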

for elem in soup.find_all('a', {'id':'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
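
For anyone who wants to try this without fetching a page, here is a minimal sketch of the same idea. The HTML snippet is made up for illustration (the repeated id is not valid HTML, it just keeps the example short), html.parser is used only to avoid the lxml dependency, and get_text(strip=True) is my own addition to tolerate stray whitespace around the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = """
<a id="some_id_value">exact text</a>
<a id="some_id_value">almost exact text</a>
<a id="other_id">exact text</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow by tag and attribute first, then compare the text for equality.
matches = [
    elem
    for elem in soup.find_all("a", {"id": "some_id_value"})
    if elem.get_text(strip=True) == "exact text"
]

for elem in matches:
    print(elem)
```

Note that this compares the text of tags, so it will not return Comment objects, which is what the question asks for.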

Answer 2

Use BeautifulSoup's find_all method with its string argument.

As an example, here I parse a small Wikipedia page about a place in Jamaica. I look for all strings whose text is 'Jamaica stubs', though I expect to find only one. When I find it, I display the text and its parent.

>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for item in soup.find_all(string="Jamaica stubs"):
...     item
...     item.findParent()
... 
'Jamaica stubs'
<a href="/wiki/Category:Jamaica_stubs" title="Category:Jamaica stubs">Jamaica stubs</a>

On second thought, after reading the comment, a better way would be:

>>> url = 'https://en.wikipedia.org/wiki/Hockey'
>>> from bs4 import BeautifulSoup
>>> import requests
>>> import re
>>> page = requests.get(url).text
>>> soup = BeautifulSoup(page, 'lxml')
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))):
...     i, item.findParent().text[:100]
... 
(0, "Women's Bandy World Championships")
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b")
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)')
(3, "women's")

I use IGNORECASE in the regex so that both 'Women' and 'women' are found in the Wikipedia article. I use enumerate in the for loop so that I can number the displayed items, which makes them easier to read.
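
To tie this back to the question, here is a rough sketch of why the original pattern also returned "almost exact text". Beautiful Soup applies a compiled pattern with a substring search, while a plain string passed to string= has to equal the whole node. The HTML snippet and variable names below are invented for illustration, and html.parser is used only to avoid the lxml dependency. Since Comment is a subclass of NavigableString, an isinstance check can narrow the results to comments.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup, invented for this illustration.
html = """
<div>
  <!--exact text-->
  <!--almost exact text-->
  <p>exact text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches every string that merely contains the phrase.
loose = soup.find_all(string=re.compile("exact text"))

# Plain string: the node has to equal the phrase exactly.
exact = soup.find_all(string="exact text")

# Anchored pattern: exact match that still tolerates surrounding whitespace.
anchored = soup.find_all(string=re.compile(r"^\s*exact text\s*$"))

# Keep only the nodes that are HTML comments.
comments_only = [s for s in exact if isinstance(s, Comment)]

print(len(loose), len(exact), len(anchored), len(comments_only))
for s in comments_only:
    print(type(s), repr(s))
```

The anchored variant is the one to reach for when a regex is still needed, for instance to combine an exact match with re.IGNORECASE.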

Find an exact text match using BeautifulSoup
