I am trying to use Beautiful Soup with a regex to search the text of a page's links for keywords.
<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>
Here is my simple code:
raw_html = simple_get(searchString)
searchString = ...see below...
if len(str(raw_html)) != 0:
    html = BeautifulSoup(raw_html, 'html.parser')
    companyLink = html.find_all('a', string=re.compile(searchString, re.IGNORECASE))
    print(companyLink)
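Assume the link text is: Foo blah blah bar
I tried companyLink = html.find_all('a', text=re.compile(searchString, re.IGNORECASE)), which did not work.
I tried companyLink = html.find_all(string=re.compile(searchString, re.IGNORECASE)), which did not work.
I tried companyLink = html.find_all(text=re.compile(searchString, re.IGNORECASE)), which did not work.
I think all four versions of searchString should return a match. I really want to use version 3, because I know "Foo" and "Bar" but am not sure what might come between them.
Any ideas about what I am doing wrong?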
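For reference, here is a minimal sketch of what seems to be happening with the sample link above. Beautiful Soup's string= filter compares the pattern against a tag's .string, and .string is None when a tag has more than one child, as this <a> does. Without a tag name, the filter is tested against each text node separately, so a pattern spanning "Foo" and "Bar" has nothing to match. The pattern r'foo.*bar' is an assumption made for illustration, since the actual searchString values are not shown, and the onclick attribute is left out for brevity.

```python
import re
from bs4 import BeautifulSoup

# Sample link from the question (onclick attribute omitted for brevity).
sample = ('<a href="/company/05835190" title="View company">'
          '<strong>FOO</strong>blah blah<strong>BAR</strong>example</a>')

soup = BeautifulSoup(sample, 'html.parser')
link = soup.find('a')

# Hypothetical pattern: "Foo", anything, then "Bar".
pattern = re.compile(r'foo.*bar', re.IGNORECASE)

# .string is None because the <a> tag has several children.
link_string = link.string

# The text is split across four separate text nodes.
pieces = list(link.strings)

# With a tag name, string= is compared against link.string (None here).
by_tag = soup.find_all('a', string=pattern)

# Without a tag name, each text node is tested on its own.
by_string = soup.find_all(string=pattern)

print(link_string, pieces, by_tag, by_string)
```

Both find_all calls come back empty for this markup, which would be consistent with the attempts listed above.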
Answer 0 (score: 0)
OK, here is my workaround, with thanks to @WiktorStribiżew for helping crack the code:
<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>
companyName = 'Foo Bar'
searchTarget = re.compile(companyName, re.IGNORECASE)
raw_html = simple_get(searchString)
if len(str(raw_html)) != 0:
    # run BeautifulSoup on the page HTML
    html = BeautifulSoup(raw_html, 'html.parser')
    # find all links where title='View company'
    anchorTitle = html.find_all('a', title='View company')
    # convert to string, remove <strong> tags
    a_tag = str(anchorTitle).replace('<strong>', '')
    # remove </strong> tags
    a_tag = a_tag.replace('</strong>', '')
    # remove multiple spaces
    a_tag = re.sub('( ){2,}', ' ', a_tag)
    # run BeautifulSoup again on the edited string
    b_tag = BeautifulSoup(a_tag, 'html.parser')
    # find link where string = 'Foo Bar'
    anchorText = b_tag.find('a', string=searchTarget)
    print(anchorText)
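A possibly simpler variant of the same idea, offered as a sketch rather than a tested replacement: instead of stripping the <strong> tags and parsing a second time, test the pattern against each link's get_text(), which joins the text of all nested tags. The helper name links_matching, the second sample link, and the pattern r'foo.*bar' are assumptions made for illustration; they are not from the code above.

```python
import re
from bs4 import BeautifulSoup

# First link is the sample from the question (onclick omitted);
# the second is a made-up link that should not match.
sample = ('<a href="/company/05835190" title="View company">'
          '<strong>FOO</strong>blah blah<strong>BAR</strong>example</a>'
          '<a href="/company/00000000" title="View company">'
          '<strong>OTHER</strong> company</a>')


def links_matching(html_text, pattern):
    """Return 'View company' links whose combined text matches pattern."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return [a for a in soup.find_all('a', title='View company')
            # get_text() concatenates the text of every descendant node.
            if pattern.search(a.get_text())]


# Hypothetical pattern: "Foo", anything, then "Bar".
pattern = re.compile(r'foo.*bar', re.IGNORECASE)
matches = links_matching(sample, pattern)
hrefs = [a['href'] for a in matches]
print(hrefs)
```

Note that get_text() joins the pieces with no separator by default, so the sample's text becomes "FOOblah blahBARexample"; a pattern that expects a literal space between words would need get_text(' ') or a looser regex.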