Question

I am trying to use Beautiful Soup with a RegEx to search for keywords in the text of links on a page.

<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>

Here is my simple code:

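import re
from bs4 import BeautifulSoup

searchString = ...see below...
raw_html = simple_get(searchString)  # simple_get is my own helper (not shown)

if len(str(raw_html)) != 0:
    html = BeautifulSoup(raw_html, 'html.parser')
    # Find <a> tags whose string matches the pattern, ignoring case
    companyLink = html.find_all('a', string=re.compile(searchString, re.IGNORECASE))
    print(companyLink)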

Suppose the link text is: Foo blah blah bar

If searchString = "Foo" -> match
If searchString = "Bar" -> match
If searchString = "Foo(.)*Bar" -> no match
If searchString = "Foo blah blah bar" -> no match! (I even copied the text straight from the web page)

I tried companyLink = html.find_all('a', text=re.compile(searchString, re.IGNORECASE)) and it does not work.

I tried companyLink = html.find_all(string=re.compile(searchString, re.IGNORECASE)) and it does not work.

I tried companyLink = html.find_all(text=re.compile(searchString, re.IGNORECASE)) and it does not work.

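I think all four versions of searchString should return a match. I really want to use version 3, because I know "Foo" and "Bar" but am not sure what might appear between them.

Any ideas about what I am doing wrong?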
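
As a sanity check, the four patterns can be tried against the plain link text using only the standard `re` module. This is a minimal sketch that assumes the link text is exactly "Foo blah blah bar", as described above; the names `link_text`, `patterns`, and `results` are made up for illustration. If every pattern matches here, the regular expressions themselves are probably fine, and the difference is more likely in how Beautiful Soup exposes the text of an `<a>` tag that contains nested `<strong>` tags.

```python
import re

# Assumed link text, taken from the description of the anchor in the question.
link_text = "Foo blah blah bar"

# The four searchString variants from the question.
patterns = ["Foo", "Bar", "Foo(.)*Bar", "Foo blah blah bar"]

# re.search looks for the pattern anywhere in the text; IGNORECASE mirrors the question.
results = {p: bool(re.search(p, link_text, re.IGNORECASE)) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r}: {'match' if matched else 'no match'}")
```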

Answer 1

OK, here is my workaround. Thanks to @WiktorStribiżew for helping crack the code:

<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>

import re
from bs4 import BeautifulSoup

companyName = 'Foo Bar'
searchTarget = re.compile(companyName, re.IGNORECASE)
raw_html = simple_get(searchString)

if len(str(raw_html)) != 0:
    # run BeautifulSoup on the page HTML
    html = BeautifulSoup(raw_html, 'html.parser')
    # find all links where title='View company'
    anchorTitle = html.find_all('a', title='View company')
    # convert to string, remove <strong> tags
    a_tag = str(anchorTitle).replace('<strong>', '')
    # remove </strong> tags
    a_tag = a_tag.replace('</strong>', '')
    # collapse runs of multiple spaces into one
    a_tag = re.sub('( ){2,}', ' ', a_tag)
    # run BeautifulSoup again on the edited string
    b_tag = BeautifulSoup(a_tag, 'html.parser')
    # find link where string = 'Foo Bar'
    anchorText = b_tag.find('a', string=searchTarget)

    print(anchorText)
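
A possible alternative that avoids editing the HTML as a string: as far as I understand the Beautiful Soup documentation, a tag's `.string` is `None` when the tag has more than one child, which would explain why `string=` struggles with an `<a>` that mixes `<strong>` tags and plain text. The sketch below instead tests the regex against `get_text()`, which concatenates all nested text. The helper name `find_links_by_text` and the `SAMPLE_HTML` value are hypothetical, and the sample assumes there are spaces around the inner text, which the snippet above does not show.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the anchor above; the spaces around
# "blah blah" and "example" are an assumption.
SAMPLE_HTML = (
    '<a href="/company/05835190" title="View company">'
    '<strong>FOO</strong> blah blah <strong>BAR</strong> example</a>'
)


def find_links_by_text(html, pattern):
    """Return <a> tags whose combined text (nested tags included) matches pattern."""
    soup = BeautifulSoup(html, 'html.parser')
    regex = re.compile(pattern, re.IGNORECASE)
    # get_text() joins the text of all descendants, so text inside <strong> is included.
    return [a for a in soup.find_all('a') if regex.search(a.get_text())]


matches = find_links_by_text(SAMPLE_HTML, 'Foo(.)*Bar')
print(matches)
```

The same helper can be called with 'Foo', 'Bar', or the full phrase; whether it finds the link on the real page depends on the actual whitespace in the page's markup.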

Finding anchor text with RegEx and Beautiful Soup - RegEx not working
