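Finding anchor text with RegEx and Beautiful Soup: RegEx not working as expected

Asked: 2018-10-08 22:18:58

Tags: regex python-3.x beautifulsoup

I'm trying to use Beautiful Soup with a RegEx to search for keywords in the text of the links on a page.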

<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>

Here is my simple code:

import re
from bs4 import BeautifulSoup

searchString = ...see below...
raw_html = simple_get(searchString)

if len(str(raw_html)) != 0:
    html = BeautifulSoup(raw_html, 'html.parser')
    # look for <a> tags whose text matches the pattern, ignoring case
    companyLink = html.find_all('a', string=re.compile(searchString, re.IGNORECASE))
    print(companyLink)

Suppose the link text is "Foo blah blah bar":

  1. If searchString = "Foo" -> match
  2. If searchString = "Bar" -> match
  3. If searchString = "Foo(.)*Bar" -> no match
  4. If searchString = "Foo blah blah bar" -> no match! (I even copied the text straight from the web page)

I tried companyLink = html.find_all('a', text=re.compile(searchString, re.IGNORECASE)): no luck.

I tried companyLink = html.find_all(string=re.compile(searchString, re.IGNORECASE)): no luck.

I tried companyLink = html.find_all(text=re.compile(searchString, re.IGNORECASE)): no luck.

I would expect all four versions of searchString to return a match. I really want to use version 3, because I know "Foo" and "Bar" but am not sure what might sit between them.

Any ideas about what I'm doing wrong?
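A likely explanation, sketched here rather than confirmed by the asker: when a tag has more than one child, Beautiful Soup's `.string` is `None`, and `find_all('a', string=...)` tests the pattern against `.string`. The snippet below uses a trimmed copy of the sample link (the `onclick` attribute is dropped for brevity) to show what each accessor returns.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample link from the question (assumption: attributes don't matter here)
html = (
    '<a href="/company/05835190" title="View company">'
    '<strong>FOO</strong>blah blah<strong>BAR</strong>example</a>'
)
link = BeautifulSoup(html, 'html.parser').a

# The <a> tag has four children, so .string cannot pick a single one
print(link.string)          # None

# The individual text pieces are still reachable one by one
pieces = list(link.strings)
print(pieces)

# get_text() concatenates every piece into one string
print(link.get_text())      # FOOblah blahBARexample
```

A pattern that spans several pieces, such as "Foo(.)*Bar", therefore has nothing to match against unless the pieces are joined first.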

1 Answer:

Answer 0: (score: 0)

OK, here is my workaround, with thanks to @WiktorStribiżew for helping crack the code:

<a href="/company/05835190" onclick="javascript:_paq.push(['trackEvent', 'SearchSuggestions', 'SearchResult-1' ]);" title="View company"><strong>FOO</strong>blah blah<strong>BAR</strong>example</a>

import re
from bs4 import BeautifulSoup

companyName = 'Foo Bar'
searchTarget = re.compile(companyName, re.IGNORECASE)
raw_html = simple_get(searchString)

if len(str(raw_html)) != 0:
    # run BeautifulSoup on the page HTML
    html = BeautifulSoup(raw_html, 'html.parser')
    # find all links where title='View company'
    anchorTitle = html.find_all('a', title='View company')
    # convert to string, remove <strong> tags
    a_tag = str(anchorTitle).replace('<strong>', '')
    # remove </strong> tags
    a_tag = a_tag.replace('</strong>', '')
    # collapse runs of two or more spaces into one
    a_tag = re.sub('( ){2,}', ' ', a_tag)
    # run BeautifulSoup again on the edited string
    b_tag = BeautifulSoup(a_tag, 'html.parser')
    # find the link whose text matches searchTarget ('Foo Bar', case-insensitive)
    anchorText = b_tag.find('a', string=searchTarget)

    print(anchorText)
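An alternative that avoids editing the HTML string and parsing it a second time is to pass a function to `find_all` and test the regex against the link's combined text. This is only a sketch, not the approach used above: `find_links_by_text` and `SAMPLE_HTML` are hypothetical names, and the sample markup is a trimmed copy of the link from the question.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample link (assumption: onclick attribute omitted)
SAMPLE_HTML = (
    '<a href="/company/05835190" title="View company">'
    '<strong>FOO</strong>blah blah<strong>BAR</strong>example</a>'
)

def find_links_by_text(html, pattern):
    """Return the <a> tags whose combined text matches pattern, ignoring case."""
    soup = BeautifulSoup(html, 'html.parser')
    regex = re.compile(pattern, re.IGNORECASE)
    # get_text(' ', strip=True) joins the text pieces with single spaces,
    # so text split across <strong> tags can be matched as one string
    return soup.find_all(
        lambda tag: tag.name == 'a'
        and regex.search(tag.get_text(' ', strip=True)) is not None
    )

links = find_links_by_text(SAMPLE_HTML, r'Foo.*Bar')
print([link['href'] for link in links])
```

Joining with a space separator is a design choice: it lets a literal pattern such as "Foo blah blah bar" match even though the source markup has no spaces around the `<strong>` tags. Drop the separator if the exact original spacing matters.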