Question

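Using BeautifulSoup, I want to return only the "a" tags whose href string contains "Company" and not "Sector". Is there a way to use a regex inside re.compile() to return only the Companies and not the Sectors?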

Code:

soup = soup.findAll('tr')[5].findAll('a')
print(soup)

Output:

[<a class="example" href="../ref/index.htm">Example</a>,  
<a href="?Company=FB">Facebook</a>,  
<a href="?Company=XOM">Exxon</a>,  
<a href="?Sector=5">Technology</a>,  
<a href="?Sector=3">Oil & Gas</a>]

Using this approach:

import re
soup.findAll('a', re.compile("Company"))

returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

But I would like it to return (without the Sectors):

[<a href="?Company=FB">Facebook</a>, <a href="?Company=XOM">Exxon</a>]

Using:

urllib.request version: 3.5
BeautifulSoup version: 4.4.1
pandas version: 0.17.1
Python 3

Answer 1

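By using soup = soup.findAll('tr')[5].findAll('a') and then soup.findAll('a', re.compile("Company")), you are overwriting the original soup variable. findAll returns a ResultSet, which is basically a list of BeautifulSoup Tag objects and has no findAll method of its own, hence the AttributeError. Note also that a pattern passed as the second positional argument is matched against the CSS class rather than the href, so it has to be passed as href=. Try the following instead to get all the "Company" links: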

links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))

To get the text contained in these tags:

companies = [link.text for link in links]
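
As a self-contained check of the approach above, here is a minimal sketch that runs the same href filter against the links from the question. The inline HTML string and the choice of the built-in "html.parser" are assumptions made for illustration; on the real page the row would still be picked with findAll('tr')[5] as in the question.

```python
import re
from bs4 import BeautifulSoup

# Sample markup rebuilt from the question's output (an assumption; the real page has more rows).
html = """
<table><tr><td>
<a class="example" href="../ref/index.htm">Example</a>
<a href="?Company=FB">Facebook</a>
<a href="?Company=XOM">Exxon</a>
<a href="?Sector=5">Technology</a>
<a href="?Sector=3">Oil &amp; Gas</a>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the parsed document in `soup` and store results under a different name,
# so that `soup` is never replaced by a ResultSet.
links = soup.find_all("a", href=re.compile("Company"))

# Pull the visible text out of each matching tag.
companies = [link.text for link in links]

print(links)
print(companies)
```

The pattern is applied as a search against each href value, so "Company" does not need to sit at the start of the string.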

Answer 2

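Another approach is XPath, which supports AND / NOT operations for querying by attribute in an XML document. Unfortunately, BeautifulSoup itself does not handle XPath, but lxml does: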

from lxml.html import fromstring
import requests

r = requests.get("YourUrl")
tree = fromstring(r.text)
# get elements with "?Company" in the href, excluding any that also contain "Sector"
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
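
To try the XPath expression without fetching a page, the sketch below applies it to an inline HTML string. The markup is an assumption based on the question's output, and the link that mixes Company and Sector is a hypothetical addition that only exists to show what the not(...) clause excludes.

```python
from lxml.html import fromstring

# Inline sample markup (an assumption; the real page would come from requests.get).
html = (
    "<div>"
    '<a class="example" href="../ref/index.htm">Example</a>'
    '<a href="?Company=FB">Facebook</a>'
    '<a href="?Company=XOM">Exxon</a>'
    '<a href="?Sector=5">Technology</a>'
    # Hypothetical link containing both words, to exercise the not(...) clause.
    '<a href="?Company=FB&amp;Sector=5">Facebook in Technology</a>'
    "</div>"
)

tree = fromstring(html)

# Same expression as above: href contains "?Company" and does not contain "Sector".
a_tags = tree.xpath("//a[contains(@href, '?Company') and not(contains(@href, 'Sector'))]")

names = [a.text_content() for a in a_tags]
print(names)
```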

Answer 3

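You can use a CSS selector to get all the a tags whose href starts with ?Company:

from bs4 import BeautifulSoup

# html holds the page source
soup = BeautifulSoup(html, "html.parser")

# quoting the attribute value keeps the selector valid in newer BeautifulSoup releases
a = soup.select('a[href^="?Company"]')

If you only want them from the sixth tr, you can use nth-of-type in the selector.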
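
The nth-of-type variant mentioned in this answer could look like the sketch below. The six-row table is built inline purely for illustration, and the Apple link in the first row is a hypothetical addition that shows links outside the sixth row being skipped.

```python
from bs4 import BeautifulSoup

# First row carries a hypothetical Company link that should NOT be returned.
first_row = '<tr><td><a href="?Company=AAPL">Apple</a></td></tr>'
# Rows two to five are plain filler.
filler_rows = "".join("<tr><td>row %d</td></tr>" % i for i in range(2, 6))
# Sixth row holds the links from the question.
link_row = (
    "<tr><td>"
    '<a class="example" href="../ref/index.htm">Example</a>'
    '<a href="?Company=FB">Facebook</a>'
    '<a href="?Company=XOM">Exxon</a>'
    '<a href="?Sector=5">Technology</a>'
    "</td></tr>"
)
html = "<table>" + first_row + filler_rows + link_row + "</table>"

soup = BeautifulSoup(html, "html.parser")

# Restrict the match to the sixth <tr> among its siblings, then filter on the href prefix.
sixth_row_links = soup.select('tr:nth-of-type(6) a[href^="?Company"]')

print(sixth_row_links)
```

Note that nth-of-type counts tr elements among siblings under the same parent, whereas findAll('tr')[5] counts every tr in document order, so the two can pick different rows on pages with several or nested tables.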

Answer 4

Thanks for the answers above, @Padriac Cunningham and @Wyatt I!! Here is the less elegant solution I came up with:

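import re

# soup here is the ResultSet of <a> tags from the question
for link in soup:
    # iterate over every tag (the original range(1, ...) skipped the first one)
    if re.search("Company", str(link)):
        print(link)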
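
One caveat with searching the whole tag string is that it also matches "Company" in the link text. A possible variant, sketched here under the assumption that only the href should decide, tests the attribute directly. The sample markup and the "Holding Company" link are hypothetical and only serve to show the difference.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption based on the question's output).
html = (
    '<a class="example" href="../ref/index.htm">Example</a>'
    '<a href="?Company=FB">Facebook</a>'
    '<a href="?Company=XOM">Exxon</a>'
    '<a href="?Sector=5">Technology</a>'
    # Hypothetical link whose text, but not href, contains "Company".
    '<a href="?Sector=9">Holding Company</a>'
)

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a")

# Look only at the href attribute; tags without an href fall back to "".
company_links = [a for a in anchors if "Company" in a.get("href", "")]

for link in company_links:
    print(link)
```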

Extracting 'a' tags that contain a specific substring using Python's BeautifulSoup
