Question

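I am trying to find all meta tags that contain an author. It works if I use a specific key with a regex value, but it does not work when both the key and the value are regular expressions. Is it possible to extract every meta tag on a page that contains the "author" keyword? Here is the code I wrote.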

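import re
import requests
from bs4 import BeautifulSoup

page = requests.get(url)  # url is the address of the page being scraped
contents = page.content
soup = BeautifulSoup(contents, 'lxml')
# This is the call that does not work: both the key and the value are regexes
preys = soup.find_all("meta", attrs={re.compile('.*'): re.compile('author')})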

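Edit: To clarify, the problem I am trying to solve is detecting the value "author" when it is mapped to any key. As I have seen in various examples, the key can be "itemprop", "name", or even "property". Basically, I want all meta tags that have author as a value, regardless of which key holds that value. A few example cases: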

<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>

Answer 1

This should do it. Unfortunately, I could not find a web page with an author meta tag to show that this code works. Please let me know if you find an error.

>>> import requests
>>> import bs4
>>> page = requests.get('http://reference.sitepoint.com/html/meta').text
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name')]
['robots', 'description']
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name') and item.attrs['name'].lower()=='author']
[]

Edit: This also works on Jan's chunk of HTML. His syntax is better; use that.

>>> html = '<meta name="author" content="Anna Lyse"> <meta name="date" content="2010-05-15T08:49:37+02:00">'
>>> soup = bs4.BeautifulSoup(html, 'lxml')
>>> [item.attrs['name'] for item in soup('meta') if item.has_attr('name') and item.attrs['name'].lower()=='author']
['author']
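
Note that the check above only looks at the name attribute and only accepts an exact match. A minimal sketch of the same idea applied to every attribute, so that itemprop, property and citation_author are also caught, might look like this. The helper name has_author_value is made up for illustration, the substring test and the decision to skip the content attribute are assumptions about what counts as a match, and html.parser is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

html = '''
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
<meta name="date" content="2010-05-15T08:49:37+02:00">
'''

def has_author_value(tag):
    """Hypothetical filter: True for <meta> tags where any attribute
    other than content has a value containing 'author'."""
    if tag.name != "meta":
        return False
    for key, value in tag.attrs.items():
        if key == "content":
            # Skip the payload itself so text such as "Author guidelines"
            # does not count as a match (an assumption, adjust as needed)
            continue
        # Multi-valued attributes (rel, class, ...) are returned as lists
        values = value if isinstance(value, list) else [value]
        if any("author" in v.lower() for v in values):
            return True
    return False

soup = BeautifulSoup(html, "html.parser")

# find_all accepts a function and calls it once per tag
author_tags = soup.find_all(has_author_value)
authors = [tag.get("content") for tag in author_tags]
print(authors)
```

Passing a function to find_all sidesteps the attrs dictionary, whose keys are treated as literal attribute names.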

Answer 2

If you are looking for citation_author or author, you could use a combination of soup.select() and a regular expression:

from bs4 import BeautifulSoup
import re

# some test string
html = '''
<meta name="author" content="Anna Lyse">
<meta name="date" content="2010-05-15T08:49:37+02:00">
<meta itemprop="author" content="2010-05-15T08:49:37+02:00">
<meta rel="author" content="2010-05-15T08:49:37+02:00">
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
'''

soup = BeautifulSoup(html, 'html5lib')

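# Match a quoted attribute value that is either "author" or "citation_author"
rx = re.compile(r'"(?:citation_)?author"')

# Serialize each <meta> tag back to a string and search it with the regex
authors = [author
           for author in soup.select("meta")
           if rx.search(str(author))]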

print(authors)
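
If the set of possible keys is known in advance, the regex can probably be dropped in favor of CSS attribute selectors. The following is only a sketch under that assumption: the list author_keys is taken from the examples in the question and is not exhaustive, and html.parser is used so the snippet runs without html5lib. In recent BeautifulSoup releases select() is backed by the soupsieve package, which supports the [attr*="value"] substring selector.

```python
from bs4 import BeautifulSoup

html = '''
<meta name="date" content="2010-05-15T08:49:37+02:00">
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
'''

soup = BeautifulSoup(html, "html.parser")

# Assumed list of attribute names that may carry the "author" marker
author_keys = ["name", "itemprop", "property"]

# [attr*="author"] matches when the attribute value contains the substring;
# the comma joins the alternatives into one selector group
selector = ", ".join(f'meta[{key}*="author"]' for key in author_keys)

matches = soup.select(selector)
contents = [tag["content"] for tag in matches]
print(contents)
```

Unlike the regex over str(author), this only inspects the listed attributes, so a tag whose content happens to contain the text "author" in quotes is not picked up by accident. The trade-off is that a key missing from author_keys, such as rel, is silently ignored.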

BeautifulSoup: finding a specific value in meta tags
