Suppose my content includes HTML with meta tags, such as
<meta property="og:country-name" content="South Africa"/>
The problem is that I need to extract the country name from the HTML markup of the whole page.
import urllib2
from bs4 import BeautifulSoup as BS
url ="mydomain.com"
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
soup = BS(data)
print soup.findAll(...
I can't figure out what the next step should be. Any suggestions?
Answer 0 (score: 2)
Search for a <meta> tag with the specific attributes:
country_meta = soup.find('meta', attrs={'property': 'og:country-name', 'content': True})
if country_meta:
    country = country_meta['content']
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head>
... <meta property="og:country-name" content="South Africa"/>
... <title>Foo</title>
... </head><body></body></html>''')
>>> country_meta = soup.find('meta', attrs={'property': 'og:country-name', 'content': True})
>>> country_meta
<meta content="South Africa" property="og:country-name"/>
>>> print country_meta['content']
South Africa
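On Python 3, where urllib2 and the print statement are no longer available, the same lookup could be wrapped in a small helper. This is only a minimal sketch: the name get_meta_content and the sample HTML are made up for illustration and are not part of the original answer, and it assumes the built-in "html.parser" backend is good enough for the page.

```python
from bs4 import BeautifulSoup


def get_meta_content(html, prop):
    """Return the content of the first <meta property=prop> tag, or None."""
    # Name the parser explicitly so the result does not depend on which
    # optional parsers (lxml, html5lib) happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    # content=True only matches tags that actually carry a content attribute.
    tag = soup.find("meta", attrs={"property": prop, "content": True})
    return tag["content"] if tag is not None else None


# Hypothetical sample page, mirroring the demo above.
sample_html = """
<html><head>
<meta property="og:country-name" content="South Africa"/>
<title>Foo</title>
</head><body></body></html>
"""

country = get_meta_content(sample_html, "og:country-name")
print(country)
```

For a live page, the HTML string would come from something like urllib.request.urlopen(url).read() instead of the sample string. Returning None when the tag is missing keeps the caller from hitting a TypeError on a page that lacks the property.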