I want to extract the title and description from the following website:
view-source:http://www.virginaustralia.com/au/en/bookings/flights/make-a-booking/
which contains the following source snippet:
<title>Book a Virgin Australia Flight | Virgin Australia
</title>
<meta name="keywords" content="" />
<meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />
I want the title and the meta description content.
I used Goose, but it does not extract them properly. Here is my code:
website_title = [g.extract(url).title for url in clean_url_data]
and
website_meta_description=[g.extract(urlw).meta_description for urlw in clean_url_data]
The result is empty.
Answer 0 (score: 10)
Have a look at BeautifulSoup as a solution.
For the question above, you can extract the "description" information with the following code:
import requests
from bs4 import BeautifulSoup

url = 'http://www.virginaustralia.com/au/en/bookings/flights/make-a-booking/'
response = requests.get(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(response.text, 'html.parser')
metas = soup.find_all('meta')
# Collect the content attribute of every <meta name="description"> tag.
print([meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description'])
Output:
['Search for and book Virgin Australia and partner flights to Australian and international destinations.']
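The answer above only returns the description, while the question also asks for the title. As a minimal sketch (not necessarily how the answerer would do it), both can be read with BeautifulSoup in one pass. The helper name extract_title_and_description is made up for illustration, and the HTML is the snippet quoted in the question rather than a live request:

```python
from bs4 import BeautifulSoup

# The <head> fragment quoted in the question, used instead of a live request.
html = """<title>Book a Virgin Australia Flight | Virgin Australia
</title>
<meta name="keywords" content="" />
<meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />"""


def extract_title_and_description(html_text):
    """Hypothetical helper: return (title, description), None for a missing tag."""
    soup = BeautifulSoup(html_text, "html.parser")
    # soup.title is the first <title> tag, or None if the page has none.
    title = soup.title.get_text(strip=True) if soup.title else None
    # attrs= is needed because "name" is also a find() parameter.
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content") if meta else None
    return title, description


title, description = extract_title_and_description(html)
print(title)
print(description)
```

For a real page, pass requests.get(url).text to the helper instead of the literal string.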
Answer 1 (score: 0)
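import lxml.html

# html_content is assumed to already hold the page source as a string.
doc = lxml.html.document_fromstring(html_content)
title_element = doc.xpath("//title")
website_title = title_element[0].text_content().strip()
# The description is stored in the content attribute of <meta name="description">, not in the element text.
meta_description_element = doc.xpath("//meta[@name='description']")
website_meta_description = meta_description_element[0].get('content').strip()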
Answer 2 (score: 0)
import metadata_parser
page = metadata_parser.MetadataParser(url='www.xyz.com')
# This reads the Open Graph description (og:description) of the page.
metaDesc = page.metadata['og']['description']
print(metaDesc)
Answer 3 (score: 0)
You can use BeautifulSoup to achieve this.
The following should help:
# Assumes soup is an existing BeautifulSoup object for the page.
metas = soup.find_all('meta')  # Get Meta Description
for m in metas:
    if m.get('name') == 'description':
        desc = m.get('content')
        print(desc)
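If installing a third-party parser is not an option, roughly the same loop can be sketched with the standard library's html.parser module. This is a different technique from the answers above, not a drop-in replacement. The class name HeadMetadataParser is made up for illustration, and the input is again the snippet from the question:

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Hypothetical helper: collects <title> text and <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


sample_html = """<title>Book a Virgin Australia Flight | Virgin Australia
</title>
<meta name="keywords" content="" />
<meta name="description" content="Search for and book Virgin Australia and partner flights to Australian and international destinations." />"""

parser = HeadMetadataParser()
parser.feed(sample_html)
parser.close()
print(parser.title.strip())
print(parser.description)
```

The standard-library parser is less forgiving of badly broken markup than BeautifulSoup or lxml, so treat this as a fallback.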