Extracting content from specific, unclosed meta tags with BeautifulSoup

Date: 2013-08-08 19:20:32

Tags: python beautifulsoup

I am trying to parse the content of some specific meta tags. Here is the structure of the meta tags. The first two are self-closing (they end with a slash), but the rest have no closing tag at all. Once I reach the 3rd meta tag, the entire content between the <head> tags is returned. I also tried soup.findAll(text=re.compile('keyword')), but that returns nothing, since the keyword is an attribute of the meta tag.

<meta name="csrf-param" content="authenticity_token"/>
<meta name="csrf-token" content="OrpXIt/y9zdAFHWzJXY2EccDi1zNSucxcCOu8+6Mc9c="/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
<meta content='en_US' http-equiv='Content-Language'>
<meta content='c2y_K2CiLmGeet7GUQc9e3RVGp_gCOxUC4IdJg_RBVo' name='google-site-verification'>
<meta content='initial-scale=1.0,maximum-scale=1.0,width=device-width' name='viewport'>
<meta content='notranslate' name='google'>
<meta content="Learn about Uber's product, founders, investors and team. Everyone's Private Driver - Request a car from any mobile phone—text message, iPhone and Android apps. Within minutes, a professional driver in a sleek black car will arrive curbside. Automatically charged to your credit card on file, tip included." name='description'>

Here is the code:

import csv
import re
import sys
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req3 = Request("https://angel.co/uber", headers={'User-Agent': 'Mozilla/5.0'})
page3 = urlopen(req3).read()
soup3 = BeautifulSoup(page3)

## This returns the entire web page since the META tags are not closed
desc = soup3.findAll(attrs={"name":"description"}) 

5 Answers:

Answer 0 (score: 21):

Edit: added a case-insensitive regex, as suggested by @Albert Chen.

Though I'm not sure whether it will work for every page:

from bs4 import BeautifulSoup
import re
from urllib.request import urlopen

page3 = urlopen("https://angel.co/uber").read()
soup3 = BeautifulSoup(page3, 'html.parser')

desc = soup3.findAll(attrs={"name": re.compile(r"description", re.I)})
print(desc[0]['content'])

Yields:

Learn about Uber's product, founders, investors and team. Everyone's Private Dri
ver - Request a car from any mobile phone—text message, iPhone and Android app
s. Within minutes, a professional driver in a sleek black car will arrive curbsi
de. Automatically charged to your credit card on file, tip included.

Answer 1 (score: 4):

The name attribute is case-sensitive, so we need to look for both 'Description' and 'description'.

Case 1: 'Description' on Flipkart.com

Case 2: 'description' on Snapdeal.com

from bs4 import BeautifulSoup
import requests

url = 'https://www.flipkart.com'
page3 = requests.get(url)
soup3 = BeautifulSoup(page3.text, 'html.parser')
desc = soup3.find(attrs={'name': 'Description'})
if desc is None:
    desc = soup3.find(attrs={'name': 'description'})
try:
    print(desc['content'])
except Exception as e:
    print('%s (%s)' % (e, type(e)))

Answer 2 (score: 2):

Try this (based on this blog post):

from bs4 import BeautifulSoup
...
desc = ""
for meta in soup.findAll("meta"):
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    # match name="description" (any case) or property values like "og:description"
    if metaname == 'description' or 'description' in metaprop:
        desc = meta['content'].strip()

Tested against the following variations:

  • <meta name="description" content="blah blah" />
  • <meta id="MetaDescription" name="DESCRIPTION" content="blah blah" />
  • <meta property="og:description" content="blah blah" />

Tested with BeautifulSoup version 4.4.1.
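The loop above can be exercised end to end on a small inline document; this is a minimal sketch, with the sample markup and variable names invented for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup covering the tested variations above (invented for illustration).
html = """<head>
<meta name="DESCRIPTION" content="blah blah">
<meta property="og:description" content="social blah">
</head>"""

soup = BeautifulSoup(html, 'html.parser')
desc = ""
for meta in soup.findAll("meta"):
    metaname = meta.get('name', '').lower()
    metaprop = meta.get('property', '').lower()
    if metaname == 'description' or 'description' in metaprop:
        desc = meta['content'].strip()
        break  # stop at the first matching tag
print(desc)  # -> blah blah
```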

Answer 3 (score: 1):

I think using a regexp is a better fit here. For example:

import re
import requests
from bs4 import BeautifulSoup

resp = requests.get(url)  # url: the page to fetch
soup = BeautifulSoup(resp.text, 'html.parser')
desc = soup.find_all(attrs={"name": re.compile(r'Description', re.I)})

Answer 4 (score: 0):

As ingo suggested, you can use a more lenient parser, such as html5lib.

soup3 = BeautifulSoup(page3, 'html5lib')

But make sure the python-html5lib parser is installed on your system.
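A minimal sketch of that approach on the question's unclosed meta tags, falling back to the stdlib parser if html5lib is not installed (the sample markup is trimmed for illustration):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Unclosed meta tags, as in the question's markup.
html = ("<head><meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>"
        "<meta content='hi' name='description'></head>")

try:
    soup = BeautifulSoup(html, 'html5lib')     # lenient, browser-grade parser
except FeatureNotFound:
    soup = BeautifulSoup(html, 'html.parser')  # stdlib fallback

desc = soup.find('meta', attrs={'name': 'description'})
print(desc['content'])  # -> hi
```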