Extracting content

Date: 2014-10-04 12:17:20

Tags: python python-2.7 beautifulsoup

1/ I am trying to extract part of a script using Beautiful Soup, but it prints nothing. What is wrong?

URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

for script in soup("script"):
        script.extract()

list_of_scripts = soup.findAll("script")
print list_of_scripts

2/ The goal is to extract the value of the "transcript" property:

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",  
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"Immediately my whole mouth was on fire.\"               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               \"Bottom line, so no one ever has to go through this again.\"               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

2 answers:

Answer 0 (score: 21)

From the documentation:

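As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be "text", since those tags are not part of the human-visible content of the page.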

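So, basically, the accepted answer from falsetru is fine, but with newer versions of Beautiful Soup use .string instead of .text. Otherwise you will be as puzzled as I was when .text always came back empty for <script> tags.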
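A minimal sketch of the .string approach, written in Python 3 syntax and assuming the beautifulsoup4 package is installed. The inline HTML is a shortened, hypothetical stand-in for the Reuters page, so no network access is needed:

import json

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical stand-in for the downloaded page (not the real Reuters markup).
html = """<html><head>
<script type="application/ld+json">
{"@type": "VideoObject", "video": {"transcript": "Jan Harding is speaking out."}}
</script>
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the tag's single child string, which for a <script>
# tag is the script body itself.
data = json.loads(tag.string)
transcript = data["video"]["transcript"]
print(transcript)

Passing the parser name explicitly ("html.parser") keeps the behaviour the same regardless of which optional parsers are installed.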

Answer 1 (score: 18)

extract removes the tag from the DOM. That is why you get an empty list.
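To see the effect in isolation, here is a small sketch in Python 3 syntax, assuming the beautifulsoup4 package is installed. The inline HTML is a made-up stand-in for the real page:

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Hypothetical stand-in for the downloaded page.
html = """<html><head>
<script type="application/ld+json">{"video": {"transcript": "hello"}}</script>
<script>var x = 1;</script>
</head><body><p>Visible text</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")
found_before = soup.find_all("script")

# Same loop as in the question: extract() detaches each tag from the tree.
for script in soup("script"):
    script.extract()

found_after = soup.find_all("script")

print(len(found_before))  # 2
print(found_after)        # []

The tags collected before the loop still exist as Python objects, but they are no longer part of soup, so any later search on soup cannot find them.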


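Find the script tag with the attribute type="application/ld+json" and decode its contents using json.loads. Then you can access the data like any Python data structure (a dict for the given data).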

import json
import urllib2

from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl= urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)

data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']