Question

I am trying to get the body tag from http://feeds.reuters.com/~r/reuters/technologyNews/~3/ZyAuZq5Cbz0/story01.htm

but BeautifulSoup cannot find it. Is this because the HTML is invalid? If so, how can I prevent it?

I also tried using PyTidyLib (http://countergram.com/open-source/pytidylib/docs/index.html) to fix the HTML errors beforehand.

Here is some code:

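import urllib2
from bs4 import BeautifulSoup as bs
from tidylib import tidy_document

# Assumed: the original snippet does not show how `opener` was created.
opener = urllib2.build_opener()

def getContent(url, parser="lxml"):
    request = urllib2.Request(url)
    try:
        response = opener.open(request).read()
    except Exception:
        # Any fetch failure is treated as empty content.
        print 'EMPTY CONTENT', url
        return None
    # tidy_document returns the cleaned markup and a string of warnings/errors.
    doc, errors = tidy_document(response)
    # Pass the requested parser through instead of silently using the default.
    return parse(url, doc, parser)

def parse(url, response, parser="lxml"):
    try:
        soup = bs(response, parser)
    except UnicodeDecodeError as e:
        # Retry once with html5lib before giving up.
        if parser == "lxml":
            return parse(url, response, "html5lib")
        else:
            print e, url
            print 'EMPTY CONTENT', url
            return None

    body = soup.body
    ...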

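When I print the soup, I can see the opening and closing body tags, but after body = soup.body I get None.

I am using Python 2.7.3 and BeautifulSoup4. It seems to work with BeautifulSoup3, but I need to stick with BS4 for performance reasons.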
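
Note that the fallback in parse() only triggers on UnicodeDecodeError, while a missing body raises nothing: soup.body is simply None. A minimal sketch of one way to probe whether the parser is the cause is to try each installed parser in turn and keep the first one that yields a body. The helper name find_body, the parser order, and the inline sample markup are assumptions for illustration, not part of the original code, and the sketch uses Python 3 syntax rather than the 2.7 used above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_body(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (body_tag, parser_name) from the first parser that yields a <body>.

    Returns (None, None) if no installed parser produces one.
    find_body is a hypothetical helper, not part of bs4.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The parser library is not installed; try the next one.
            continue
        if soup.body is not None:
            return soup.body, name
    return None, None


# Inline sample markup (an assumption; the Reuters page is not fetched here).
sample = "<html><head><title>t</title></head><body><p>hello</p></body></html>"
body, used = find_body(sample)
print(used, body.p.get_text())
```

If every parser returns None for the real page, the problem is more likely in the markup handed to BeautifulSoup (for example after tidying) than in the choice of parser.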

Answer 1

I finally got it running. Here is the code:

import urllib2
from lxml import html

url = "http://www.reuters.com/article/2013/04/17/us-usa-immigration-tech-idUSBRE93F1DL20130417?feedType=RSS&feedName=technologyNews"
response = urllib2.urlopen(url).read().decode("utf-8")
test = html.fromstring(response)

for p in test.body.iter('p'):
    print p.text_content()
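
For readers on Python 3, where urllib2 no longer exists, the same approach might be sketched as follows. The function names fetch and paragraph_texts and the inline sample document are assumptions for illustration, and the page is assumed to be UTF-8 encoded as in the code above. The parsing step is kept separate from the download so it can be tried without network access.

```python
from urllib.request import urlopen

from lxml import html


def fetch(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    return urlopen(url).read().decode("utf-8")


def paragraph_texts(markup):
    """Return the text of every <p> inside <body>, in document order."""
    doc = html.fromstring(markup)
    return [p.text_content() for p in doc.body.iter("p")]


# Inline sample document (an assumption; no network access is needed).
sample = (
    "<html><head><title>t</title></head>"
    "<body><p>First <b>bold</b></p><div><p>Second</p></div></body></html>"
)
for text in paragraph_texts(sample):
    print(text)
```

With a live page, paragraph_texts(fetch(url)) would combine the two steps.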

Extracting tags from BeautifulSoup
