Scrapy plain text error

Time: 2015-01-30 05:34:24

Tags: python web-scraping scrapy

I am using Python Scrapy. I want to extract the text from a web page without the HTML tags. Below is my code (I got the idea from this page: How can I get all the plain text from a website with Scrapy?):

        sel = Selector(response)
        item = DeletespiderItem()
        item['url'] =  response.url
        description = sel.select("//body").extract()
        tree = lxml.html.fromstring(description)
        item['description'] = tree.text_content().strip()
        yield item

But I get the following error:

  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 722, in fromstring
    is_full_html = _looks_like_full_html_unicode(html)
exceptions.TypeError: expected string or buffer

What is wrong with my code? How can I get the plain text from it?

Can anyone help me? Thanks.

Update:

Scrapy shell (https://stackoverflow.com/questions/23156780/how-can-i-get-all-the-plain-text-from-a-website-with-scrapy):

sel.select("//body").extract()[0].strip()

o/p: \r\n\r\n\r\n\r\n\r\n\r\nChat\r\n]

Why is it adding the extra \r\n?

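1 answer:

Answer 0 (score: 1)

extract() returns a list, while lxml.html.fromstring() expects a single string, which is why it raises the TypeError. Take the first element: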

description = sel.select("//body").extract()[0]
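
To put the fix together with the \r\n problem from the update, here is a minimal sketch. The helper names collapse_whitespace and plain_text_from_extract are hypothetical, not part of Scrapy or lxml, and the sample HTML is made up for illustration. It assumes that collapsing every run of whitespace into a single space is acceptable for the description field.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace characters with a single space."""
    # str.split() with no argument splits on spaces, tabs, \r and \n alike,
    # and drops leading and trailing whitespace.
    return " ".join(text.split())


def plain_text_from_extract(fragments):
    """Turn the list returned by extract() into whitespace-normalized text."""
    if not fragments:
        # extract() returns an empty list when the XPath matched nothing.
        return ""
    # Imported here so collapse_whitespace() stays usable without lxml.
    import lxml.html

    # fromstring() needs one string, so pass the first list element only.
    tree = lxml.html.fromstring(fragments[0])
    return collapse_whitespace(tree.text_content())


if __name__ == "__main__":
    # Made-up sample standing in for sel.select("//body").extract()
    sample = ["<body>\r\n\r\n<p>Hello</p>\r\n<p>Chat</p>\r\n</body>"]
    print(plain_text_from_extract(sample))
```

In the spider from the question this could be used as item['description'] = plain_text_from_extract(sel.select("//body").extract()). The empty-list check avoids an IndexError on pages where the XPath matches nothing.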