I am trying to retrieve some items while crawling with Scrapy and parsing the pages with readability. I wrote this code:
titles = response.xpath("//a[@class='media__link']").extract()
#titles = response.xpath('//a/@href').extract()
print ("%d links was found" %len(titles))
count=0
for title in titles:
    item = TutsplusItem()
    item["title"] = title
    print("Title is : %s" %title)
    yield item
    titleInner = Document(title)
    link = titleInner.xpath("//a/@href")
    link = "http://www.bbc.com" + link
    response = requests.get(link)
    doc = Document(response)
    title=doc.xpath("//title/text()")
    headline=doc.xpath("//p[@class='story-body__introduction']/text()")
    bodyText=doc.xpath("//div[class='story-body__inner']/text()")
However, I get an error when I run the XPath query on the readability Document at this line:
link = titleInner.xpath("//a/@href")
The error is:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Mehdi\PycharmProjects\WebCrawler\src\Crawler.py", line 69, in parse
    link = titleInner.xpath("//a/@href")
TypeError: Type '' cannot be serialized.
I can't figure out what the problem is.
Answer 0 (score: 0)
I ended up dropping readability and using lxml instead!
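A minimal sketch of what that could look like, assuming each extracted `title` is the HTML string of an `<a>` element, as in the question. The helper names `extract_link` and `extract_article` are hypothetical and not part of Scrapy or lxml. The class name `story-body__introduction` is taken from the question and may not match the current BBC markup.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the question

from lxml import html

BASE_URL = "http://www.bbc.com"  # same base URL as in the question


def extract_link(anchor_html, base_url=BASE_URL):
    """Return the absolute URL of the first <a> in an HTML fragment, or None.

    Hypothetical helper for illustration only.
    """
    fragment = html.fromstring(anchor_html)
    # xpath() returns a list of attribute values, so take the first match
    # instead of concatenating the list with a string.
    hrefs = fragment.xpath("descendant-or-self::a/@href")
    return urljoin(base_url, hrefs[0]) if hrefs else None


def extract_article(page_html):
    """Pull the title and introduction out of an article page's HTML text.

    Hypothetical helper; the class name comes from the question.
    """
    doc = html.fromstring(page_html)
    return {
        # string(...) yields the text content of the first matching node.
        "title": doc.xpath("string(//title)").strip(),
        "headline": doc.xpath(
            "string(//p[@class='story-body__introduction'])"
        ).strip(),
    }
```

Inside the spider loop, this might be used as `link = extract_link(title)` followed by `extract_article(requests.get(link).text)`. Note that the parser is given the response's `.text`, not the response object itself.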