Question

I am trying to retrieve some items while parsing pages with readability and Scrapy. I wrote this code:

    titles = response.xpath("//a[@class='media__link']").extract()
    #titles = response.xpath('//a/@href').extract()
    print ("%d links was found" %len(titles))


    count=0
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" %title)
      yield item
      titleInner = Document(title)
      link = titleInner.xpath("//a/@href")
      link = "http://www.bbc.com" + link
      response = requests.get(link)
      doc = Document(response)

      title=doc.xpath("//title/text()")
      headline=doc.xpath("//p[@class='story-body__introduction']/text()")
      bodyText=doc.xpath("//div[class='story-body__inner']/text()")

However, I get an error when I run xpath on the readability Document at this line:

link = titleInner.xpath("//a/@href")

The error is:

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy-1.3.1-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Mehdi\PycharmProjects\WebCrawler\src\Crawler.py", line 69, in parse
    link = titleInner.xpath("//a/@href")
TypeError: Type '' cannot be serialized.

I can't work out what the problem is.

Answer 1

I ended up avoiding readability and using lxml directly!
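
A minimal sketch of what the lxml-only approach might look like, assuming Python 3 and that lxml is installed. The helper names extract_link and extract_article_fields are hypothetical and not from the original spider. The XPath expressions are taken from the question, with the missing @ added in the div[@class=...] predicate. Each string returned by .extract() is parsed with lxml.html.fromstring, and because xpath() returns a list, the first match is taken before building the URL:

```python
from urllib.parse import urljoin

import lxml.html

BASE_URL = "http://www.bbc.com"  # base URL used in the question


def extract_link(anchor_html, base_url=BASE_URL):
    """Return the absolute URL of the first <a href> in an HTML snippet, or None."""
    fragment = lxml.html.fromstring(anchor_html)
    # xpath() returns a list of attribute values, not a single string.
    hrefs = fragment.xpath("//a/@href")
    if not hrefs:
        return None
    # urljoin handles both relative and already-absolute hrefs.
    return urljoin(base_url, hrefs[0])


def extract_article_fields(page_html):
    """Pull title, headline and body text out of an article page (hypothetical helper)."""
    tree = lxml.html.fromstring(page_html)
    headline_parts = tree.xpath("//p[@class='story-body__introduction']//text()")
    body_parts = tree.xpath("//div[@class='story-body__inner']//text()")
    return {
        "title": tree.xpath("string(//title)").strip(),
        "headline": "".join(headline_parts).strip(),
        "body": "".join(body_parts).strip(),
    }
```

In the spider, the page would be fetched first and its text passed in, for example extract_article_fields(requests.get(link).text), since lxml parses a string or bytes rather than a requests response object.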
