Why aren't my defined items being populated and stored in Scrapy?

Time: 2013-07-22 19:01:13

Tags: python html-parsing web-scraping web-crawler scrapy

Suppose I have the following site structure:

  1. Start URLs: http://thomas.loc.gov/cgi-bin/query/z?c107:H.R%s : where %s is an index from 1-50 (a sample for illustration purposes).
  2. "First layer": bill text, or links to multiple versions...
  3. "Second layer": bill text with a link to a "Printer Friendly" (plain-text) version.
  4. End goal of the script:

    1. Navigate the start URLs; parse the URL, title & body; save them to a starts.txt file
    2. Extract the "first layer" links from the bodies of the start URLs; navigate to those links; parse the URL, title & body; save them to a bills.txt file
    3. Extract the "second layer" links from the bodies of the "first layer" URLs; navigate to those links; parse the URL, title & body; save them to a versions.txt file

Suppose I have the following script:

      from scrapy.item import Item, Field
      from scrapy.contrib.spiders import CrawlSpider, Rule
      from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
      from scrapy.selector import HtmlXPathSelector
      
      class StartItem(Item):
          url = Field()
          title = Field()
          body = Field()
      
      class BillItem(Item):
          url = Field()
          title = Field()
          body = Field()
      
      class VersionItem(Item):
          url = Field()
          title = Field()
          body = Field()
      
      class Lrn2CrawlSpider(CrawlSpider):
          name = "lrn2crawl"
          allowed_domains = ["thomas.loc.gov"]
          start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill
                        for bill in xrange(1, 51)]  ### Sample of 50 bills; total range of bills is 1-5767
      
          rules = (
                  # Extract links matching the /query/D fragment (restricting to those inside the content body of the URL); follow; & scrape all bill text.
                  # and follow links from them (since no callback means follow=True by default).
                  # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
                  Rule(SgmlLinkExtractor(allow=(r'/query/D'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),
      
                  # Extract links in the body of a bill-version & follow them.
                  # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
                  Rule(SgmlLinkExtractor(allow=(r'/query/C'), restrict_xpaths=('//table/tr/td[2]/a/@href')), callback='parse_versions', follow=True)
              )
      
          def parse_start_url(self, response):
              hxs = HtmlXPathSelector(response)
              starts = hxs.select('//div[@id="content"]')
              scraped_starts = []
              for start in starts:
                  scraped_start = StartItem() ### Start object defined previously
                  scraped_start['url'] = response.url
                  scraped_start['title'] = start.select('//h1/text()').extract()
                  scraped_start['body'] = response.body
                  scraped_starts.append(scraped_start)
                  with open('starts.txt', 'a') as f:
                      f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_start['url'], scraped_start['title'], scraped_start['body']))
              return scraped_starts
      
          def parse_bills(self, response):
              hxs = HtmlXPathSelector(response)
              bills = hxs.select('//div[@id="content"]')
              scraped_bills = []
              for bill in bills:
                  scraped_bill = BillItem() ### Bill object defined previously
                  scraped_bill['url'] = response.url
                  scraped_bill['title'] = bill.select('//h1/text()').extract()
                  scraped_bill['body'] = response.body
                  scraped_bills.append(scraped_bill)
                  with open('bills.txt', 'a') as f:
                      f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_bill['url'], scraped_bill['title'], scraped_bill['body']))
              return scraped_bills
      
          def parse_versions(self, response):
              hxs = HtmlXPathSelector(response)
              versions = hxs.select('//div[@id="content"]')
              scraped_versions = []
              for version in versions:
                  scraped_version = VersionItem() ### Version object defined previously
                  scraped_version['url'] = response.url
                  scraped_version['title'] = version.select('//h1/text()').extract()
                  scraped_version['body'] = response.body
                  scraped_versions.append(scraped_version)
                  with open('versions.txt', 'a') as f:
                      f.write('url: {0}, title: {1}, body: {2}\n'.format(scraped_version['url'], scraped_version['title'], scraped_version['body']))
              return scraped_versions
      

This script appears to be doing everything I want, except navigating to the "second layer" links and parsing the items (URL, title, and body) of those pages. In other words, Scrapy is not crawling or parsing my "second layer".

To restate my question more simply: why isn't Scrapy populating my VersionItem and writing it out to my desired file, versions.txt?

1 Answer:

Answer 0 (score: 1):

The problem is in the restrict_xpaths setting of the second SgmlLinkExtractor. Change it to:

    restrict_xpaths=('//table/tr/td[2]/a',)

Hope that helps.
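For context on why the original XPath fails: restrict_xpaths tells the link extractor which regions of the page to scan for anchor tags, so it must select elements. An attribute path like `.../a/@href` extracts only the href string, which contains no markup to scan, so the rule silently yields no links. One way to confirm this is in the Scrapy shell; below is a minimal sketch (not from the original post) using the contrib-era SgmlLinkExtractor API, with the first start URL from the question:

    # Run: scrapy shell "http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.1:"
    # then paste the following; `response` is provided by the shell session.
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # Fixed extractor: restrict_xpaths selects the <a> elements themselves,
    # so there are element regions to scan for links.
    fixed = SgmlLinkExtractor(allow=(r'/query/C',),
                              restrict_xpaths=('//table/tr/td[2]/a',))
    print(fixed.extract_links(response))   # expected: the version links

    # Original (buggy) extractor: '@href' selects attribute nodes, not elements,
    # so there is no markup to scan and no links are ever extracted.
    broken = SgmlLinkExtractor(allow=(r'/query/C',),
                               restrict_xpaths=('//table/tr/td[2]/a/@href',))
    print(broken.extract_links(response))  # expected: []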