Why won't Scrapy crawl or parse?

Asked: 2013-07-12 00:58:02

Tags: python python-2.7 web-scraping scrapy

I'm trying to scrape the Library of Congress THOMAS site. This Python script is meant to access a sample of 40 bills on the site (the #1-40 identifiers in the URLs). I want to parse the body of each piece of legislation, search the body/content, extract links to the (potentially multiple) versions & follow them.

On a version page, I want to parse the body of each version of the legislation, search the body/content, extract links to the (potential) sections & follow them.

Once on a section page, I want to parse the body of each section of the bill.

I believe there is a problem in the Rules/LinkExtractor portion of my code (a quick shell sanity check appears after the script below). The Python script executes and fetches the start URLs, but it does not parse them or carry out any of the subsequent tasks.

Three complications:

  1. Some bills do not have multiple versions (and therefore there are no links in the body portion of that URL).
  2. Some bills do not have linked sections because they are so short, while some bills are nothing but links to sections.
  3. Some section links contain more than just section-specific content; most of that content is a redundant inclusion of prior or subsequent section content.

My question is: why won't Scrapy crawl or parse?

    from scrapy.item import Item, Field
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    
    class BillItem(Item):
        title = Field()
        body = Field()
    
    class VersionItem(Item):
        title = Field()
        body = Field()
    
    class SectionItem(Item):
        body = Field()
    
    class Lrn2CrawlSpider(CrawlSpider):
        name = "lrn2crawl"
        allowed_domains = ["thomas.loc.gov"]
        start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill
                      for bill in xrange(000001, 00040, 00001)]  ### Sample of 40 bills; total range of bills is 1-5767
    
    rules = (
            # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
            # and follow links from them (since no callback means follow=True by default).
            # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
            Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),
    
            # Extract links in the body of a bill-version & follow them.
            # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
        )
    
    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem() ### Bill object defined previously
            scraped_bill['title'] = bill.select('p/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
        return scraped_bills
    
    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem() ### Version object defined previously
            scraped_version['title'] = version.select('center/b/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
        return scraped_versions
    
    def parse_sections(self, response):
        hxs = HtmlXPathSelector(response)
        sections = hxs.select('//div[@id="content"]')
        scraped_sections = []
        for section in sections:
            scraped_section = SectionItem() ### Section object defined previously
            scraped_section['body'] = response.body
            scraped_sections.append(scraped_section)
        return scraped_sections
    
    spider = Lrn2CrawlSpider()
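
As a sanity check, the first Rule's link extractor can be tested in isolation with scrapy shell. Below is a sketch, using the same Scrapy 0.x SgmlLinkExtractor API as the script above; the response object is supplied by the shell:

    # Run inside: scrapy shell "http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.1:"
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    le = SgmlLinkExtractor(allow=(r'/query/',),
                           restrict_xpaths=('//div[@id="content"]',))
    for link in le.extract_links(response):
        print link.url  # the URLs the first Rule would queue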
    

2 Answers:

Answer 0 (score: 1)

Just for the record, the problem with your script is that the variable rules is not in the scope of Lrn2CrawlSpider, because it does not share the class's indentation. When alecxe fixed the indentation, rules became an attribute of the class; later, the inherited __init__() method reads that attribute, compiles the rules, and enforces them:

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()
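
To make the scoping point concrete, here is a minimal sketch (hypothetical spiders, not the asker's code) contrasting the broken and the fixed layout:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class BrokenSpider(CrawlSpider):
        name = "broken"

    # Module level: CrawlSpider's _compile_rules() never sees this variable,
    # so the spider fetches start_urls and then stops.
    rules = (Rule(SgmlLinkExtractor(allow=(r'/query/',)), follow=True),)

    class FixedSpider(CrawlSpider):
        name = "fixed"
        # Class attribute: found and compiled by the inherited __init__() above.
        rules = (Rule(SgmlLinkExtractor(allow=(r'/query/',)), follow=True),)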

Removing the last line has nothing to do with this.

Answer 1 (score: 0)

I just fixed the indentation, removed the spider = Lrn2CrawlSpider() line at the end of the script, ran the spider via scrapy runspider lrn2crawl.py, and it scrapes, follows links, and returns items - your rules work. (Note also that zero-padded integer literals are octal in Python 2, so xrange(000001, 00040, 00001) is really xrange(1, 32) and covers only bills 1-31; the version below uses xrange(1, 41) for the intended 40-bill sample.)

Here's what I'm running:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class BillItem(Item):
    title = Field()
    body = Field()

class VersionItem(Item):
    title = Field()
    body = Field()

class SectionItem(Item):
    body = Field()

class Lrn2CrawlSpider(CrawlSpider):
    name = "lrn2crawl"
    allowed_domains = ["thomas.loc.gov"]
    start_urls = ["http://thomas.loc.gov/cgi-bin/query/z?c107:H.R.%s:" % bill
                  for bill in xrange(1, 41)]  ### Sample of 40 bills; total range of bills is 1-5767

    rules = (
            # Extract links matching the /query/ fragment (restricted to those inside the content body of the page)
            # and follow links from them (since no callback means follow=True by default).
            # Desired result: scrape all bill text & in the event that there are multiple versions, follow them & parse.
            Rule(SgmlLinkExtractor(allow=(r'/query/'), restrict_xpaths=('//div[@id="content"]')), callback='parse_bills', follow=True),

            # Extract links in the body of a bill-version & follow them.
            # Desired result: scrape all version text & in the event that there are multiple sections, follow them & parse.
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div/a[2]')), callback='parse_versions', follow=True)
        )

    def parse_bills(self, response):
        hxs = HtmlXPathSelector(response)
        bills = hxs.select('//div[@id="content"]')
        scraped_bills = []
        for bill in bills:
            scraped_bill = BillItem() ### Bill object defined previously
            scraped_bill['title'] = bill.select('p/text()').extract()
            scraped_bill['body'] = response.body
            scraped_bills.append(scraped_bill)
        return scraped_bills

    def parse_versions(self, response):
        hxs = HtmlXPathSelector(response)
        versions = hxs.select('//div[@id="content"]')
        scraped_versions = []
        for version in versions:
            scraped_version = VersionItem() ### Version object defined previously
            scraped_version['title'] = version.select('center/b/text()').extract()
            scraped_version['body'] = response.body
            scraped_versions.append(scraped_version)
        return scraped_versions

    def parse_sections(self, response):
        hxs = HtmlXPathSelector(response)
        sections = hxs.select('//div[@id="content"]')
        scraped_sections = []
        for section in sections:
            scraped_section = SectionItem() ### Section object defined previously
            scraped_section['body'] = response.body
            scraped_sections.append(scraped_section)
        return scraped_sections
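
One caveat: parse_sections is defined but never invoked, because no Rule names it as a callback, so section pages are at best handled by parse_versions. If sections should be parsed separately, a third Rule is needed. A sketch follows; the restrict_xpaths expression here is an assumption and would have to be checked against the actual section-page markup:

    # Hypothetical third rule (it would go inside the rules tuple above);
    # the restrict_xpaths value is a guess, not verified against THOMAS.
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="content"]//a',)),
         callback='parse_sections', follow=True),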

Hope that helps.