Displaying KWIC in a Scrapy project

Time: 2015-04-07 10:54:08

Tags: python scrapy

Good afternoon,

I'm trying to write a spider that gives me keyword results in context (KWIC). I found this link: python truncate text around keyword, which explains one approach, and I've adapted it a little to suit my needs, but I'm now no longer getting any data into my database. (The only thing that changed is that I added code to grab the KWIC as one of my items.)
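
For reference, here is a minimal standalone sketch of the truncation approach from that link, with a made-up sample sentence to show how it is meant to be called: the text being searched goes first, the keyword pattern second, and escape=False keeps the | alternation working as a regex:

import re

def find_with_context(haystack, needle, context_length, escape=True):
    # returns (left context, match, right context) tuples for every hit,
    # keeping up to context_length characters on each side of the keyword
    if escape:
        needle = re.escape(needle)
    pattern = r'\b(.{,%d})\b(%s)\b(.{,%d})\b' % (context_length, needle, context_length)
    return re.findall(pattern, haystack, re.IGNORECASE)

# made-up sample text, purely for illustration
text = "Police said the children were moved out of Joburg overnight."
print(find_with_context(text, "joburg|durban|children", 20, escape=False))
# [('Police said the ', 'children', ' were moved out of '), ('', 'Joburg', ' overnight')]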

My spider:

# tabbing in python is apparently VERY important so be aware and make sure 
# things that should line up do so

# import the CrawlSpider class, along with its Rule class, (this lets us
# recursively crawl pages)

from scrapy.contrib.spiders import CrawlSpider, Rule

#import the link extractor, this extracts links from pages

from scrapy.contrib.linkextractors import LinkExtractor

# import our items as defined in items.py

from basic.items import BasicItem

# import re, which lets us match strings against regular expressions

import re

# create a new Spider with the CrawlSpider Class

class BasicSpiderSpider(CrawlSpider):

    # Name of the spider, this is used to run it, (i.e. scrapy crawl basic_spider)

    name = "basic_spider"

    # domains that the spider is allowed to crawl over

    allowed_domains = ["news24.com"]

    # where to start crawling from

    start_urls = [
        'http://www.news24.com/SouthAfrica/News/Six-things-to-know-about-the-illegal-mining-boom-20140626',
    ]

    # Rules for the link extractor, (i.e. where it's allowed to look for links,
    # what to do once it's found them, and whether it's allowed to follow them)

    rules = (
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    )

    # defining the callback function

    def parse_items(self, response):

        # defines the Top level XPath where all of our information can be found, needs to be
        # as specific as possible to avoid duplicates

        for title in response.xpath('//*[@id="aspnetForm"]'):

            # List of keywords to search through.

            key = re.compile("joburg|durban|children", re.IGNORECASE)

            # extracting the data to compare with the keywords, this is for the 
            # headlines, the join converts it from a list type to a string type

            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # and this is for the article.

            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # if any keywords are found in the headline:

            if key.search(head):

                    # define the top level xpath again as python won't look outside
                    # its current function

                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # using this because I'm wanting to show the context of
                        # returned articles, found this here:
                        # https://stackoverflow.com/questions/4319473/python-truncate-text-around-keyword

                        def find_with_context(haystack, needle, context_length, escape=True):
                            if escape:
                                needle = re.escape(needle)
                            # re.IGNORECASE so the context search matches the same text
                            # that the case-insensitive keyword search above matched
                            return re.findall(r'\b(.{,%d})\b(%s)\b(.{,%d})\b' % (context_length, needle, context_length), haystack, re.IGNORECASE)


                        # fills the items defined in items.py with relevant data

                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()

                        # calls find_with_context; the haystack (article text) goes
                        # first, then the keyword pattern, and escape=False so the
                        # | alternation is treated as a regex rather than literal text

                        item["Article"] = find_with_context(art, "joburg|durban|children", 50, escape=False)
                        # item["Article"] = thing.xpath('//*[@id="article-body"]//text()').extract
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        # I found that even with being careful about my XPaths I
                        # still got empty fields and lines, the below fixes that

                        if item['Headline'] and item["Article"] and item["Date"]:
                            yield item

            # if the headline item doesn't match, check the article item.

            elif key.search(art):
                #if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # using this because I'm wanting to show the context of
                        # returned articles, found this here:
                        # https://stackoverflow.com/questions/4319473/python-truncate-text-around-keyword

                        def find_with_context(haystack, needle, context_length, escape=True):
                            if escape:
                                needle = re.escape(needle)
                            # re.IGNORECASE again, to match the keyword search above
                            return re.findall(r'\b(.{,%d})\b(%s)\b(.{,%d})\b' % (context_length, needle, context_length), haystack, re.IGNORECASE)


                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        #item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Article"] = find_with_context("joburg|durban|children", (art), 50)
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        if item['Headline'] and item["Article"] and item["Date"]:
                            yield item
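
For completeness, the spider assumes an items.py along these lines; this is a sketch inferred from the fields the spider fills in, so your actual file may differ:

from scrapy.item import Item, Field

# item fields matching what parse_items populates above

class BasicItem(Item):
    Headline = Field()
    Article = Field()
    Date = Field()
    Link = Field()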

My log file can be downloaded here: http://1drv.ms/1y0umXn

It looks like it no longer recognizes the page http://www.news24.com/SouthAfrica/News/Gunshots-screaming-in-Manenberg-overnight-20150407 as matching the search criteria.

Any help would be greatly appreciated.

Kind regards, Grant

1 Answer:

Answer 0 (score: 0)

Looking at the log, I can see that the request for that URL was filtered out:

2015-04-07 12:26:17+0200 [basic_spider] DEBUG: Ignoring link (depth > 1): http://www.news24.com/SouthAfrica/News/Gunshots-screaming-in-Manenberg-overnight-20150407

The reason is its depth, and a lot of other requests are being filtered out for the same reason; probably not all of them should be.

At the start of the log we can see:

2015-04-07 12:26:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'basic.spiders', 'DEPTH_LIMIT': 1, 'SPIDER_MODULES': ['basic.spiders'], 'BOT_NAME': 'basic', 'LOG_FILE': 'log.log', 'DOWNLOAD_DELAY': 0.25}

which means you have 'DEPTH_LIMIT': 1 set. Try removing this setting, or setting it to a higher value, and see whether that gives the expected result.
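
As a minimal sketch, assuming the setting lives in your project's settings.py, the change would look something like this (DEPTH_LIMIT defaults to 0, which means no depth limit at all):

# settings.py -- a sketch; only the DEPTH_LIMIT line actually needs to change

BOT_NAME = 'basic'
SPIDER_MODULES = ['basic.spiders']
NEWSPIDER_MODULE = 'basic.spiders'
DOWNLOAD_DELAY = 0.25
LOG_FILE = 'log.log'

# DEPTH_LIMIT = 1      # this is what was filtering out the request above
DEPTH_LIMIT = 3        # follow links up to 3 hops from start_urls; remove
                       # the line entirely (default 0) for no limit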