Scrapy: filtering URLs extracted from a scraped web page

Asked: 2015-01-04 06:01:32

Tags: python csv web-scraping scrapy fwrite

OK, so I'm using Scrapy. I'm currently trying to scrape "snipplr.com/all/page" and extract the URLs on the page. On the next run of the spider I then want to filter out URLs that were already extracted, by reading them back from a csv file. That's the plan, but somehow I'm hitting a bug where the results get overwritten.

Process: crawl the page for links > check the csv file (was the link extracted in a previous run?) > if yes, IgnoreRequest / drop the item, else append it to the csv file

Spider code:

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]

    def start_requests(self):
        #for i in xrange(1000):
        for i in range(2, 5):
            yield self.make_requests_from_url("http://www.snipplr.com/all/page/%d" % i)

    def parse(self, response):
        for sel in response.xpath('//ol/li/h3'):
            item = DmozItem()
            #item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a[last()]/@href').extract()
            #item['desc'] = sel.xpath('text()').extract()

            reader = csv.reader(open('items.csv', 'w+')) #think of it as a list
            for row in reader:
                if item['link'] == row:
                    raise IgnoreRequest()

                else:
                    f = open('items.csv', 'w')
                    f.write(item['link'])
            yield item

However, I'm getting strange results: when I next crawl a different set of pages, the results overwrite each other. Instead, I want the results appended to the file, not overwriting it:

       clock/
/view/81327/chatting-swing-gui-tcp/
/view/82731/automate-system-setup/
/view/81215/rmi-factorial/
/view/81214/tcp-addition/
/view/81213/hex-octal-binary-calculator/
/view/81188/abstract-class-book-novel-magazine/
/view/81187/data-appending-to-file/
/view/81186/bouncing-ball-multithreading/
/view/81185/stringtokenizer/
/view/81184/prime-and-divisible-by-3/
/view/81183/packaging/
/view/81182/font-controller/
/view/81181/multithreaded-server-and-client/
/view/81180/simple-calculator/
/view/81179/inner-class-program/
/view/81114/cvv-dumps-paypals-egift-cards-tracks-wu-transfer-banklogins-/
/view/81038/magento-social-login/
/view/81037/faq-page-magento-extension/
/view/81036/slider-revolution-responsive-magento-extension/
/view/81025/bugfix-globalization/

There are probably mistakes in the code, so feel free to edit it as needed to correct it. Thanks for the help.

Edit: fixed typos

2 answers:

Answer 0 (score: 3):

You are actually doing this in the wrong place: writing out the scraped data should be done in an Item Pipeline.

Well, it would be better to use a regular database and filter out duplicates with a database constraint (a sketch of that alternative follows at the end of this answer), but in any case, if you still want to use a csv file: create a pipeline that first reads the existing contents and remembers them for future checks; every item the spider yields is then checked against that set, and written out only if it has not been seen before:

import os

from scrapy.exceptions import DropItem


class CsvWriterPipeline(object):
    def __init__(self):
        # Remember every link written out in previous runs
        self.seen = set()
        if os.path.exists('items.csv'):
            with open('items.csv') as f:
                self.seen = set(line.strip() for line in f)

        # Append mode, so the existing contents are preserved
        self.file = open('items.csv', 'a')

    def process_item(self, item, spider):
        link = item['link']

        if link in self.seen:
            raise DropItem('Duplicate link found %s' % link)

        self.file.write(link + '\n')
        self.seen.add(link)

        return item

    def close_spider(self, spider):
        # Called when the spider finishes; flush and release the file
        self.file.close()

Add it to ITEM_PIPELINES to enable it:

ITEM_PIPELINES = {
    'myproject.pipelines.CsvWriterPipeline': 300
}

Your parse() callback would then only yield Items:

def parse(self, response):
    for sel in response.xpath('//ol/li/h3'):
        item = DmozItem()
        # take the matched href as a string, not a one-element list,
        # so the pipeline can compare and store it directly
        item['link'] = sel.xpath('a[last()]/@href').extract()[0]

        yield item
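
As an aside, the database alternative mentioned at the top of this answer could look like the following minimal sketch. It is illustrative only, assuming a SQLite file named items.db and a links table (both names invented for the example); the UNIQUE constraint does the duplicate filtering:

import sqlite3

from scrapy.exceptions import DropItem


class SqliteWriterPipeline(object):
    def open_spider(self, spider):
        # One connection per crawl; the UNIQUE constraint enforces dedup
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS links (link TEXT UNIQUE)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            with self.conn:  # commits the insert on success
                self.conn.execute('INSERT INTO links (link) VALUES (?)',
                                  (item['link'],))
        except sqlite3.IntegrityError:
            raise DropItem('Duplicate link found %s' % item['link'])
        return item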

Answer 1 (score: 0):

You are opening the file only to write it from the beginning, which truncates it. To append to a file you need to use 'a' or 'a+'.

Replace

f = open('items.csv', 'w')

with

f = open('items.csv', 'a')

From the BSD Library Functions Manual for fopen:

 The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subse-
         quent writes to the file will always end up at the then current
         end of file, irrespective of any intervening fseek(3) or similar.
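
To see the difference concretely, here is a small standalone demonstration (the demo.txt filename is made up for the example):

with open('demo.txt', 'w') as f:
    f.write('first\n')

with open('demo.txt', 'w') as f:   # 'w' truncates: 'first' is gone
    f.write('second\n')

with open('demo.txt', 'a') as f:   # 'a' appends after the existing contents
    f.write('third\n')

print(open('demo.txt').read())     # prints "second" then "third"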