Question

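I pass a timestamp to DynamoDB via a plugin I downloaded. The spider runs from cron every two minutes. Previously it took the timestamp from the website via XPath, so it was unique per article; now a new timestamp is generated on every run, so every run creates a new entry. Could you point me to a pipeline solution that checks whether the same url already exists, so the spider skips it altogether?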

My spider:

def parse(self, response):
    # Follow each article link found on the listing page.
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2/a/@href").extract()[0]
        # A fresh timestamp is generated on every run.
        stamp = Timestamp().timestamp
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article

My pipeline:

import boto3

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")
        table = dynamodb.Table('x')

        # Unconditional write: every item that reaches the pipeline is stored.
        table.put_item(
            Item={
                'url': str(item['url']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            }
        )
        return item

Answer 1

By default, Scrapy does not perform the same request more than once within a single crawl.

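For details, read up on the dont_filter parameter of Request: it defaults to False, and setting it to True makes the request bypass the duplicates filter.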

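As another solution, you can keep a list and check whether the title is already in it. I think checking for duplicates here is better than doing it in the pipeline, because for a duplicate you then skip the extra work you don't need.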

# yourArray: a list kept on the spider that holds the titles seen so far
title = response.xpath("//header/h1/text()").extract_first()
if title not in yourArray:
    article = ArticleItem()
    article['title'] = title
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yourArray.append(title)
    yield article
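
One caveat with an in-memory list: the spider is started by cron every two minutes, so each run is a new process and the list starts out empty. Below is a minimal sketch of one way around that, assuming it is acceptable to keep the seen URLs in a local JSON file. SeenUrls and the file name seen_urls.json are hypothetical names chosen for illustration; they are not part of Scrapy.

```python
import json
import os


class SeenUrls:
    """Set of URLs that survives between runs by living in a JSON file."""

    def __init__(self, path="seen_urls.json"):
        self.path = path
        self.urls = set()
        # Load whatever a previous run saved, if anything.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.urls = set(json.load(fh))

    def add_if_new(self, url):
        """Return True and remember the URL if it has not been seen before."""
        if url in self.urls:
            return False
        self.urls.add(url)
        return True

    def save(self):
        """Write the set back to disk; meant to be called once per run."""
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.urls), fh)
```

In parse, the check could go before the request is yielded, for example `if seen.add_if_new(url): yield scrapy.Request(...)`, with seen.save() called from the spider's closed() method. That way a known article is not even requested.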

Answer 2

After digging through Stack Overflow questions and the Boto3 documentation, I was able to come up with a solution:

import boto3

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")
        table = dynamodb.Table('x')

        # Conditional write: the item is stored only if no item with the
        # same key exists yet.
        table.put_item(
            Item={
                'link': str(item['link']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            },
            ConditionExpression='attribute_not_exists(link) AND attribute_not_exists(title)',
        )
        return item
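
When the condition fails, put_item raises botocore's ClientError with the error code ConditionalCheckFailedException, so a duplicate shows up as an exception in the pipeline. Below is a minimal sketch of catching just that case. put_if_new is a hypothetical helper name, the table is assumed to use link as its partition key, the condition is reduced to the key attribute alone, and title is stored as a plain string. The ImportError fallback is only there so the snippet can be tried without botocore installed.

```python
try:
    from botocore.exceptions import ClientError
except ImportError:
    # Stand-in with the same constructor shape, used only when botocore
    # is not installed (an assumption made for this sketch).
    class ClientError(Exception):
        def __init__(self, error_response, operation_name):
            super().__init__(operation_name)
            self.response = error_response


def put_if_new(table, item):
    """Store item unless its key already exists; return True if it was written."""
    try:
        table.put_item(
            Item={
                "link": str(item["link"]),
                "title": item["title"],
                "stamp": item["stamp"],
            },
            # '#k' is a placeholder for the key attribute name, which avoids
            # clashes with DynamoDB reserved words.
            ConditionExpression="attribute_not_exists(#k)",
            ExpressionAttributeNames={"#k": "link"},
        )
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # an item with this link is already stored
        raise  # anything else is a real error and should propagate
    return True
```

In process_item, a False return could be turned into `raise DropItem(...)` (DropItem lives in scrapy.exceptions) so that duplicates are dropped instead of being reported as errors.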

Scrapy: checking for duplicates in a pipeline
