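If condition in Scrapy

Time: 2019-07-02 22:21:26

Tags: python web web-scraping scrapy

I am using Scrapy to scrape the tags at a given URL and check whether the URL link in the tag matches the URL of the website. I want to export the results to CSV, with a column indicating whether there is a match.

I have the following code, but I'm not sure how to add the matching condition: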

import scrapy


class urlsitem(scrapy.Item):
    status = scrapy.Field()
    url = scrapy.Field()
    canonical = scrapy.Field()


class URLSpider(scrapy.Spider):
    name = "urls"
    # Let 301 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [301]
    # Settings belong in custom_settings; a bare class attribute is ignored
    custom_settings = {'REDIRECT_ENABLED': False}
    # data is a plain list here, so it has no .iloc; use it directly
    data = ['https://www.wayfair.com/bed-bath/sb0/bedding-c481592.html']
    start_urls = data

    # parse must be a method of the spider class
    def parse(self, response):
        item = urlsitem()
        item['status'] = response.status
        item['url'] = response.url
        # .extract() returns a list of every matching href
        item['canonical'] = response.xpath("//link[@rel='canonical' and @href]/@href").extract()
        yield item

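2 Answers:

Answer 0 (score: 0)

I don't quite understand the part about "whether the URL link in the tag matches the URL of the website". If you are trying to create a column in the .csv file that indicates whether the URL found is the same as response.url, you can do the following:

  • Create another binary field named is_match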

is_match = scrapy.Field()

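  • Set it to 1 if the URL and the canonical are the same, otherwise 0

    # item['canonical'] is the list returned by .extract(), so test membership rather than equality
    item['is_match'] = 1 if response.url in item['canonical'] else 0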

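You can achieve the same behavior with an if-else block, but this is more elegant. This is called the ternary operator (in Python, a conditional expression). You can check this page for more details.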
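To make the comparison concrete, here is a minimal sketch of the two forms side by side. The function names and the example.com URLs are made up for illustration; the only assumption carried over from the question is that the canonical hrefs arrive as a list, as .extract() returns them.

```python
def match_flag(canonical_links, page_url):
    """Return 1 if page_url is among the canonical links, else 0."""
    # Conditional expression: <value if true> if <condition> else <value if false>
    return 1 if page_url in canonical_links else 0


def match_flag_verbose(canonical_links, page_url):
    """Same result as match_flag, written as an if-else block."""
    if page_url in canonical_links:
        return 1
    return 0


links = ["https://example.com/bedding"]  # made-up canonical hrefs
print(match_flag(links, "https://example.com/bedding"))
print(match_flag_verbose(links, "https://example.com/other"))
```

Both functions give the same answer for any input; the one-line form is just shorter when the two branches are simple values.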

If you only want to scrape the matching URLs, you can put an if block in the parse method.

# Run the XPath once; .extract() returns a list, so compare by membership
canonical = response.xpath("//link[@rel='canonical' and @href]/@href").extract()
if response.url in canonical:
    item = urlsitem()
    item['status'] = response.status
    item['url'] = response.url
    item['canonical'] = canonical
    yield item
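An exact string comparison can miss matches when the canonical href is relative or differs only by a trailing slash or host casing. The sketch below is one possible way to loosen the check using only the standard library; normalize and canonical_matches are hypothetical helper names, and the normalization rules (lower-case host, drop trailing slash and fragment) are assumptions you may want to adjust for your site.

```python
from urllib.parse import urljoin, urlsplit


def normalize(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def canonical_matches(page_url, canonical_hrefs):
    """Return True if any canonical href points at page_url."""
    target = normalize(page_url)
    # urljoin resolves relative hrefs such as "/bedding" against the page URL
    return any(
        normalize(urljoin(page_url, href)) == target
        for href in canonical_hrefs
    )
```

Inside parse, this could be called as canonical_matches(response.url, item['canonical']) in place of the plain membership test.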

Answer 1 (score: 0)

import scrapy


class urlsitem(scrapy.Item):
    status = scrapy.Field()
    url = scrapy.Field()
    canonical = scrapy.Field()
    is_matched = scrapy.Field()


class URLSpider(scrapy.Spider):
    name = "urls"
    # Let 301 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [301]
    # Settings belong in custom_settings; a bare class attribute is ignored
    custom_settings = {'REDIRECT_ENABLED': False}
    # data is a plain list here, so it has no .iloc; use it directly
    data = ['https://www.wayfair.com/bed-bath/sb0/bedding-c481592.html']
    start_urls = data

    # parse must be a method of the spider class
    def parse(self, response):
        your_tag = 'XXX'  # placeholder: the text you expect to find in the URL
        item = urlsitem()
        item['status'] = response.status
        item['url'] = response.url
        item['canonical'] = response.xpath("//link[@rel='canonical' and @href]/@href").extract()
        # "in" already evaluates to True or False
        item['is_matched'] = your_tag in response.url
        yield item
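The question also asks for CSV output, which neither answer shows. The following is a rough sketch, assuming Scrapy 2.1 or later (where the FEEDS setting is available) and a spider class like URLSpider above. The file name urls.csv, the FEED_SETTINGS name, and the run_spider helper are made up for illustration.

```python
# Feed export settings: write every yielded item to urls.csv,
# with the columns in a fixed order (is_matched as the last column).
FEED_SETTINGS = {
    "FEEDS": {
        "urls.csv": {
            "format": "csv",
            "fields": ["url", "status", "canonical", "is_matched"],
        },
    },
}


def run_spider(spider_cls):
    """Run spider_cls in-process and export its items using FEED_SETTINGS."""
    # Imported here so the settings above can be inspected without Scrapy installed
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=FEED_SETTINGS)
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes
```

Calling run_spider(URLSpider) from a script should then produce urls.csv with one row per crawled URL. On older Scrapy versions the FEED_URI and FEED_FORMAT settings played the same role.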