I am using Scrapy to scrape a tag from a given URL and check whether the URL link inside that tag matches the site's URL. I want to export the results to a CSV file with a column indicating whether there is a match.
I have the following code, but I am not sure how to add the matching condition:
import scrapy
import pandas as pd
import csv
from scrapy.crawler import CrawlerProcess


class urlsitem(scrapy.Item):
    status = scrapy.Field()
    url = scrapy.Field()
    canonical = scrapy.Field()


class URLSpider(scrapy.Spider):
    handle_httpstatus_list = [301]  # treat 301 responses as valid instead of errors
    REDIRECT_ENABLED = False
    name = "urls"
    data = ['https://www.wayfair.com/bed-bath/sb0/bedding-c481592.html']
    start_urls = data  # data is a plain list here, so pandas-style data.iloc[0:, 0] is not needed

    def parse(self, response):
        item = urlsitem()
        item['status'] = response.status
        item['url'] = response.url
        item['canonical'] = response.xpath("//link[@rel='canonical' and @href]/@href").extract()
        yield item
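The question also asks for the results to be exported to CSV. Since CrawlerProcess is already imported, a minimal sketch of running the spider with a CSV feed export (assuming the imports and classes from the snippet above, Scrapy 2.1+ for the FEEDS setting, and an illustrative output.csv filename) could look like this:

# Run the spider and write every yielded item to a CSV file.
# 'output.csv' is an example filename; FEEDS requires Scrapy 2.1 or newer.
process = CrawlerProcess(settings={
    'FEEDS': {
        'output.csv': {'format': 'csv'},
    },
})
process.crawl(URLSpider)
process.start()  # blocks until the crawl is finished

Each field declared on urlsitem then becomes a column in output.csv.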
Answer 0 (score: 0)
I don't quite understand the "if the URL link in the tag matches the website's URL" part. If what you are trying to do is create a column in the .csv file indicating whether the URL that was found is the same as response.url, you can do the following.

Add a binary is_match field to the item:

    is_match = scrapy.Field()

Set it to 1 if the url and the canonical are the same, otherwise to 0:

    item['is_match'] = 1 if item['canonical'] == response.url else 0

You could achieve the same behaviour with an if-else block, but this is more elegant. It is called a conditional (ternary) expression; you can see this page for more details.
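For reference, the two equivalent forms side by side, using the same comparison as above:

# Plain if-else block:
if item['canonical'] == response.url:
    item['is_match'] = 1
else:
    item['is_match'] = 0

# The same logic as a one-line conditional (ternary) expression:
item['is_match'] = 1 if item['canonical'] == response.url else 0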
If you only want to scrape the URLs that match, you can put an if block inside the parse method:
if response.url == response.xpath("//link[@rel='canonical' and @href]/@href").extract_first():
    # extract_first() returns a single string (or None); extract() would return a
    # list, which would never compare equal to response.url
    item = urlsitem()
    item['status'] = response.status
    item['url'] = response.url
    item['canonical'] = response.xpath("//link[@rel='canonical' and @href]/@href").extract_first()
    yield item
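Putting the pieces together, a minimal sketch of a parse method that always yields the item plus the is_match column (this assumes the is_match field above has been added to urlsitem; extract_first() is used so the comparison is string-to-string):

    def parse(self, response):
        # A single canonical URL string (or None) that can be compared with response.url.
        canonical = response.xpath("//link[@rel='canonical' and @href]/@href").extract_first()
        item = urlsitem()
        item['status'] = response.status
        item['url'] = response.url
        item['canonical'] = canonical
        # 1 when the canonical link equals the page URL, 0 otherwise.
        item['is_match'] = 1 if canonical == response.url else 0
        yield item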
Answer 1 (score: 0)
import scrapy
import pandas as pd
import csv
from scrapy.crawler import CrawlerProcess


class urlsitem(scrapy.Item):
    status = scrapy.Field()
    url = scrapy.Field()
    canonical = scrapy.Field()
    is_matched = scrapy.Field()


class URLSpider(scrapy.Spider):
    handle_httpstatus_list = [301]
    REDIRECT_ENABLED = False
    name = "urls"
    data = ['https://www.wayfair.com/bed-bath/sb0/bedding-c481592.html']
    start_urls = data  # data is a plain list, so pandas-style data.iloc[0:, 0] is not needed

    def parse(self, response):
        your_tag = 'XXX'  # placeholder: the string to look for in the URL
        item = urlsitem()
        item['status'] = response.status
        item['url'] = response.url
        item['canonical'] = response.xpath("//link[@rel='canonical' and @href]/@href").extract()
        item['is_matched'] = your_tag in response.url  # True/False flag for the CSV column
        yield item
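The data.iloc[0:, 0] indexing in the original snippets suggests the start URLs were meant to come from a pandas DataFrame rather than a hard-coded list. A minimal sketch of that setup (the urls.csv filename and the "no header, one URL per row in the first column" layout are assumptions for illustration):

import pandas as pd

# Read the first column of a CSV file and use it as the spider's start URLs.
data = pd.read_csv('urls.csv', header=None)
start_urls = list(data.iloc[0:, 0])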