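I want to scrape a number of websites of a similar nature (50+), but the information on each site differs slightly. Eventually this will feed into a database, but I'd also like to export to CSV.
So what I'm trying to do is set up a template to use on every site: basically, define all the possible fields to scrape. However, not every field applies to every site. In those cases, I'd like the data to be scraped in the same format as the "perfect site" (the one with all the fields!), just with blanks for the fields that don't apply.
Here are my files: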
Items.py
import scrapy

class ProjectItem(scrapy.Item):
    name = scrapy.Field()
    address1 = scrapy.Field()
    address2 = scrapy.Field()
    address3 = scrapy.Field()
    city = scrapy.Field()
    random1 = scrapy.Field()
    random2 = scrapy.Field()
    random3 = scrapy.Field()
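To illustrate the "blank when not applicable" idea, here is a minimal sketch that uses only the standard library. It assumes a dataclass stands in for the item; `ProjectRecord` and the sample values are hypothetical names for illustration, not part of the project above. Recent Scrapy versions (2.2+) also accept dataclass objects as items, but treat this as one possible approach rather than the required one.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class ProjectRecord:
    # Every template field defaults to an empty string, so a site that
    # lacks a field still produces a record with the full set of columns.
    name: str = ""
    address1: str = ""
    address2: str = ""
    address3: str = ""
    city: str = ""
    random1: str = ""
    random2: str = ""
    random3: str = ""


# A hypothetical site that only exposes a name and a city.
partial = ProjectRecord(name="Example Ltd", city="Leeds")

# The dict form always carries all eight keys, blank where nothing was scraped.
row = asdict(partial)

# Field names in declaration order, usable as a fixed column list.
column_names = [f.name for f in fields(ProjectRecord)]
```

Because the defaults live in the class, a spider for a sparse site only sets the fields it can actually find.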
Spider.py
from scrapy.spiders import Spider
from Projct.items import ProjectItem
from scrapy.http import Request

class MySpider(Spider):
    name = "ENTER NAME"
    allowed_domains = ["ENTER ALLOWED DOMAINS"]
    start_urls = ["ENTER START URL"]

    def parse(self, response):
        ''' Selectors '''
        name = response.xpath('ENTER XPATH').extract()
        address1 = response.xpath('').extract()
        address2 = response.xpath('').extract()
        address3 = response.xpath('').extract()
        city = response.xpath('').extract()
        random1 = response.xpath('').extract()
        random2 = response.xpath('').extract()
        random3 = response.xpath('').extract()

        ''' Items '''
        # Named htmls so the loop below iterates over a defined variable.
        htmls = response.xpath('ENTER ALL HTML XPATH').extract()
        for html in htmls:
            item = ProjectItem()
            item["name"] = name
            item["address1"] = address1
            item["address2"] = address2
            item["address3"] = address3
            item["city"] = city
            item["random1"] = random1
            item["random2"] = random2
            item["random3"] = random3
            yield item
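For the CSV side, the following is a rough sketch of the output I have in mind: every row has the same columns in the same order, and fields a site doesn't have come out blank. It uses only the standard library's `csv.DictWriter` with `restval=""`; the `FIELDS` list, the `write_rows` helper, and the sample records are made up for illustration.

```python
import csv
import io

# Fixed column order taken from the template's field names.
FIELDS = ["name", "address1", "address2", "address3",
          "city", "random1", "random2", "random3"]


def write_rows(records, out):
    """Write records as CSV, leaving blanks for any field a record lacks."""
    # restval is the value written for keys missing from a record.
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Hypothetical scraped records: one complete, one from a sparse site.
records = [
    {"name": "Full Site Ltd", "address1": "1 High St", "address2": "Unit 2",
     "address3": "West Park", "city": "York", "random1": "a",
     "random2": "b", "random3": "c"},
    {"name": "Sparse Site", "city": "Leeds"},
]

buffer = io.StringIO()
write_rows(records, buffer)
csv_text = buffer.getvalue()
```

Within Scrapy itself, the `FEED_EXPORT_FIELDS` setting can be used to pin the exported columns and their order in a similar way.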
The three main problems I'm facing right now:
Hopefully this makes sense; any other general comments on this would be much appreciated!