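I'm trying to scrape the tables on the page at http://www.abyznewslinks.com/allco.htm, but I've hit a dead end and would like to ask more experienced people how I should go about it. This is the code I've managed to write so far: https://pastebin.com/zZMfxSeR. I need to scrape these fields as columns in the output CSV: country_region, media_name, media_url, media_type, media_focus, language, media_format. Right now all the elements of a column end up in a single cell, separated by commas, instead of each element going on its own row, which is what I'm after. Should I iterate by column first, or do it some other way?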
import scrapy

# AbyzItem is the project's Item subclass (defined in the project's items module).


class AbyzrowbyrowSpider(scrapy.Spider):
    name = 'abyziter'
    allowed_domains = ['abyznewslinks.com']
    start_urls = ['http://www.abyznewslinks.com/argen.htm']

    def parse(self, response):
        # Every table row inside the divs after the fifth and before the last one
        table = response.xpath("(//div)[position()>5 and position()<last()]//table//tr")
        for row in table:
            item = AbyzItem()
            # Page-level fields, identical for every row
            item['country'] = response.xpath("/html/body/div[3]/table//td//font/text()[last()]").getall()
            item['continent'] = response.xpath("/html/body/div[3]//a[2]/text()").getall()
            item['region'] = response.xpath("/html/body/div[3]//a[3]/text()").getall()
            # Row-level fields: getall() returns a list with one entry per matched node
            item['country_region'] = row.xpath("td[1]/font/text()").getall()
            item['media_url'] = row.xpath("td[2]/font/a/@href").getall()
            item['media_name'] = row.xpath("td[2]/font/a/text()").getall()
            item['media_type'] = row.xpath("td[3]//font/text()").getall()
            item['media_focus'] = row.xpath("td[4]//font/text()").getall()
            item['language'] = row.xpath("td[5]//font/text()").getall()
            item['media_format'] = row.xpath("td[6]//font/text()").getall()
            yield item
Answer 0 (score: 0)
Well, you can check the values before assigning them, like this:
country = response.xpath("/html/body/div[3]/table//td//font/text()[last()]").getall()
continent = response.xpath("/html/body/div[3]//a[2]/text()").getall()
region = response.xpath("/html/body/div[3]//a[3]/text()").getall()
country_region = row.xpath("td[1]/font/text()").getall()
media_url = row.xpath("td[2]/font/a/@href").getall()
media_name = row.xpath("td[2]/font/a/text()").getall()
media_type = row.xpath("td[3]//font/text()").getall()
media_focus = row.xpath("td[4]//font/text()").getall()
language = row.xpath("td[5]//font/text()").getall()
media_format = row.xpath("td[6]//font/text()").getall()
item['country'] = country if country else ''
item['continent'] = continent if continent else ''
item['region'] = region if region else ''
item['country_region'] = country_region if country_region else ''
item['media_url'] = media_url if media_url else ''
item['media_name'] = media_name if media_name else ''
item['media_type'] = media_type if media_type else ''
item['media_focus'] = media_focus if media_focus else ''
item['language'] = language if language else ''
item['media_format'] = media_format if media_format else ''
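Checking for empty values avoids missing fields, but each field still holds a whole list, so the CSV will keep showing comma-separated cells. One possible way to get one record per entry is to line the per-column lists up by position and emit a dict for each position. The sketch below is only an illustration under assumptions: the helper `explode_row` and the sample values are hypothetical, and it assumes that the n-th text node in each cell belongs to the same media outlet, which may not hold for every row on the site.

```python
from itertools import zip_longest


def explode_row(columns, shared=None, fill=""):
    """Turn one table row whose cells hold several values into one record per value.

    columns: dict mapping field name -> list of strings (e.g. the result of getall()).
    shared:  dict of page-level fields copied into every record.
    fill:    value used when a column has fewer entries than the longest one.
    """
    shared = shared or {}
    names = list(columns)
    # Strip whitespace and drop text nodes that are empty after stripping.
    cleaned = [[v.strip() for v in columns[n] if v.strip()] for n in names]
    records = []
    # zip_longest pairs values by position and pads the shorter columns with `fill`.
    for values in zip_longest(*cleaned, fillvalue=fill):
        record = dict(shared)
        record.update(zip(names, values))
        records.append(record)
    return records


# Hypothetical sample data standing in for the lists returned by row.xpath(...).getall()
row = {
    "media_name": ["Clarin\n", "La Nacion\n"],
    "media_type": ["NP", "NP"],
    "language": ["SPA"],
}
records = explode_row(row, shared={"country": "Argentina"})
for record in records:
    print(record)
```

Inside `parse`, the same idea would be to build the `columns` dict from the `row.xpath(...).getall()` calls and then `yield` each record in turn. Scrapy accepts plain dicts as items, so each record becomes one line in the CSV export.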