Question

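I'm trying to scrape the tables on the page at http://www.abyznewslinks.com/allco.htm, but I've hit a dead end and would like to ask more experienced people how to approach it. This is the code I've managed to write: https://pastebin.com/zZMfxSeR. I need to scrape these fields as columns in the output CSV: country_region, media_name, media_url, media_type, media_focus, language, media_format. Right now, all the elements listed in a column end up in a single cell, separated by commas, instead of each element being split into its own row, which is my goal. Should I iterate by column first, or in some other way?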

import scrapy

# AbyzItem is assumed to be the Item class defined in the project's items.py


class AbyzrowbyrowSpider(scrapy.Spider):
    name = 'abyziter'
    allowed_domains = ['abyznewslinks.com']
    start_urls = ['http://www.abyznewslinks.com/argen.htm']

    def parse(self, response):
        table = response.xpath("(//div)[position()>5 and position()<last()]//table//tr")
        for row in table:
            item = AbyzItem()
            item['country'] = response.xpath("/html/body/div[3]/table//td//font/text()[last()]").getall()
            item['continent'] = response.xpath("/html/body/div[3]//a[2]/text()").getall()
            item['region'] = response.xpath("/html/body/div[3]//a[3]/text()").getall()
            item['country_region'] = row.xpath("td[1]/font/text()").getall()
            item['media_url'] = row.xpath("td[2]/font/a/@href").getall()
            item['media_name'] = row.xpath("td[2]/font/a/text()").getall()
            item['media_type'] = row.xpath("td[3]//font/text()").getall()
            item['media_focus'] = row.xpath("td[4]//font/text()").getall()
            item['language'] = row.xpath("td[5]//font/text()").getall()
            item['media_format'] = row.xpath("td[6]//font/text()").getall()
            yield item

Answer 1

Well, you can check each value before assigning it, like this:

country = response.xpath("/html/body/div[3]/table//td//font/text()[last()]").getall()
continent = response.xpath("/html/body/div[3]//a[2]/text()").getall()
region = response.xpath("/html/body/div[3]//a[3]/text()").getall()
country_region = row.xpath("td[1]/font/text()").getall()
media_url = row.xpath("td[2]/font/a/@href").getall()
media_name = row.xpath("td[2]/font/a/text()").getall()
media_type = row.xpath("td[3]//font/text()").getall()
media_focus = row.xpath("td[4]//font/text()").getall()
language = row.xpath("td[5]//font/text()").getall()
media_format = row.xpath("td[6]//font/text()").getall()
item['country'] = country if country else ''
item['continent'] = continent if continent else ''
item['region'] = region if region else ''
item['country_region'] = country_region if country_region else ''
item['media_url'] = media_url if media_url else ''
item['media_name'] = media_name if media_name else ''
item['media_type'] = media_type if media_type else ''
item['media_focus'] = media_focus if media_focus else ''
item['language'] = language if language else ''
item['media_format'] = media_format if media_format else ''
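
To get one CSV row per media entry instead of comma-joined lists in a single cell, one option is to zip the per-column lists together after extracting them. The following is only a sketch, not a tested solution for this site: `split_row` and `FIELDS` are hypothetical names, and it assumes the lists returned by `getall()` for a given table row are positionally aligned (the first name goes with the first URL, the first type, and so on). Dropping blank strings could break that alignment if a cell has a genuinely empty line, so check the output against the page.

```python
from itertools import zip_longest

# Columns that hold one value per media entry within a table row.
FIELDS = ["media_name", "media_url", "media_type",
          "media_focus", "language", "media_format"]


def split_row(columns, shared=None):
    """Yield one dict per media entry from a row of parallel lists.

    columns: dict mapping a field name to the list extracted with getall().
    shared:  values repeated on every output row (country, region, ...).
    """
    shared = shared or {}
    # Strip whitespace and drop empty strings left over from the HTML.
    cleaned = [[v.strip() for v in columns.get(f, []) if v.strip()]
               for f in FIELDS]
    # zip_longest pads shorter columns with '' so no entry is lost.
    for values in zip_longest(*cleaned, fillvalue=""):
        record = dict(shared)
        record.update(zip(FIELDS, values))
        yield record
```

Inside `parse`, you could then build the `columns` dict from the `row.xpath(...).getall()` calls and yield an item for each dict produced by `split_row`, rather than yielding one item per `tr`.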
