Question

I'm trying to automate my HTML table scraping in Scrapy. This is what I have so far:

import scrapy
import pandas as pd

class XGSpider(scrapy.Spider):

    name = 'expectedGoals'

    start_urls = [
        'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
    ]

    def parse(self, response):

        matches = []

        for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):

            match = {
                'home': row.xpath('td[4]//text()').extract_first(),
                'homeXg': row.xpath('td[5]//text()').extract_first(),
                'score': row.xpath('td[6]//text()').extract_first(),
                'awayXg': row.xpath('td[7]//text()').extract_first(),
                'away': row.xpath('td[8]//text()').extract_first()
            }

            matches.append(match)

        x = pd.DataFrame(
            matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])

        yield x.to_csv("xG.csv", sep=",", index=False)

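It works fine, but as you can see, I am hardcoding the keys of the match dict (home, homeXg, etc.). I'd like to scrape the keys into a list automatically and then initialize a dict from that list. The problem is that I don't know how to loop through an XPath by index. For example: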

        headers = []
        for row in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr'):
            yield {
                'first': row.xpath('th[1]/text()').extract_first(),
                'second': row.xpath('th[2]/text()').extract_first()
            }

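Is it possible to put th[1], th[2], th[3], etc. in a for loop, with the numbers as indexes, and have the values appended to a list? For example, something like

row.xpath('th[i]/text()').extract_first()?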

Answer 1

Untested, but it should work:
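A minimal sketch of one way to do this, assuming the header cells sit in the last `thead` row and that each body row has the same number of cells as the header. The helper names (`texts_by_index`, `row_to_dict`, `parse_table`) are made up for illustration and are not part of Scrapy; only `.xpath()` and `.get()` are Scrapy selector calls. The key point is that the index has to be formatted into the XPath string, because a literal `th[i]` means "th elements that contain an `<i>` child".

```python
def texts_by_index(row, tag, count):
    """Return the first text node of row's tag[1] .. tag[count] children.

    `row` is assumed to be a Scrapy/parsel selector for a <tr> element.
    """
    values = []
    for i in range(1, count + 1):  # XPath positions are 1-based
        # Format the number into the query; a literal 'th[i]' would
        # instead match <th> elements that have an <i> child.
        values.append(row.xpath(f'{tag}[{i}]//text()').get())
    return values


def row_to_dict(keys, values):
    """Pair the scraped header names with one row's cell values."""
    return dict(zip(keys, values))


def parse_table(table):
    """Yield one dict per body row, keyed by the scraped header text.

    `table` is assumed to be a selector for the <table> element, e.g.
    response.xpath('//*[@id="sched_ks_3232_1"]') inside the spider's parse().
    """
    header_row = table.xpath('.//thead/tr[last()]')
    n_cols = len(header_row.xpath('th'))
    keys = texts_by_index(header_row, 'th', n_cols)
    for row in table.xpath('.//tbody/tr'):
        # '*' matches both <th> and <td> cells, so body positions
        # line up with the header positions.
        yield row_to_dict(keys, texts_by_index(row, '*', n_cols))
```

Inside `parse` this could be used as `matches = list(parse_table(response.xpath('//*[@id="sched_ks_3232_1"]')))`. If every header cell holds exactly one text node, `header_row.xpath('th/text()').getall()` should give the same list of keys without the index loop.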
