Extracting data from multiple tables with Scrapy using XPath

Time: 2019-04-25 16:40:54

Tags: xpath scrapy

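I am extracting metadata and URLs from 12 tables on a web page. I have it working, but I am new to both XPath and Scrapy, so is there a more concise way to do this?

When I first tried various XPath expressions I got a lot of duplicated data, and then realised that the rows of every table were being returned again for each table. My solution was to enumerate the tables and loop over them, fetching only the rows of the current table. It feels like there should be a simpler way, but I'm not sure what it is.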

import scrapy

class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):

        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')

        num_tables = len(tables)

        # XPath positions are 1-based, so count from 1 to num_tables inclusive
        for tabno in range(1, num_tables + 1):

            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno

            rows = response.xpath(search_path)


            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()

                yield {
                    'local_authority': local_authority[1],
                    'last_update': last_update[1],
                    'num_businesses': num_businesses[1],
                    'xml_file': xml_file[0],
                    'xml_file_descr': xml_file_descr[1],
                }


I run it with:

scrapy runspider fsa_xpath.py

1 answer:

Answer 0 (score: 2)

You can iterate over the table selectors returned by your first XPath, just as you are already doing with the rows.