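I'm extracting metadata and URLs from 12 tables on a web page. I have it working, but I'm very new to XPath and Scrapy, so is there a more concise way to do this?
While trying various XPaths I initially got a lot of duplicated data, and then realised that every row of every table was being repeated for each table. My solution was to enumerate the tables and loop over them, fetching only the rows of the current table. It feels like there should be a simpler way, but I'm not sure what it is.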
import scrapy


class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):
        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')
        num_tables = len(tables)
        # XPath positions are 1-based, so count from 1 to num_tables
        for tabno in range(1, num_tables + 1):
            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno
            rows = response.xpath(search_path)
            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()
                yield {
                    'local_authority': local_authority[1],
                    'last_update': last_update[1],
                    'num_businesses': num_businesses[1],
                    'xml_file': xml_file[0],
                    'xml_file_descr': xml_file_descr[1],
                }
I run it with:
scrapy runspider fsa_xpath.py
Answer 0 (score: 2)
You can iterate over the table selectors returned by your first XPath:
    tables = response.xpath('//*[@id="openDataStatic"]//table')
    for table in tables:
        # './tr' is relative to the current table, so only its own rows match
        rows = table.xpath('./tr')
        for row in rows:
            local_authority = row.xpath('td[1]//text()').extract()
            # extract the remaining fields and yield the item as before
This is the same approach you already use for the rows.