Although my simple code seems fine according to the official document, it generates unexpectedly duplicated results, as shown below. My code works fine when I set only 1 URL. I also tried the answer solution in this SO question, but it did not solve my problem.
[Scrapy command]
$ scrapy crawl test -o test.csv
[Scrapy spider:test.py]
import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'

    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        for url in self.start_urls:
            table_rows = response.xpath('//table/tbody/tr')
            for table_row in table_rows:
                item = TestItem()
                item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
                item['test_02'] = table_row.xpath('td[2]/text()').extract_first()
                yield item
[Target HTML: test1.html, test2.html, test3.html]
<html>
  <head>
    <title>test2</title> <!-- Same as the file name -->
  </head>
  <body>
    <table>
      <tbody>
        <tr>
          <td>test2 A1</td> <!-- Same as the file name -->
          <td>test2 B1</td> <!-- Same as the file name -->
        </tr>
      </tbody>
    </table>
  </body>
</html>
[Generated CSV results for 3 URLs]
test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1
test2 A1,test2 B1
test3 A1,test3 B1
test3 A1,test3 B1
test3 A1,test3 B1
[Expected results for 3 URLs]
test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1
test3 A1,test3 B1
[Generated CSV results for 2 URLs]
test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1
[Expected results for 2 URLs]
test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1
Answer 0 (score: 1)
You are iterating over start_urls again. You don't need to do that, because Scrapy already does it for you, so right now you are looping over start_urls twice. Try this instead:
import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'

    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        table_rows = response.xpath('//table/tbody/tr')
        for table_row in table_rows:
            item = TestItem()
            item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
            item['test_02'] = table_row.xpath('td[2]/text()').extract_first()
            yield item
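
For context, here is why the extra loop duplicates rows: Scrapy turns every entry in start_urls into its own request and calls parse() once per downloaded response. Its default behavior is roughly equivalent to the start_requests override below (a simplified sketch, not the exact Scrapy source; SketchSpider is just an illustrative name reusing the question's URLs):

import scrapy

class SketchSpider(scrapy.Spider):
    name = 'sketch'

    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def start_requests(self):
        # Roughly what Scrapy does by default: schedule one Request per
        # start URL, each dispatched to parse() when its response arrives.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # With 3 start URLs, this method already runs 3 times (once per
        # response). An extra `for url in self.start_urls:` loop inside it
        # re-extracts the same rows 3 more times per response.
        pass

So with N start URLs, parse() runs N times, and the inner loop multiplies each response's rows by N: 3 URLs produced 3 x 3 = 9 rows instead of 3, and 2 URLs produced 2 x 2 = 4 rows instead of 2, exactly matching the CSV output above.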