Scrapy: multiple `start_urls` produce duplicated results

Asked: 2018-11-17 04:25:24

Tags: python scrapy

Although my simple code looks fine according to the official document, it generates unexpectedly duplicated results, for example:

  • 9 rows/results when 3 URLs are set
  • 4 rows/results when 2 URLs are set

My code works fine when I set only 1 URL. I also tried the answer solution in this SO question, but it did not solve my problem.

[Scrapy command]

$ scrapy crawl test -o test.csv

[Scrapy spider: test.py]

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        for url in self.start_urls:
            table_rows = response.xpath('//table/tbody/tr')

            for table_row in table_rows:
                item = TestItem()
                item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
                item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

                yield item

[Target HTML: test1.html, test2.html, test3.html]

<html>
<head>
  <title>test2</title> <!-- Same as the file name  -->
</head>
  <body>
    <table>
        <tbody>
            <tr>
                <td>test2 A1</td> <!-- Same as the file name  -->
                <td>test2 B1</td> <!-- Same as the file name  -->
            </tr>
        </tbody>
    </table>
  </body>
</html>

[Resulting CSV for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1
test2 A1,test2 B1
test3 A1,test3 B1
test3 A1,test3 B1
test3 A1,test3 B1

[Expected result for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1
test3 A1,test3 B1

[Resulting CSV for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1

[Expected result for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1

1 Answer:

Answer 0 (score: 1)

You are iterating over start_urls again inside parse(), which you don't need to do because Scrapy already requests each start URL for you and calls parse() once per response. With the extra loop, the extraction runs once per URL in the list for every response, so the output is multiplied: 3 URLs yield 3 × 3 = 9 rows, and 2 URLs yield 2 × 2 = 4 rows.
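The duplication can be reproduced without Scrapy at all. This is a minimal sketch of the control flow (the file names and the single-row pages are taken from the question; no Scrapy API is used):

```python
# Minimal sketch of why the original spider duplicates items.
# Scrapy itself calls parse() once per start URL; the redundant
# `for url in self.start_urls` loop inside parse() then repeats
# the extraction once more for every URL in the list.
start_urls = ['test1.html', 'test2.html', 'test3.html']
rows_per_page = 1  # each sample page contains a single <tr>

items = []
for response in start_urls:      # done by the framework: one parse() call per response
    for url in start_urls:       # the redundant loop inside parse()
        for _ in range(rows_per_page):
            items.append(response)

print(len(items))                # 9 items for 3 URLs instead of the expected 3
```

Dropping the inner loop leaves one pass per response, i.e. 3 items for 3 URLs.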

Try this instead:

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        table_rows = response.xpath('//table/tbody/tr')

        for table_row in table_rows:
            item = TestItem()
            item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
            item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

            yield item