Question

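I am trying to use a single Scrapy spider file to fetch data from multiple start URLs. My goal is to build the URLs by changing a specific ID in the web address and to run the spider over the IDs in order. All IDs are stored in a CSV file. The formal name of my ID is CIK. For simplicity, I list two CIKs here (the original file has about 19,000 CIKs).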

1326801

320193

So the manually constructed URLs should look like this:

https://www.secform4.com/insider-trading/1326801-0.htm

https://www.secform4.com/insider-trading/320193-0.htm

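My question is: how do I import the CIKs stored in the CSV file, have the Scrapy spider build the start URLs from them, and run the resulting URLs sequentially?

In addition, some of these CIKs have no data on this site. How can I tell the spider to ignore constructed URLs that are unavailable?

I am just a beginner. If possible, please suggest specific changes to my code (specific code would be much appreciated). Thank you in advance.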

import scrapy
class InsiderSpider(scrapy.Spider):
    name = 'insider'
    cik = 320193
    allowed_domains = ['www.secform4.com']
    start_urls = ['https://www.secform4.com/insider-trading/'+ str(cik) +'-0.htm']

Answer 1

You can write all the URLs into start_urls, but that is not best practice.

Use start_requests instead:

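from scrapy import Request, Spider


class MySpider(Spider):
    name = 'csv'

    def start_requests(self):
        # Expects one full URL per line in file.csv.
        with open('file.csv') as f:
            for line in f:
                url = line.strip()  # drop the trailing newline
                if not url:
                    continue  # skip blank lines
                yield Request(url)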

This approach is shown in: How to loop through multiple URLs to scrape from a CSV file in Scrapy?

Answer 2

df = '1326801', '320193'  # the CIKs; in practice these would be read from the CSV file
urls = ['https://www.secform4.com/insider-trading/' + str(i) + '-0.htm' for i in df]
print(urls)
# Output:
# ['https://www.secform4.com/insider-trading/1326801-0.htm', 'https://www.secform4.com/insider-trading/320193-0.htm']
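To feed this list comprehension from the CSV file rather than a hard-coded tuple, something like the following sketch might work. It assumes the CIKs sit in the first column and that any header row is non-numeric. `read_ciks` and `build_urls` are hypothetical helper names, and the in-memory `sample` stands in for the real file.

```python
import csv
import io

BASE_URL = "https://www.secform4.com/insider-trading/"


def read_ciks(lines):
    """Return CIKs from the first column, skipping blank and non-numeric rows."""
    ciks = []
    for row in csv.reader(lines):
        if row and row[0].strip().isdigit():
            ciks.append(row[0].strip())
    return ciks


def build_urls(ciks):
    """Turn each CIK into the corresponding insider-trading page URL."""
    return [f"{BASE_URL}{cik}-0.htm" for cik in ciks]


# In-memory stand-in for the CSV file; with a real file, pass
# open("ciks.csv", newline="") to read_ciks instead.
sample = io.StringIO("CIK\n1326801\n\n320193\n")
urls = build_urls(read_ciks(sample))
print(urls)
```

The resulting list could then be assigned to the spider's `start_urls`, although for about 19,000 CIKs generating requests lazily in `start_requests` avoids building the whole list up front.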

Scraping data by manually creating multiple URLs from data in a CSV file
