Using Scrapy, I am struggling to achieve a specific output. I am trying to scrape financial documents from the Securities and Exchange Commission. The task can be summarised as follows: follow the document link, then follow the 10-K .txt link (described as the "Complete submission text file"), and store the documents in the documents field. Using this procedure, the format I am trying to achieve is shown below. Implementing multiple field insertions like this is not a problem when scraping a single page, but I run into problems when an arbitrary number of items is scraped across multiple pages.
[{cik_number : company_index_1,
  documents : [document_1,
               ...
               document_n]},
 {cik_number : company_index_2,
  documents : [document_1,
               ...
               document_n]}
]
I am currently getting the output below (shortened for convenience, but it demonstrates the problem). How can I append the documents to a single item instead of creating multiple items for the same company? I am not sure how to handle multiple insertions into the same field.
[{'cik': ['1011290'],
  'documents': [document_1]},
 ...
 {'cik': ['1011290'],
  'documents': [document_n]}
]
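One way to see the relationship between this output and the desired one is to merge the duplicated items after the crawl finishes. A minimal post-processing sketch (plain Python, not part of the spider; the function name is hypothetical) that groups items by cik and concatenates their document lists:

```python
from collections import defaultdict

def merge_items(items):
    """Group scraped items by their cik value and merge the document lists."""
    grouped = defaultdict(list)
    for item in items:
        # cik comes back as a one-element list from the default ItemLoader
        cik = item['cik'][0] if isinstance(item['cik'], list) else item['cik']
        grouped[cik].extend(item['documents'])
    return [{'cik': cik, 'documents': docs} for cik, docs in grouped.items()]

scraped = [
    {'cik': ['1011290'], 'documents': ['document_1']},
    {'cik': ['1011290'], 'documents': ['document_2']},
]
print(merge_items(scraped))
# [{'cik': '1011290', 'documents': ['document_1', 'document_2']}]
```

This only papers over the problem in a separate pass, though; the question is how to build the merged item during the crawl itself.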
The code used to generate this output is given below. Purely for testing purposes, I simply add the response.url value to documents to keep the output simple.
#spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from document_scraper.items import CikItem

class MySpider(scrapy.Spider):
    name = 'demo'
    start_urls = ["https://www.sec.gov/divisions/corpfin/organization/cfia-123.htm"]
    custom_settings = {
        'DOWNLOAD_DELAY' : 0.25,
        'FEED_FORMAT' : 'json',
        'FEED_URI' : 'item.json'
    }

    def parse(self, response):
        for sel in response.xpath('(//*[@id="cos"]//tr)[last()]'):
            loader = ItemLoader(item = CikItem(), response = response)
            loader.add_value('cik', sel.xpath("td[2]//text()").extract_first())
            yield response.follow(
                'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=10-K' \
                '&dateb=&owner=exclude&count=40'.format(sel.xpath("td[2]//text()").extract_first()),
                callback = self.index,
                meta = {"item" : loader.load_item()}
            )

    def index(self, response):
        document_buttons = response.xpath('//*[@id="documentsbutton"]/@href')
        for _url in document_buttons.extract():
            yield response.follow(
                _url,
                callback=self.parse_page_two,
                meta={"item": response.meta['item']}
            )

    def parse_page_two(self, response):
        filing_txt = response.xpath('(//*[contains(@href, ".txt")])[last()]/@href')
        for _url in filing_txt.extract():
            yield response.follow(
                _url,
                callback=self.parse_page_three,
                meta={"item": response.meta['item']}
            )

    def parse_page_three(self, response):
        next_loader = ItemLoader(item = response.meta['item'], response = response)
        next_loader.add_value('documents', response.url)
        yield next_loader.load_item()
The items.py file is as follows.
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, Identity

class CikItem(Item):
    cik = Field()
    documents = Field()
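Note that TakeFirst and Identity are imported here but never attached to the fields, which is why cik comes out as the one-element list ['1011290']: by default an ItemLoader accumulates every added value into a list. The behaviour of the two processors can be illustrated without Scrapy (plain-Python stand-ins, not the library classes themselves):

```python
# Stand-ins mirroring scrapy's TakeFirst and Identity output processors
def take_first(values):
    """Return the first non-null, non-empty value from the collected list."""
    for v in values:
        if v is not None and v != '':
            return v

def identity(values):
    """Return the collected list unchanged."""
    return values

collected = ['1011290']        # what the loader accumulates for 'cik'
print(take_first(collected))   # '1011290' - a scalar instead of a list
print(identity(collected))     # ['1011290']
```

If the scalar form is wanted, Scrapy supports attaching a processor in the field definition, e.g. cik = Field(output_processor=TakeFirst()).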
UPDATE

I have managed to put together a piece of code that achieves the desired output. The spider.py file now reads as follows.
class MySpider(scrapy.Spider):
    name = 'demo'
    start_urls = ["https://www.sec.gov/divisions/corpfin/organization/cfia-123.htm"]
    custom_settings = {
        'DOWNLOAD_DELAY' : 0.25,
        'FEED_FORMAT' : 'json',
        'FEED_URI' : 'item.json'
    }

    def parse(self, response):
        for sel in response.xpath('(//*[@id="cos"]//tr)[last()]'):
            item = CikItem()
            item['cik'] = sel.xpath("td[2]//text()").extract_first()
            item['documents'] = []
            yield response.follow(
                'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=10-K' \
                '&dateb=&owner=exclude&count=40'.format(sel.xpath("td[2]//text()").extract_first()),
                callback = self.index,
                meta = {'item' : item}
            )

    def index(self, response):
        document_buttons = response.xpath('//*[@id="documentsbutton"]/@href')
        for _url in document_buttons.extract():
            yield response.follow(
                _url,
                callback=self.parse_page_two,
                meta={'item': response.meta['item']}
            )

    def parse_page_two(self, response):
        filing_txt = response.xpath('(//*[contains(@href, ".txt")])[last()]/@href')
        for _url in filing_txt.extract():
            yield response.follow(
                _url,
                callback=self.parse_page_three,
                meta={'item': response.meta['item']}
            )

    def parse_page_three(self, response):
        item = response.meta['item']
        item['documents'].append(response.url)
        if len(self.crawler.engine.slot.inprogress) == 1:
            return item
I would be keen to know how this code can be improved, and whether ItemLoader() can be used to perform this task.
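One concern with the version above is that len(self.crawler.engine.slot.inprogress) == 1 inspects Scrapy internals, so it may break across versions and only works for a single company crawled at a time. An alternative sometimes suggested is to carry an explicit pending-request counter with each item and emit the item only when the counter reaches zero. A minimal plain-Python sketch of that bookkeeping (the Scrapy plumbing is omitted; the function and the 'pending' key are hypothetical):

```python
def complete_item(item, finished_urls):
    """Record finished URLs; return the item only once nothing is pending."""
    for url in finished_urls:
        item['documents'].append(url)
        item['pending'] -= 1          # one fewer outstanding request
    return item if item['pending'] == 0 else None

item = {'cik': '1011290', 'documents': [], 'pending': 2}
# First response arrives: one request is still outstanding, nothing is yielded.
first = complete_item(item, ['https://www.sec.gov/a.txt'])
# Second response arrives: the counter hits zero and the full item is returned.
done = complete_item(item, ['https://www.sec.gov/b.txt'])
```

In a spider this would mean setting item['pending'] to the number of document requests scheduled in index() and decrementing it in parse_page_three(), yielding the item only on the last callback.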