What I want to do is scrape company information (thisisavailable.eu.pn/company.html) and add all board members to the company item, linking each of them to the corresponding data from their own pages.
Ideally, the data I would get from the sample page would be:
{
    "company": "Mycompany Ltd",
    "code": "3241234",
    "phone": "2323232",
    "email": "info@mycompany.com",
    "board": {
        "1": {
            "name": "Margaret Sawfish",
            "code": "9999999999"
        },
        "2": {
            "name": "Ralph Pike",
            "code": "222222222"
        }
    }
}
I have searched Google and SO (e.g. here and here, and the Scrapy docs) but couldn't find a solution to the problem.
What I have managed to piece together so far:
items.py:
import scrapy

class company_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
    board = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

class person_item(scrapy.Item):
    name = scrapy.Field()
    code = scrapy.Field()
spiders/example.py:
import scrapy
from proov.items import company_item, person_item

class ExampleSpider(scrapy.Spider):
    name = "example"
    #allowed_domains = ["http://thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        if response.xpath("//table[@id='company']"):
            yield self.parse_company(response)
        elif response.xpath("//table[@id='person']"):
            yield self.parse_person(response)
    def parse_company(self, response):
        Company = company_item()
        Company['name'] = response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first()
        Company['code'] = response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first()
        board = []
        for person_row in response.xpath("//table[@id='board']/tbody/tr/td[1]"):
            Person = person_item()
            Person['name'] = person_row.xpath("a/text()").extract()
            print(person_row.xpath("a/@href").extract_first())
            request = scrapy.Request('http://thisisavailable.eu.pn/' + person_row.xpath("a/@href").extract_first(), callback=self.parse_person)
            request.meta['Person'] = Person
            return request
            board.append(Person)
        Company['board'] = board
        return Company
    def parse_person(self, response):
        print('PERSON!!!!!!!!!!!')
        print(response.meta)
        Person = response.meta['Person']
        Person['name'] = response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first()
        Person['code'] = response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first()
        yield Person
UPDATE: As Rafael noticed, the initial problem was that allowed_domains was wrong - I have commented it out for now, and when I run it I now get (asterisks added to URLs because of low rep):
scrapy crawl example
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: proov)
2017-03-07 09:41:12 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'proov.spiders', 'SPIDER_MODULES': ['proov.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'proov'}
2017-03-07 09:41:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-07 09:41:13 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-03-07 09:41:13 [scrapy.core.engine] INFO: Spider opened
2017-03-07 09:41:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-07 09:41:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://*thisisavailable.eu.pn/robots.txt> (referer: None)
2017-03-07 09:41:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/scrapy/company.html> (referer: None)
person.html
person2.html
2017-03-07 09:41:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://*thisisavailable.eu.pn/person2.html> (referer: http://*thisisavailable.eu.pn/company.html)
PERSON!!!!!!!!!!!
2017-03-07 09:41:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://*thisisavailable.eu.pn/person2.html>
{'code': u'222222222', 'name': u'Kaspar K\xe4nnuotsa'}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-07 09:41:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 936,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 7, 7, 41, 15, 571000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 3, 7, 7, 41, 13, 404000)}
2017-03-07 09:41:15 [scrapy.core.engine] INFO: Spider closed (finished)
If run with "-o file.json", the file content is:
[{" code":" 222222222"," name":" Ralph Pike"}]
That gets me a little further, but I still can't figure out how to make it work.
Could someone help me get this working?
Answer 0 (score: 1)
Your problem is not related to having multiple items, even if it might be in the future.
Your problem is explained in the output:

2017-03-06 10:44:33 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'kidplay-wingsuit.c9users.io': <GET http://thisisavailable.eu.pn/scrapy/person2.html>

It means that the spider is crossing to a domain outside of the allowed_domains list.
Your allowed_domains value is wrong. It should be:
allowed_domains = ["thisisavailable.eu.pn"]
Note:
Instead of using a different item for Person, just use it as a field in Company and assign a dict or list to it while scraping, as in the sketch below.
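A minimal sketch of that approach (untested, and assuming the table ids from the question: company, board and person): collect the board-member links up front, then chain one request at a time through meta, and yield the finished company dict only after the last person page has been parsed.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["thisisavailable.eu.pn"]
    start_urls = ['http://thisisavailable.eu.pn/company.html']

    def parse(self, response):
        # One plain dict per company; board starts empty and is filled
        # in as the person pages come back.
        company = {
            'name': response.xpath("//table[@id='company']/tbody/tr[1]/td[2]/text()").extract_first(),
            'code': response.xpath("//table[@id='company']/tbody/tr[2]/td[2]/text()").extract_first(),
            'board': [],
        }
        links = response.xpath("//table[@id='board']/tbody/tr/td[1]/a/@href").extract()
        if not links:
            # No board members at all: the item is already complete.
            yield company
        else:
            # Visit the first person page; the remaining links travel
            # along in meta so the chain can continue.
            request = scrapy.Request(response.urljoin(links.pop()), callback=self.parse_person)
            request.meta['company'] = company
            request.meta['links'] = links
            yield request

    def parse_person(self, response):
        company = response.meta['company']
        links = response.meta['links']
        company['board'].append({
            'name': response.xpath("//table[@id='person']/tbody/tr[1]/td[2]/text()").extract_first(),
            'code': response.xpath("//table[@id='person']/tbody/tr[2]/td[2]/text()").extract_first(),
        })
        if links:
            # More board members left: keep chaining.
            request = scrapy.Request(response.urljoin(links.pop()), callback=self.parse_person)
            request.meta['company'] = company
            request.meta['links'] = links
            yield request
        else:
            # Last person parsed: the company item is complete.
            yield company

Chaining the requests serializes the person pages, which is slower than letting Scrapy fetch them in parallel, but it avoids the bookkeeping of deciding when all the parallel responses for one company have arrived.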