Python Scrapy: get href URLs from page source and export them to a JSON file

Date: 2017-09-10 03:21:01

Tags: python json csv scrapy

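I am new to Python and Scrapy. I searched the web but did not find many Scrapy examples. As an exercise and a challenge, I am trying to use Scrapy to get the href links from a page's source and put them into a JSON file. I also found a useful github project that uses Scrapy and Python to generate movie URLs from page source. Unfortunately, that github source is outdated and does not fully work. In the file movie_spider.py I made a one-line change to replace the URL with a recent working one. That is, I changed: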

name, start_urls = 'ip_spider', ['http://iranproud.com/movies']

to:

name, start_urls = 'ip_spider', ['http://www.iranproud.vip/irani-best-movies']

Then I ran it with this command:

scrapy crawl ip_spider -o movies_list.csv -t csv

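Currently movie.json has 237 movies, but it is 3 years old and has none of the recent movies. Could someone please help me with what should be changed in the github project https://github.com/xldrx/kodi-persian-contents, or how it should be updated...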
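For reference, the step I am after (collecting the href values from a page's source and serializing them as JSON) can be sketched without Scrapy, using only the standard library. This is just a rough sketch under assumptions: the sample HTML, the names HrefCollector, extract_hrefs and save_links, and the base URL http://example.com/ are made up for illustration and are not taken from the github project. Inside a Scrapy callback, the equivalent extraction would be response.css('a::attr(href)').extract().

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_hrefs(html, base_url):
    """Return all <a href> targets in html as absolute URLs."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    return parser.links


def save_links(links, path):
    """Write the links to path as a JSON list of {"url": ...} objects."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([{"url": url} for url in links], fh, indent=2)


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="/movies/first-movie">First</a>
  <a href="http://example.com/movies/second-movie">Second</a>
  <a name="no-link">Anchor without href</a>
</body></html>
"""

links = extract_hrefs(SAMPLE_HTML, "http://example.com/")
payload = json.dumps([{"url": url} for url in links], indent=2)
print(payload)
```

Calling save_links(links, "links.json") would write the same data to a file; the file name is only an example.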

Here is part of the log:

2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/site.aspx?aspxerrorpath=/iran-1-movies/tv&cinema/yek-damaghe-naghabel> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/iran-1-movies/tv&cinema/az-ma-behtaroon> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/site.aspx?aspxerrorpath=/iran-1-movies/tv&cinema/inja-aseman-hamishe-baranist> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://iranproud.net/iran-1-movies/tv&cinema/behtarin-hamsayeh-donya> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-09-09 21:45:10 [scrapy.extensions.logstats] INFO: Crawled 137 pages (at 1 pages/min), scraped 0 items (at 0 items/min)

Here is part of the movies.json file (the result), but it does not include all of the recent movie URLs:

{"video_url": "http://63.237.48.3/ipnx/media/movies/KhastehNabashiHQ.mp4", "title": ["Khasteh Nabashid"]},
{"video_url": "http://63.237.48.3/ipnx/media/movies/Khaneh_Neshin_HQ.mp4", "title": ["Khane Neshin"]},

Thanks.

0 answers:

No answers