I am using the command-line crawler from GitHub (https://github.com/mssun/android-apps-crawler, running on a Linux virtual machine) to crawl a third-party app store (www.mumayi.com) for all of the APK files it hosts, so that I can analyse them. I know the site hosts a large number of APKs.
However, when I run the program it works well at first and quickly finds files (35-50 on average), inserting them into a database to be downloaded later. But after 1-2 minutes of running fine it stops finding anything new, no matter how long I leave it, even though I know there are many more APK files on the site.
Can anyone shed some light on why this happens? Could it be that the site doesn't like my program going through its files?
I have included a sample command-line log below. Note that in this run I only let it keep going for about 10 minutes after it stopped finding files; I have let other runs go for 24 hours and the result is still the same.
matt@matt-VirtualBox:~/Downloads/android-apps-crawler-master/crawler$ ./crawl.sh mumayi.com
/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:3: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
from scrapy.spider import Spider
/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:7: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2015-08-31 09:38:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: android_apps_crawler)
2015-08-31 09:38:28 [scrapy] INFO: Optional features available: ssl, http11
2015-08-31 09:38:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'android_apps_crawler.spiders', 'SPIDER_MODULES': ['android_apps_crawler.spiders'], 'LOG_LEVEL': 'INFO', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11(KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11', 'BOT_NAME': 'android_apps_crawler'}
2015-08-31 09:38:28 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-31 09:38:28 [scrapy] INFO: Enabled downloader middlewares: DownloaderMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-31 09:38:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
../repo/databases/mumayi.com.db
2015-08-31 09:38:28 [scrapy] INFO: Enabled item pipelines: AppPipeline, SQLitePipeline
2015-08-31 09:38:28 [scrapy] INFO: Spider opened
2015-08-31 09:38:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py:85: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Catch an application: %s" % url, level=log.INFO)
2015-08-31 09:38:41 [scrapy] INFO: Catch an application: http://down.mumayi.com/54049
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/pipelines.py:12: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Catch an AppItem", level=log.INFO)
2015-08-31 09:38:41 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:41 [py.warnings] WARNING: /home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/pipelines.py:33: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Inserting into database");
2015-08-31 09:38:41 [scrapy] INFO: Inserting into database
2015-08-31 09:38:44 [scrapy] INFO: Catch an application: http://down.mumayi.com/989871
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/1003630
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/217624
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
2015-08-31 09:38:45 [scrapy] INFO: Catch an application: http://down.mumayi.com/970142
2015-08-31 09:38:45 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:45 [scrapy] INFO: Inserting into database
Ignore request!
Ignore request!
Ignore request!
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/42860
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/555845
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:47 [scrapy] INFO: Catch an application: http://down.mumayi.com/121890
2015-08-31 09:38:47 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:47 [scrapy] INFO: Inserting into database
2015-08-31 09:38:48 [scrapy] INFO: Catch an application: http://down.mumayi.com/197417
2015-08-31 09:38:48 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:48 [scrapy] INFO: Inserting into database
2015-08-31 09:38:48 [scrapy] INFO: Catch an application: http://down.mumayi.com/254262
2015-08-31 09:38:48 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:48 [scrapy] INFO: Inserting into database
2015-08-31 09:38:49 [scrapy] INFO: Catch an application: http://down.mumayi.com/308575
2015-08-31 09:38:49 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:49 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/227335
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] ERROR: Spider error processing <GET http://down.mumayi.com/minisetup/970142> (referer: http://www.mumayi.com/android-970142.html)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 36, in parse
    self.parse_xpath(response, xpath_rule[key]))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 82, in parse_xpath
    sel = Selector(response)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/45243
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/7937
2015-08-31 09:38:50 [scrapy] INFO: Catch an application: http://down.mumayi.com/858308
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:50 [scrapy] INFO: Inserting into database
2015-08-31 09:38:50 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:51 [scrapy] INFO: Inserting into database
2015-08-31 09:38:51 [scrapy] INFO: Catch an application: http://down.mumayi.com/499346
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/1003438
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/549777
2015-08-31 09:38:52 [scrapy] INFO: Catch an application: http://down.mumayi.com/1002249
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:52 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:52 [scrapy] INFO: Inserting into database
2015-08-31 09:38:53 [scrapy] INFO: Catch an application: http://down.mumayi.com/335562
2015-08-31 09:38:53 [scrapy] INFO: Catch an AppItem
2015-08-31 09:38:53 [scrapy] INFO: Inserting into database
2015-08-31 09:39:21 [scrapy] INFO: Catch an application: http://down.mumayi.com/51129
2015-08-31 09:39:21 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:21 [scrapy] INFO: Inserting into database
2015-08-31 09:39:22 [scrapy] INFO: Catch an application: http://down.mumayi.com/72090
2015-08-31 09:39:22 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:22 [scrapy] INFO: Inserting into database
2015-08-31 09:39:23 [scrapy] INFO: Catch an application: http://down.mumayi.com/318245
2015-08-31 09:39:23 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:23 [scrapy] INFO: Inserting into database
2015-08-31 09:39:23 [scrapy] INFO: Catch an application: http://down.mumayi.com/52958
2015-08-31 09:39:23 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:24 [scrapy] INFO: Inserting into database
2015-08-31 09:39:25 [scrapy] INFO: Catch an application: http://down.mumayi.com/212803
2015-08-31 09:39:25 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:25 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/287
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/426381
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/32326
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an application: http://down.mumayi.com/113156
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:26 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:26 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] INFO: Crawled 184 pages (at 184 pages/min), scraped 29 items (at 29 items/min)
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/230146
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/208
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:28 [scrapy] ERROR: Spider error processing <GET http://down.mumayi.com/minisetup/318245> (referer: http://www.mumayi.com/android-318245.html)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 36, in parse
    self.parse_xpath(response, xpath_rule[key]))
  File "/home/matt/Downloads/android-apps-crawler-master/crawler/android_apps_crawler/spiders/android_apps_spider.py", line 82, in parse_xpath
    sel = Selector(response)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
2015-08-31 09:39:28 [scrapy] INFO: Catch an application: http://down.mumayi.com/59
2015-08-31 09:39:28 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:28 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/882209
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/987896
2015-08-31 09:39:30 [scrapy] INFO: Catch an application: http://down.mumayi.com/97686
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:30 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:30 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/979277
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/350618
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/343323
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/21799
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:31 [scrapy] INFO: Catch an application: http://down.mumayi.com/485394
2015-08-31 09:39:31 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:31 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/24615
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/872176
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:32 [scrapy] INFO: Catch an application: http://down.mumayi.com/63575
2015-08-31 09:39:32 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:32 [scrapy] INFO: Inserting into database
2015-08-31 09:39:33 [scrapy] INFO: Catch an application: http://down.mumayi.com/1007326
2015-08-31 09:39:33 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:33 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/62258
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/64880
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:35 [scrapy] INFO: Catch an application: http://down.mumayi.com/455675
2015-08-31 09:39:35 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:35 [scrapy] INFO: Inserting into database
2015-08-31 09:39:36 [scrapy] INFO: Catch an application: http://down.mumayi.com/851783
2015-08-31 09:39:36 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:36 [scrapy] INFO: Inserting into database
2015-08-31 09:39:40 [scrapy] INFO: Catch an application: http://down.mumayi.com/14037
2015-08-31 09:39:40 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:40 [scrapy] INFO: Inserting into database
2015-08-31 09:39:43 [scrapy] INFO: Catch an application: http://down.mumayi.com/274799
2015-08-31 09:39:43 [scrapy] INFO: Catch an AppItem
2015-08-31 09:39:43 [scrapy] INFO: Inserting into database
2015-08-31 09:40:28 [scrapy] INFO: Crawled 333 pages (at 149 pages/min), scraped 50 items (at 21 items/min)
2015-08-31 09:41:28 [scrapy] INFO: Crawled 538 pages (at 205 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:42:28 [scrapy] INFO: Crawled 795 pages (at 257 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:43:28 [scrapy] INFO: Crawled 1044 pages (at 249 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:44:28 [scrapy] INFO: Crawled 1269 pages (at 225 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:45:28 [scrapy] INFO: Crawled 1616 pages (at 347 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:46:28 [scrapy] INFO: Crawled 2041 pages (at 425 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:47:28 [scrapy] INFO: Crawled 2417 pages (at 376 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:48:28 [scrapy] INFO: Crawled 2790 pages (at 373 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:49:28 [scrapy] INFO: Crawled 3131 pages (at 341 pages/min), scraped 50 items (at 0 items/min)
2015-08-31 09:50:28 [scrapy] INFO: Crawled 3463 pages (at 332 pages/min), scraped 50 items (at 0 items/min)
^C2015-08-31 09:51:11 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force
2015-08-31 09:51:11 [scrapy] INFO: Closing spider (shutdown)
2015-08-31 09:51:21 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 3,
'downloader/request_bytes': 5030837,
'downloader/request_count': 4613,
'downloader/request_method_count/GET': 4613,
'downloader/response_bytes': 49981622,
'downloader/response_count': 4613,
'downloader/response_status_count/200': 3742,
'downloader/response_status_count/302': 871,
'dupefilter/filtered': 545833,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2015, 8, 31, 8, 51, 21, 908163),
'item_scraped_count': 50,
'log_count/ERROR': 2,
'log_count/INFO': 170,
'log_count/WARNING': 3,
'offsite/domains': 140,
'offsite/filtered': 81248,
'request_depth_max': 143,
'response_received_count': 3742,
'scheduler/dequeued': 4616,
'scheduler/dequeued/disk': 4616,
'scheduler/enqueued': 20763,
'scheduler/enqueued/disk': 20763,
'spider_exceptions/AttributeError': 2,
'start_time': datetime.datetime(2015, 8, 31, 8, 38, 28, 184357)}
2015-08-31 09:51:21 [scrapy] INFO: Spider closed (shutdown)
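In case it helps, a generic way to check how many URLs actually end up in the SQLite database mentioned near the top of the log (../repo/databases/mumayi.com.db) is something like the sketch below; it reads the table names from sqlite_master rather than assuming the pipeline's schema, since I haven't shown that here.

# Sanity-check sketch: count the rows in every table of the crawler's
# SQLite database, without assuming anything about the pipeline's schema.
import sqlite3

conn = sqlite3.connect("../repo/databases/mumayi.com.db")
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
for (table,) in cur.fetchall():
    cur.execute('SELECT COUNT(*) FROM "%s"' % table)
    print("%s: %d rows" % (table, cur.fetchone()[0]))
conn.close()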
My spider code:
import re
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import HtmlResponse
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin

from android_apps_crawler.items import AppItem
from android_apps_crawler import settings
from android_apps_crawler import custom_parser


class AndroidAppsSpider(Spider):
    name = "android_apps_spider"
    scrape_rules = settings.SCRAPE_RULES

    def __init__(self, market=None, database_dir="../repo/databases/", *args, **kwargs):
        super(AndroidAppsSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = settings.ALLOWED_DOMAINS[market]
        self.start_urls = settings.START_URLS[market]
        settings.MARKET_NAME = market
        settings.DATABASE_DIR = database_dir

    def parse(self, response):
        response_domain = urlparse(response.url).netloc
        appItemList = []
        cookie = {}
        # Scrape download links with the XPath rule registered for this domain.
        xpath_rule = self.scrape_rules['xpath']
        for key in xpath_rule.keys():
            if key in response_domain:
                appItemList.extend(
                        self.parse_xpath(response, xpath_rule[key]))
                break
        # Also run any custom parser registered for this domain.
        custom_parser_rule = self.scrape_rules['custom_parser']
        for key in custom_parser_rule.keys():
            if key in response_domain:
                appItemList.extend(
                        getattr(custom_parser, custom_parser_rule[key])(response))
                break
        #if "appchina" in response_domain:
        #    xpath = "//a[@id='pc-download' and @class='free']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "hiapk" in response_domain:
        #    xpath = "//a[@class='linkbtn d1']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "android.d.cn" in response_domain:
        #    xpath = "//a[@class='down']/@href"
        #    appItemList.extend(self.parse_xpath(response, xpath))
        #elif "anzhi" in response_domain:
        #    xpath = "//div[@id='btn']/a/@onclick"
        #    appItemList.extend(self.parse_anzhi(response, xpath))
        #else:
        #    pass
        # Follow every link on the page and parse it with this same callback.
        sel = Selector(response)
        for url in sel.xpath('//a/@href').extract():
            url = urljoin(response.url, url)
            yield Request(url, meta=cookie, callback=self.parse)
        # Emit the collected download URLs to the pipelines.
        for item in appItemList:
            yield item

    #def parse_appchina(self, response):
    #    appItemList = []
    #    hxs = HtmlXPathSelector(response)
    #    for url in hxs.select(
    #            "//a[@id='pc-download' and @class='free']/@href"
    #            ).extract():
    #        url = urljoin(response.url, url)
    #        log.msg("Catch an application: %s" % url, level=log.INFO)
    #        appItem = AppItem()
    #        appItem['url'] = url
    #        appItemList.append(appItem)
    #    return appItemList

    def parse_xpath(self, response, xpath):
        appItemList = []
        sel = Selector(response)
        for url in sel.xpath(xpath).extract():
            url = urljoin(response.url, url)
            log.msg("Catch an application: %s" % url, level=log.INFO)
            appItem = AppItem()
            appItem['url'] = url
            appItemList.append(appItem)
        return appItemList

    #def parse_anzhi(self, response, xpath):
    #    appItemList = []
    #    hxs = HtmlXPathSelector(response)
    #    for script in hxs.select(xpath).extract():
    #        id = re.search(r"\d+", script).group()
    #        url = "http://www.anzhi.com/dl_app.php?s=%s&n=5" % (id,)
    #        appItem = AppItem()
    #        appItem['url'] = url
    #        appItemList.append(appItem)
    #    return appItemList
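As a side note, the two tracebacks in the log both come from Selector(response) being built on a response that isn't HTML (the /minisetup/ download URLs); Scrapy's base Response class has no body_as_unicode(). Would an untested guard like the sketch below, assuming non-HTML responses really are the cause, be the right way to handle that, or is it unrelated to the items drying up? (parse() builds a Selector as well, so it would presumably need the same check.)

    # Untested sketch: skip responses that are not HTML so that Selector()
    # is never constructed on a binary body such as the /minisetup/ pages.
    def parse_xpath(self, response, xpath):
        appItemList = []
        if not isinstance(response, HtmlResponse):
            return appItemList
        sel = Selector(response)
        for url in sel.xpath(xpath).extract():
            url = urljoin(response.url, url)
            log.msg("Catch an application: %s" % url, level=log.INFO)
            appItem = AppItem()
            appItem['url'] = url
            appItemList.append(appItem)
        return appItemList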