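I have read several answers to similar questions, but I still cannot solve my problem. I am new to Python and I am trying to scrape data about apps and stores from Aptoide. I want the output as a .json (or .csv) file, but the file I get is empty and I don't know why.
Here is my code: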
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector


class ApptoideItem(scrapy.Item):
    app_name = scrapy.Field()
    rating = scrapy.Field()
    security_status = scrapy.Field()
    good_flag = scrapy.Field()
    licence_flag = scrapy.Field()
    fake_flag = scrapy.Field()
    freeze_flag = scrapy.Field()
    virus_flag = scrapy.Field()
    five_stars = scrapy.Field()
    four_stars = scrapy.Field()
    three_stars = scrapy.Field()
    two_stars = scrapy.Field()
    one_stars = scrapy.Field()
    info = scrapy.Field()
    download = scrapy.Field()
    version = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()
    store = scrapy.Field()


class AppSpider(CrawlSpider):
    name = "second"
    allowed_domains = ["aptoide.com"]
    start_urls = ["http://www.aptoide.com/page/morestores/type:top"]

    rules = (
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com$'))),
        Rule(LinkExtractor(allow=(r'\w+\.store\.aptoide\.com/app/market')), callback='parse_item')
    )

    def parse_item(self, response):
        item = ApptoideItem()
        item['app_name'] = str(response.css(".app_name::text").extract()[0])
        item['rating'] = str(response.css(".app_rating_number::text").extract()[0])
        item['security_status'] = str(response.css("#show_app_malware_data::text").extract()[0])
        item['good_flag'] = int(response.css(".good > div:nth-child(3)::text").extract()[0])
        item['licence_flag'] = int(response.css(".license > div:nth-child(3)::text").extract()[0])
        item['fake_flag'] = int(response.css(".fake > div:nth-child(3)::text").extract()[0])
        item['freeze_flag'] = int(response.css(".freeze > div:nth-child(3)::text").extract()[0])
        item['virus_flag'] = int(response.css(".virus > div:nth-child(3)::text").extract()[0])
        item['five_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(1) > div:nth-child(3)::text").extract()[0])
        item['four_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(2) > div:nth-child(3)::text").extract()[0])
        item['three_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(3) > div:nth-child(3)::text").extract()[0])
        item['two_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(4) > div:nth-child(3)::text").extract()[0])
        item['link'] = response.url
        item['one_stars'] = int(response.css("div.app_ratting_bar_holder:nth-child(5) > div:nth-child(3)::text").extract()[0])
        item['download'] = int(response.css("p.app_meta::text").re('(\d[\w\.]*)')[0].replace('.', ''))
        item['version'] = str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[1])
        item['size'] = str(response.css("p.app_meta::text").re('(\d[\w\.]*)')[2])
        item['store_name'] = str(response.css(".sec_header_txt::text").extract()[0])
        item['info_store'] = str(response.css(".ter_header2::text").extract()[0])
        yield item

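I am fairly sure the problem is that the parse_item method is never called, and I don't know why. The first rule follows the stores, while the second rule follows the apps inside each store. I believe the regular expression syntax is correct.
The settings are: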
CLOSESPIDER_PAGECOUNT = 1000
CLOSESPIDER_ITEMCOUNT = 500
CONCURRENT_REQUESTS = 1
CONCURRENT_ITEMS = 1
BOT_NAME = 'nuovo'
SPIDER_MODULES = ['nuovo.spiders']
NEWSPIDER_MODULE = 'nuovo.spiders'
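Can anyone spot the problem and suggest a solution?
Answer 0 (score: 0)
Your code is full of errors. When you run the spider, you can save the log to a file and then search through it with grep: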
scrapy crawl spidername 2>&1 | tee crawl.log
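Here are a few of the errors I found:
1. ApptoideItem is missing store_name and several other fields that parse_item assigns.
2. The int() conversions are unsafe: if response.css finds nothing, there is no value to convert and you get an error (extract()[0] raises IndexError on an empty result).
To address the second point, I suggest looking into Scrapy ItemLoaders, which let you specify default behaviour for certain fields, for example converting the values of the _flag fields to booleans.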
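As a rough illustration of the second point, here is a minimal sketch of a defensive conversion. The helper name to_int is hypothetical and not part of Scrapy; it only shows the kind of fallback behaviour you would want before calling int() on scraped text.

```python
def to_int(text, default=0):
    """Convert scraped text to int.

    Returns `default` when the value is missing (None) or is not a
    valid integer, instead of raising an exception.
    `to_int` is a hypothetical helper, not a Scrapy function.
    """
    if text is None:
        return default
    try:
        return int(str(text).strip())
    except ValueError:
        return default


# Values as they might come back from a selector
print(to_int(" 12 "))             # whitespace around a number is tolerated
print(to_int(None))               # missing value falls back to the default
print(to_int("n/a", default=-1))  # non-numeric text falls back to the default
```

A function like this could probably also be plugged into an ItemLoader as an input processor, but that wiring is left out here.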
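Also, as @Jan mentioned in the comments, you should use the extract_first() method instead of extract()[0]. extract_first lets you specify a default value to return when nothing is found, i.e. .extract_first(default=0)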
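To make the difference concrete without needing a running spider, the sketch below uses a plain-Python stand-in. The function first_or_default is hypothetical; it only imitates what extract_first(default=...) gives you, with ordinary lists standing in for the result of extract().

```python
def first_or_default(values, default=None):
    """Return the first element of `values`, or `default` if it is empty.

    Hypothetical stand-in that imitates extract_first(default=...);
    indexing with [0] raises IndexError on an empty list instead.
    """
    return values[0] if values else default


# Simulated results of response.css("...::text").extract()
found = ["42"]
missing = []

first_hit = first_or_default(found, default="0")      # "42"
fallback = first_or_default(missing, default="0")     # "0"

# The pattern used in the question fails when nothing matched
try:
    missing[0]
    raised = False
except IndexError:
    raised = True
```

In the spider itself, the corresponding call would look like response.css(".good > div:nth-child(3)::text").extract_first(default="0"), after which the string can be converted to an integer.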