A simple problem that I can't seem to solve. I'm trying to scrape this website, and I want to collect the name and vote count of every entry listed on this page and on the pages that follow.
So far I've built a spider in Scrapy, but the results aren't formatted correctly. Instead of a separate item for each company listing its name and vote count, I get one item containing the names of all companies on the page and all of their votes.
i.e. I want this:
Item voteCount startUpName
1 17,950 1stCompany
2 11,487 2ndCompany
3 7175 3rd company
But I get this:
Item voteCount startUpName
1 17,950,114,877,175 1stCompany, 2ndCompany, 3rdCompany
As far as I can tell, it comes down to how I'm defining my XPaths, but whatever I try I can't get it to work. I'm sure I could fix this in post-processing, but I'd really like to understand how Scrapy works under the hood.
Looking at the code below, does anyone have a suggestion as to why this is happening?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from GSB.items import *

class startupSpider(CrawlSpider):
    name = "startupSpider"
    allowed_domains = ["reviewr.com"]

    # In the future this can be handed to the spider
    start_urls = [
        'https://app.reviewr.com/gsb/site/gsb2015/FdxbQVIpg8920,bx5052,bx5051,cb24476?sort=Popular&group=1305626&keyword='
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="page-more"]')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath('//div[@class="submission-list"]'):
            item = GSBItem()
            item['startUpName'] = sel.xpath('//a/div/text()').extract()
            item['voteCount'] = sel.xpath('//div[@class="vote-count"]/text()').extract()
            item['desc'] = sel.xpath('//div[@class="teaser"]/text()').extract()
            items.append(item)
        return items
Thanks,
Ryan
Answer 0 (score: 1)
First, you selected the wrong path in your for loop: you were iterating over the parent list only, so text() returned the text of every child at once. I also changed the XPath for the company name. Note the contains() function in the first XPath; it matches the element even though it carries more than one class. The correct XPath should look like this:
for sel in response.xpath('//div[@class="submission-list"]/div[contains(@class, "submission")]'):
    item = GSBItem()
    # Note the leading './/': without it, each XPath searches the whole
    # document again instead of just the current submission.
    item['startUpName'] = sel.xpath('.//div[@class="name"]/text()').extract()
    item['voteCount'] = sel.xpath('.//div[@class="vote-count"]/text()').extract()
    item['desc'] = sel.xpath('.//div[@class="teaser"]/text()').extract()
    items.append(item)
return items
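The absolute-versus-relative distinction can be seen outside Scrapy too. A minimal sketch (using lxml, whose XPath semantics on elements match Scrapy's selectors; the markup and class names are a toy reconstruction from the question and answers, not the real page):

```python
from lxml import html

# Toy markup mirroring the structure described in the question.
doc = html.fromstring("""
<div class="submission-list">
  <div class="submission"><div class="name">1stCompany</div>
                          <div class="vote-count">17,950</div></div>
  <div class="submission"><div class="name">2ndCompany</div>
                          <div class="vote-count">11,487</div></div>
</div>
""")

sub = doc.xpath('//div[@class="submission"]')[0]  # first row only

# '//' is absolute: it restarts at the document root, so even when called
# on a single row it returns every company name on the page.
leaked = sub.xpath('//div[@class="name"]/text()')

# './/' is relative: it stays inside the current row.
scoped = sub.xpath('.//div[@class="name"]/text()')

print(leaked)  # ['1stCompany', '2ndCompany']
print(scoped)  # ['1stCompany']
```

This is exactly why the asker's one big item appeared: every field lookup inside the loop silently re-queried the whole page.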
Answer 1 (score: 1)
The problem is that the XPaths inside your loop are not evaluated relative to the context of the main query.
Try:
for i, sel in enumerate(response.xpath('//div[@class="submission-list"]/div[contains(@class,"submission")]/div[@class="content"]'), start=1):
    startup_name = sel.xpath('.//div[@class="title"]/a/div[@class="name"]/text()').extract()[0].encode('utf-8')
    votes = sel.xpath('.//div[@class="count vote-widget "]/div[@class="vote-count"]/text()').extract()[0]
    print "[{}] {} has {} votes".format(i, startup_name, votes)
<强>输出:强>
[1] ProteCão has 17950 votes
[2] megaBoost has 11487 votes
[3] HoushmandSafar has 7175 votes
[4] SyncrHome has 6759 votes
[5] kidIN has 4398 votes
[6] KooKapp has 3979 votes
[7] Alerta UV has 3814 votes
[8] Athlon Hunters has 3775 votes
[9] Fernweh has 2738 votes
[10] Getmyweather has 2692 votes
[11] Feaglett has 2474 votes
[12] Legend of the coins has 2434 votes
[13] ACERCATE has 2306 votes
[14] Smart Automation has 2003 votes
[15] Nas4Nas has 1379 votes
[16] Hier_my_spa! has 1298 votes
[17] Watch Agent has 1130 votes
[18] LiftSync has 1053 votes
[19] WooU has 1005 votes
[20] Giftr has 909 votes
[21] FLNT has 659 votes
[22] Tencil has 616 votes
[23] Taker has 596 votes
[24] HidroBrain has 522 votes
For more details, see this presentation:
http://www.slideshare.net/scrapinghub/xpath-for-web-scraping
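Carried back into the asker's spider, the fix from both answers is to iterate over the individual rows and scope every field lookup with './/'. A self-contained sketch of that loop (simulated with lxml on toy markup rather than a live Scrapy response, and returning plain dicts instead of GSBItem; the class names and teaser texts are assumptions based on the answers above):

```python
from lxml import html

# Toy page standing in for the real response (values from the question).
page = html.fromstring("""
<div class="submission-list">
  <div class="submission odd">
    <div class="name">1stCompany</div><div class="vote-count">17,950</div>
    <div class="teaser">First teaser</div>
  </div>
  <div class="submission even">
    <div class="name">2ndCompany</div><div class="vote-count">11,487</div>
    <div class="teaser">Second teaser</div>
  </div>
</div>
""")

items = []
# Iterate over each submission row, not the whole list container.
for sel in page.xpath('//div[@class="submission-list"]'
                      '/div[contains(@class, "submission")]'):
    items.append({
        # './/' keeps each field scoped to the current row.
        'startUpName': sel.xpath('.//div[@class="name"]/text()')[0],
        'voteCount': sel.xpath('.//div[@class="vote-count"]/text()')[0],
        'desc': sel.xpath('.//div[@class="teaser"]/text()')[0],
    })

print(items)
```

Each dict now corresponds to one company, which matches the one-row-per-item table the asker wanted.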