A simple problem that I can't seem to solve. I'm trying to scrape this website, and I want to collect the name and vote count of every entry listed on this page and on the pages that follow.
So far I've built a spider in Scrapy, but the results aren't formatted correctly. Instead of a separate item for each company listing its name and vote count, I get one item containing the names of all companies on the page and all of their votes.
i.e. I want this:
Item voteCount startUpName
1 17,950 1stCompany
2 11,487 2ndCompany
3 7175 3rd company
But I get this:
Item voteCount startUpName
1 17,950,114,877,175 1stCompany, 2ndCompany, 3rdCompany
As far as I can tell, it comes down to how I'm defining my XPaths, but whatever I try I can't get it to work. I'm sure I could fix this in post-processing, but I'd really like to understand how Scrapy works under the hood.
Looking at the code below, does anyone have a suggestion as to why this is happening?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from GSB.items import *

class startupSpider(CrawlSpider):
    name = "startupSpider"
    allowed_domains = ["reviewr.com"]

    # In the future this can be handed to the spider
    start_urls = [
        'https://app.reviewr.com/gsb/site/gsb2015/FdxbQVIpg8920,bx5052,bx5051,cb24476?sort=Popular&group=1305626&keyword='
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="page-more"]')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath('//div[@class="submission-list"]'):
            item = GSBItem()
            item['startUpName'] = sel.xpath('//a/div/text()').extract()
            item['voteCount'] = sel.xpath('//div[@class="vote-count"]/text()').extract()
            item['desc'] = sel.xpath('//div[@class="teaser"]/text()').extract()
            items.append(item)
        return items
Thanks,
Ryan
Answer 0 (score: 1)
First, you selected the wrong path in your for loop: you were iterating over the parent list only, so text() returned the text of every child at once. I also changed the XPath for the company name. Note the contains() function in the first XPath; it matches the element even though it carries more than one class. The correct XPath should look like this:
for sel in response.xpath('//div[@class="submission-list"]/div[contains(@class, "submission")]'):
    item = GSBItem()
    # Note the leading './/': without it, each XPath searches the whole
    # document again instead of just the current submission.
    item['startUpName'] = sel.xpath('.//div[@class="name"]/text()').extract()
    item['voteCount'] = sel.xpath('.//div[@class="vote-count"]/text()').extract()
    item['desc'] = sel.xpath('.//div[@class="teaser"]/text()').extract()
    items.append(item)
return items
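The absolute-versus-relative distinction can be seen outside Scrapy too. A minimal sketch (using lxml, whose XPath semantics on elements match Scrapy's selectors; the markup and class names are a toy reconstruction from the question and answers, not the real page):

```python
from lxml import html

# Toy markup mirroring the structure described in the question.
doc = html.fromstring("""
<div class="submission-list">
  <div class="submission"><div class="name">1stCompany</div>
                          <div class="vote-count">17,950</div></div>
  <div class="submission"><div class="name">2ndCompany</div>
                          <div class="vote-count">11,487</div></div>
</div>
""")

sub = doc.xpath('//div[@class="submission"]')[0]  # first row only

# '//' is absolute: it restarts at the document root, so even when called
# on a single row it returns every company name on the page.
leaked = sub.xpath('//div[@class="name"]/text()')

# './/' is relative: it stays inside the current row.
scoped = sub.xpath('.//div[@class="name"]/text()')

print(leaked)  # ['1stCompany', '2ndCompany']
print(scoped)  # ['1stCompany']
```

This is exactly why the asker's one big item appeared: every field lookup inside the loop silently re-queried the whole page.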
Answer 1 (score: 1)
The problem is that the XPaths inside your loop are not evaluated relative to the context of the main query.
Try:
for i, sel in enumerate(response.xpath('//div[@class="submission-list"]/div[contains(@class,"submission")]/div[@class="content"]'), start=1):
    startup_name = sel.xpath('.//div[@class="title"]/a/div[@class="name"]/text()').extract()[0].encode('utf-8')
    votes = sel.xpath('.//div[@class="count vote-widget "]/div[@class="vote-count"]/text()').extract()[0]
    print "[{}] {} has {} votes".format(i, startup_name, votes)
<强>输出:强>
[1] ProteCão has 17950 votes
[2] megaBoost has 11487 votes
[3] HoushmandSafar has 7175 votes
[4] SyncrHome has 6759 votes
[5] kidIN has 4398 votes
[6] KooKapp has 3979 votes
[7] Alerta UV has 3814 votes
[8] Athlon Hunters has 3775 votes
[9] Fernweh has 2738 votes
[10] Getmyweather has 2692 votes
[11] Feaglett has 2474 votes
[12] Legend of the coins has 2434 votes
[13] ACERCATE has 2306 votes
[14] Smart Automation has 2003 votes
[15] Nas4Nas has 1379 votes
[16] Hier_my_spa! has 1298 votes
[17] Watch Agent has 1130 votes
[18] LiftSync has 1053 votes
[19] WooU has 1005 votes
[20] Giftr has 909 votes
[21] FLNT has 659 votes
[22] Tencil has 616 votes
[23] Taker has 596 votes
[24] HidroBrain has 522 votes
For more details, see this presentation:
http://www.slideshare.net/scrapinghub/xpath-for-web-scraping
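Carried back into the asker's spider, the fix from both answers is to iterate over the individual rows and scope every field lookup with './/'. A self-contained sketch of that loop (simulated with lxml on toy markup rather than a live Scrapy response, and returning plain dicts instead of GSBItem; the class names and teaser texts are assumptions based on the answers above):

```python
from lxml import html

# Toy page standing in for the real response (values from the question).
page = html.fromstring("""
<div class="submission-list">
  <div class="submission odd">
    <div class="name">1stCompany</div><div class="vote-count">17,950</div>
    <div class="teaser">First teaser</div>
  </div>
  <div class="submission even">
    <div class="name">2ndCompany</div><div class="vote-count">11,487</div>
    <div class="teaser">Second teaser</div>
  </div>
</div>
""")

items = []
# Iterate over each submission row, not the whole list container.
for sel in page.xpath('//div[@class="submission-list"]'
                      '/div[contains(@class, "submission")]'):
    items.append({
        # './/' keeps each field scoped to the current row.
        'startUpName': sel.xpath('.//div[@class="name"]/text()')[0],
        'voteCount': sel.xpath('.//div[@class="vote-count"]/text()')[0],
        'desc': sel.xpath('.//div[@class="teaser"]/text()')[0],
    })

print(items)
```

Each dict now corresponds to one company, which matches the one-row-per-item table the asker wanted.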