Scraping HTML siblings as individual items?

Date: 2015-12-06 22:52:44

Tags: python html xpath css-selectors scrapy

This is a simple problem I can't seem to solve. I'm trying to scrape this website, and I want to collect the name and vote count of every entry listed on this page and on the pages that follow.

So far I've written a spider in Scrapy, but the results aren't formatted correctly. Instead of one item per company listing that company's name and vote count, I get a single item containing the names of all the companies on the page and all of their vote counts.

i.e. I want this:

Item    voteCount   startUpName 
1       17,950      1stCompany 
2       11,487      2ndCompany 
3       7175        3rd company

But I get this:

Item    voteCount               startUpName
1       17,950,114,877,175      1stCompany, 2ndCompany, 3rdCompany

As far as I can tell, it comes down to how I've defined my XPaths, but no matter what I try I can't get it to work. I'm sure I could fix this in post-processing, but I'd really like to understand how Scrapy works under the hood.
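For reference, the post-processing workaround mentioned above would be straightforward: the flattened item holds two parallel lists, so they can be zipped back into one row per company. A minimal sketch with made-up sample values (not the real scraped data):

```python
# Flattened output: one list of all names and one list of all vote counts,
# as produced by the broken spider (sample values, not real scraped data).
names = ["1stCompany", "2ndCompany", "3rdCompany"]
votes = ["17,950", "11,487", "7175"]

# Zip the parallel lists back together into one dict per company.
rows = [{"startUpName": n, "voteCount": v} for n, v in zip(names, votes)]
print(rows[0])  # {'startUpName': '1stCompany', 'voteCount': '17,950'}
```

This only works as long as the two lists stay aligned, which is exactly why fixing the XPaths is the better solution.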

Looking at the code below, does anyone have a suggestion as to why this is happening?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from GSB.items import *


class startupSpider(CrawlSpider):
    name = "startupSpider"
    allowed_domains = ["reviewr.com"]

    #In the future this can be handed to the spider
    start_urls = [
        'https://app.reviewr.com/gsb/site/gsb2015/FdxbQVIpg8920,bx5052,bx5051,cb24476?sort=Popular&group=1305626&keyword='
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="page-more"]')), callback="parse_items", follow= True),
    )

    def parse_items(self, response):
        items = []

        for sel in response.xpath('//div[@class="submission-list"]'):
            item = GSBItem()
            item['startUpName'] = sel.xpath('//a/div/text()').extract()
            item['voteCount']   = sel.xpath('//div[@class="vote-count"]/text()').extract()
            item['desc']        = sel.xpath('//div[@class="teaser"]/text()').extract()
            items.append(item)
        return items

Thanks,

Ryan

2 answers:

Answer 0 (score: 1)

First of all, the path you select in your for loop is wrong. You only match the parent list container, so text() returns the text of every child item at once. I have also changed the XPath for the company name. Note the contains() function in the first XPath: it matches the submission element even though it carries multiple classes. The corrected XPaths should look like this:

    for sel in response.xpath('//div[@class="submission-list"]/div[contains(@class, "submission")]'):
        item = GSBItem()
        # The leading ".//" keeps each query relative to the current submission
        item['startUpName'] = sel.xpath('.//div[@class="name"]/text()').extract()
        item['voteCount']   = sel.xpath('.//div[@class="vote-count"]/text()').extract()
        item['desc']        = sel.xpath('.//div[@class="teaser"]/text()').extract()
        items.append(item)
    return items
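The key point is that a relative path (starting with `.//`) searches only inside the node selected by the loop, while an absolute path (starting with `//`) searches the whole document again. The same scoping behaviour can be seen with the standard library's ElementTree on a toy version of the page (the markup below is invented for illustration, not the site's real HTML):

```python
import xml.etree.ElementTree as ET

# Toy markup mimicking the submission list (invented for illustration).
html = """<root><div class="submission-list">
  <div class="submission"><div class="name">1stCompany</div><div class="vote-count">17,950</div></div>
  <div class="submission"><div class="name">2ndCompany</div><div class="vote-count">11,487</div></div>
</div></root>"""

root = ET.fromstring(html)
items = []
for sub in root.findall('.//div[@class="submission"]'):
    # Relative ".//" paths search only within this submission node,
    # so each item gets exactly one name and one vote count.
    items.append({
        "startUpName": sub.find('.//div[@class="name"]').text,
        "voteCount": sub.find('.//div[@class="vote-count"]').text,
    })
print(items[0])  # {'startUpName': '1stCompany', 'voteCount': '17,950'}
```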

Answer 1 (score: 1)

The problem is that the XPath expressions inside your loop are not relative to the context selected by the outer query.

Try:

for i, sel in enumerate(response.xpath('//div[@class="submission-list"]/div[contains(@class,"submission")]/div[@class="content"]'), start=1):
    startup_name = sel.xpath('.//div[@class="title"]/a/div[@class="name"]/text()').extract()[0].encode('utf-8')
    votes = sel.xpath('.//div[@class="count vote-widget "]/div[@class="vote-count"]/text()').extract()[0]
    print "[{}] {} has {} votes".format(i, startup_name, votes)

Output:

[1] ProteCão has 17950 votes
[2] megaBoost has 11487 votes
[3] HoushmandSafar has 7175 votes
[4] SyncrHome has 6759 votes
[5] kidIN has 4398 votes
[6] KooKapp has 3979 votes
[7] Alerta UV has 3814 votes
[8] Athlon Hunters has 3775 votes
[9] Fernweh has 2738 votes
[10] Getmyweather has 2692 votes
[11] Feaglett has 2474 votes
[12] Legend of the coins has 2434 votes
[13] ACERCATE has 2306 votes
[14] Smart Automation has 2003 votes
[15] Nas4Nas has 1379 votes
[16] Hier_my_spa! has 1298 votes
[17] Watch Agent has 1130 votes
[18] LiftSync has 1053 votes
[19] WooU has 1005 votes
[20] Giftr has 909 votes
[21] FLNT has 659 votes
[22] Tencil has 616 votes
[23] Taker has 596 votes
[24] HidroBrain has 522 votes

For more details, see this presentation:

http://www.slideshare.net/scrapinghub/xpath-for-web-scraping