Extracting multiple selectors with Scrapy

Date: 2018-12-18 16:04:06

Tags: python scrapy

I have an HTML file with the following structure:

    <div class='past_financing section'><div class="section dsss17 startups-show-sections fpg76 past_financing _a _jm" data-id="32319" data-_tn="startups/show/sections/past_financing"><div data-id="32319" class=" dsss17 startups-show-sections fss49 startup_rounds _a _jm" data-_tn="startups/show/sections/startup_rounds"><ul class='startup_rounds with_rounds'>
    <li class='first not_editing startup_round'> <div data-id="56738" class=" dsr49 fpe53 _a _jm" data-_tn="startup_rounds/profile"><div class='show section'> <div class='details inner_section'> <div class='header'> <div class='type'> Series A </div> </div> <div class='raised'> $1,500,000 </div> </div> </div> </div> </li>
    <li class='first not_editing startup_round'> <div data-id="72884" class=" dsr49 fpe53 _a _jm" data-_tn="startup_rounds/profile"><div class='show section'> <div class='g-sash green left'> <div class='copy'>Exit</div> </div> <div class='details inner_section'> <div class='header'> <div class='type'> Acquired by Travora Media - New York, NY </div> <div class='date_display'>Apr 1, 2012</div> </div> <div class='raised unknown'> Unknown </div> </div> <div class='participant_list inner_section'> <div class='participant g-lockup'> <div class='photo'> <a class="startup-link" title="Travora Media - New York, NY" data-type="Startup" data-id="242501" href="https://angel.co/travora-media-new-york-ny"><img class="angel_image" alt="Travora Media - New York, NY" src="https://angel.co/images/shared/nopic_startup.png" /></a> </div> <div class='text'> <div class='name'> <a class="startup-link" data-type="Startup" data-id="242501" href="https://angel.co/travora-media-new-york-ny">Travora Media - New York, NY</a> </div> <div class='tags'> </div> </div> </div> </div> </div> </div> </li>
    <li class='first not_editing startup_round'> <div data-id="12714" class=" dsr49 fpe53 _a _jm" data-_tn="startup_rounds/profile"><div class='show section'> <div class='details inner_section'> <div class='header'> <div class='type unknown'> No Stage </div> <div class='date_display'>Dec 3, 2010</div> </div> <div class='raised'> <a rel="nofollow" target="_blank" href="http://venturebeat.com/2010/12/03/nileguide-funding/">$3,500,000</a> </div> <a class="read_press" rel="nofollow" target="_blank" href="http://venturebeat.com/2010/12/03/nileguide-funding/">Read Press</a> </div> </div> </div> </li>

I want to extract the information for each li element in the HTML above. I am looking for the funding round, the date, and the amount raised in that round.

So far I have come up with the following XPath expressions:

    funding_round = '//div[@class="past_financing section"]/div/div/ul[@class="startup_rounds with_rounds"]/li/div/div/div/div/div[@class="type"]/text()'
    funding_date = '//div[@class="past_financing section"]/div/div/ul[@class="startup_rounds with_rounds"]/li/div/div/div/div/div[@class="date_display"]/text()'

    rounds = response.xpath(funding_round).extract()
    dates = response.xpath(funding_date).extract()

This works, but as output I get two separate lists:

    ['\nSeries A\n', '\nAcquired\nby Travora Media - New York, NY\n', '\nSeries B\n']

    ['Apr  1, 2012', 'Dec  3, 2010', 'Jun  5, 2008']

Concatenating them does not help either:

    funding_round = response.xpath(funding_round).extract() + response.xpath(funding_date).extract()

The problem is that the site's structure is inconsistent: some li elements have no information about the date or the money. Ideally I would like to retrieve a tuple object from a single query.

The final list should contain one entry per li element, with that element's round, date, and amount kept together.

Is this possible with Scrapy?

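2 answers:

Answer 0 (score: 1)

You have several options. I will use CSS selectors to keep things simple, but the same applies to XPath.

1) If both lists have the same size, you can use zip: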

    titles = response.css('li .type::text').extract()
    raised = response.css('li .raised::text').extract()
    list(zip(titles, raised))
    # [('\nSeries A\n', '\n$1,500,000\n'),
    #  ('\nAcquired\nby Travora Media - New York, NY\n', '\nUnknown\n'),
    #  ('\nNo Stage\n', '\n')]
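zip pairs values purely by position, so it is only safe when every li contributes exactly one value to each list. A minimal sketch using the two lists posted in the question: they have the same length, yet the pairing is wrong, because in the HTML excerpt the Series A entry has no date and Apr 1, 2012 belongs to the acquisition.

```python
# The two flat lists exactly as posted in the question.
rounds = ['\nSeries A\n', '\nAcquired\nby Travora Media - New York, NY\n', '\nSeries B\n']
dates = ['Apr  1, 2012', 'Dec  3, 2010', 'Jun  5, 2008']

# Collapse whitespace so the pairs are easier to read.
pairs = [(" ".join(r.split()), " ".join(d.split())) for r, d in zip(rounds, dates)]

# Equal lengths do not guarantee alignment: the first pair joins
# "Series A" with the date of the following li element.
print(pairs[0])  # ('Series A', 'Apr 1, 2012')
```

Once the values are flattened into separate lists, nothing records which li each one came from, which is why the per-li approach is the safer choice here.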

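2) If the lists differ in size, you can iterate over the li elements:

    for li_selector in response.css('li'):
        title = li_selector.css('.type::text').extract_first()
        raised = li_selector.css('.raised::text').extract_first()
        # use title and raised vars

Note that in this case you should use extract_first to get only the first element rather than a list of elements. Also, some of the extracted values will be None; otherwise the two lists would have been the same size.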
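The loop above could be fleshed out into a spider callback along these lines. This is only a sketch: `clean` and `parse_rounds` are hypothetical names, the `li.startup_round` selector is taken from the class names in the question's HTML, and `response` is assumed to be a Scrapy response object.

```python
def clean(text):
    """Collapse runs of whitespace; return None for missing or empty values."""
    if text is None:
        return None
    return " ".join(text.split()) or None


def parse_rounds(response):
    """Yield one (round, date, raised) tuple per funding-round li element."""
    for li in response.css("li.startup_round"):
        yield (
            clean(li.css(".type::text").extract_first()),
            clean(li.css(".date_display::text").extract_first()),
            # "*::text" also collects text nested in child tags, such as
            # the <a> that wraps the amount in the last li element.
            clean(" ".join(li.css(".raised *::text").extract())),
        )
```

Because every field is looked up inside its own li, a missing date simply becomes None in that tuple instead of shifting the values of later entries.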

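Answer 1 (score: 0)

Can't you stop your path at '//div[@class="past_financing section"]/div/div/ul[@class="startup_rounds with_rounds"]/li/div/div/div/div/' (or even, it seems, one step higher) and then process/parse the corresponding HTML fragment?
The first part of the other answer is unlikely to work, but the second part looks good at the li level!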
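An XPath version of the same idea might look like the sketch below. The helper names (`class_token`, `field_xpath`, `parse_rounds`) are hypothetical, and `response` is assumed to be a Scrapy response. Note that a test such as `@class="type"` compares the whole attribute, so it skips elements like `class='type unknown'`; the predicate here matches a single class token instead.

```python
def class_token(name):
    """XPath predicate matching one token of a multi-valued class attribute."""
    return 'contains(concat(" ", normalize-space(@class), " "), " %s ")' % name


def field_xpath(name):
    """Relative XPath returning the trimmed text of the first matching div."""
    # The leading "." keeps the search inside the current li element.
    return "normalize-space(.//div[%s])" % class_token(name)


def parse_rounds(response):
    """Yield one (round, date, raised) tuple per funding-round li element."""
    for li in response.xpath("//li[%s]" % class_token("startup_round")):
        # normalize-space() yields "" when nothing matches; map that to None.
        yield tuple(
            li.xpath(field_xpath(name)).extract_first() or None
            for name in ("type", "date_display", "raised")
        )
```

Stopping the absolute path at the li level and using short relative expressions also avoids the long div/div/div/div chain, which breaks as soon as one entry nests its fields differently.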