Parsing stray text with Scrapy

Time: 2018-02-22 04:58:52

Tags: python web-scraping scrapy scrapy-spider

Any idea how to extract 'TEXT TO GRAB' from this markup:

<span class="navigation_page">
    <span>
        <a itemprop="url" href="http://www.example.com">
            <span itemprop="title">LINK</span>
        </a>
    </span>
    <span class="navigation-pipe">&gt;</span>
    TEXT TO GRAB
</span>

2 answers:

Answer 0 (score: 2)

Not ideal, but this selects the first text node that follows the navigation-pipe span:

text_to_grab = response.xpath('//span[@class="navigation-pipe"]/following-sibling::text()[1]').extract_first()
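
To see why `following-sibling::text()[1]` lands on the right node, it may help to look at the same markup with the standard library. This is only a sketch and not Scrapy: it assumes the fragment is well-formed XML (true for this sample, not for HTML in general) and uses `xml.etree.ElementTree`, where the text that follows an element is stored in that element's `tail`. The names `markup` and `pipe` are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; assumed to be well-formed XML.
markup = """
<span class="navigation_page">
    <span>
        <a itemprop="url" href="http://www.example.com">
            <span itemprop="title">LINK</span>
        </a>
    </span>
    <span class="navigation-pipe">&gt;</span>
    TEXT TO GRAB
</span>
"""

root = ET.fromstring(markup.strip())

# Locate the separator span; the stray text is the text node right after it.
pipe = root.find(".//span[@class='navigation-pipe']")

# ElementTree keeps the text following an element in its .tail attribute.
text_to_grab = (pipe.tail or "").strip() if pipe is not None else None
print(text_to_grab)
```

The text node includes the surrounding newlines and indentation, so the value returned by the XPath above usually needs a `.strip()` as well.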

Answer 1 (score: 1)

This isn't an ideal solution, but it should do the job:

from scrapy import Selector

content="""
<span class="navigation_page">
    <span>
        <a itemprop="url" href="http://www.example.com">
            <span itemprop="title">LINK</span>
        </a>
    </span>
    <span class="navigation-pipe">&gt;</span>
    TEXT TO GRAB
</span>
"""
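sel = Selector(text=content)
# ::text returns only the text nodes that are direct children of .navigation_page
item = sel.css(".navigation_page::text")
# The stray text is the last of those nodes; strip the surrounding whitespace
print(item.extract()[-1].strip())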
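
The `::text` selector works here because it returns only the direct text children of the outer span, skipping the link text and the `>` separator. A rough standard-library approximation of that behaviour is sketched below, for cases where Scrapy is not at hand. `DirectTextCollector` is a hypothetical helper, not part of any library, and it assumes every start tag has a matching end tag (void elements such as `<br>` would throw off the depth count).

```python
from html.parser import HTMLParser


class DirectTextCollector(HTMLParser):
    """Collect text nodes that are direct children of elements with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0   # 0 = outside the target, 1 = directly inside it
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the target: go one level deeper.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            # Entering the target element itself.
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when it sits directly under the target element.
        if self.depth == 1:
            self.texts.append(data)


content = """
<span class="navigation_page">
    <span>
        <a itemprop="url" href="http://www.example.com">
            <span itemprop="title">LINK</span>
        </a>
    </span>
    <span class="navigation-pipe">&gt;</span>
    TEXT TO GRAB
</span>
"""

parser = DirectTextCollector("navigation_page")
parser.feed(content)
parser.close()

# Drop whitespace-only nodes; what remains is the stray text.
stray = [t.strip() for t in parser.texts if t.strip()]
print(stray[-1])
```

Filtering out the whitespace-only nodes avoids relying on the stray text being the very last node, which is the fragile part of indexing with `[-1]`.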