Extracting nested tags together with the surrounding text as a string in Scrapy

Time: 2016-09-15 12:31:52

Tags: python scrapy

I am trying to extract data with Scrapy. Here is the relevant part of the HTML:

    <div class="readable">
        <span id="freeTextContainer2123443890291117716">I write because I need to. <br>I review because I want to.
        <br>I pay taxes because I have to.
        <br><br>If you want to follow me, my username is @colleenhoover pretty much everywhere except my email, which is colleenhooverbooks@gmail.com
        <br><br>Founder of
        <a target="_blank" href="http://www.thebookwormbox.com" rel="nofollow">www.thebookwormbox.com</a>
        <br><br></span>
    </div>

I want output like this:

    I write because I need to.
    I review because I want to.
    I pay taxes because I have to.

    If you want to follow me, my username is @colleenhoover pretty much everywhere except my email, which is colleenhooverbooks@gmail.com
    Founder of www.thebookwormbox.com

This is what I am trying:

    # Path to the candidate <span> elements inside the "about author" box
    base = ('//div[@id="aboutAuthor"]/div[@class="bigBoxBody"]'
            '/div[@class="bigBoxContent containerWithHeaderContent"]'
            '/div[@class="readable"]/span')
    # Use the first span if it is the only one, otherwise the second span
    index = 1 if len(response.xpath(base)) == 1 else 2
    aboutauthor = response.xpath('%s[%d]/text()' % (base, index)).extract()
    print aboutauthor

The output I get:

    [u'I write because I need to. ', u'I review because I want to. ', u'I pay taxes
    because I have to. ', u'If you want to follow me, my username is @colleenhoover
    pretty much everywhere except my email, which is colleenhooverbooks@gmail.com',
    u'Founder of ', u' ']

How can I get www.thebookwormbox.com in the output as well?

1 answer:

Answer 0 (score: 2)

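As per my comment, you can use an XPath ending in //text() to get the text content of all descendants. /text() only selects text nodes that are direct children of the span, so the text inside the nested <a> tag is skipped; //text() also selects the text nodes inside nested tags.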