How do I extract text together with hyperlink text in Scrapy?

Time: 2017-04-10 12:52:55

Tags: csv web-scraping scrapy

I want to extract text from the following HTML:

<li>
    <a test="test" href="abc.html" id="11">Click Here</a>
    "for further reference"
</li>

I am trying the following extraction command:

response.css("article div#section-2 li::text").extract()

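But it only returns the "for further reference" line, whereas the expected output is "Click Here for further reference" as a single string. How can I do this? And how should it be modified to do the same when the content follows any of these patterns:

  1. Text Hyperlink Text
  2. Hyperlink Text
  3. Text Hyperlink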

2 Answers:

Answer 0 (score: 4)

There are at least a couple of ways to do this.

Let's first build a test selector that mimics your response:

>>> import scrapy
>>> response = scrapy.Selector(text="""<li>
...     <a test="test" href="abc.html" id="11">Click Here</a>
...     "for further reference"
... </li>""")

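The first option is a tiny change to your CSS selector: select all text descendants, not only the direct text children (note the space between li and the ::text pseudo-element):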

# this is your CSS select,
# which only gives direct children text of your selected LI
>>> response.css("li::text").extract()    
[u'\n    ', u'\n    "for further reference"\n']

# notice the extra space
#                 here
#                   |
#                   v
>>> response.css("li ::text").extract()
[u'\n    ', u'Click Here', u'\n    "for further reference"\n']

# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n    Click Here\n    "for further reference"\n'
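
The joined string above still carries the newlines and indentation from the HTML source. Here is a minimal sketch of cleaning that up on the Python side, assuming the input is the list of fragments returned by `li ::text`. `join_text_nodes` is a hypothetical helper name, not part of Scrapy:

```python
def join_text_nodes(fragments):
    """Concatenate text fragments and collapse each whitespace run to one space."""
    # Join first so the fragments keep their original order and spacing,
    # then let str.split() drop leading, trailing, and repeated whitespace.
    return " ".join("".join(fragments).split())


# These fragments are copied from the `li ::text` output shown above.
fragments = ['\n    ', 'Click Here', '\n    "for further reference"\n']
print(join_text_nodes(fragments))
```

The double quotes remain in the result because they are literal characters in the page text, not markup.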

Another option is to chain your .css() call with a subsequent .xpath() call that uses the XPath 1.0 functions string() or normalize-space():

>>> response.css("li").xpath('string()').extract()
[u'\n    Click Here\n    "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']

# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'

Answer 1 (score: 0)

If the selector looks like that, I would use XPath:

# text of the hyperlink, e.g. "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()
# target of the hyperlink, e.g. "abc.html"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()
# text nodes directly under the li, e.g. "for further reference" plus surrounding whitespace
response.xpath('//article/div[@id="section-2"]/li/text()').extract()