I want to extract text from the following HTML:
<li>
<a test="test" href="abc.html" id="11">Click Here</a>
"for further reference"
</li>
I am trying the following extraction command:
response.css("article div#section-2 li::text").extract()
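But it only returns the "for further reference" line, whereas the expected output is "Click Here for further reference" as a single string. How can I do this? And if the following patterns are present, how should it be modified to do the same: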
Answer 0: (score: 4)
There are at least a couple of ways to do this.
Let's first build a test selector that mimics your response:
>>> import scrapy
>>> response = scrapy.Selector(text="""<li>
... <a test="test" href="abc.html" id="11">Click Here</a>
... "for further reference"
... </li>""")
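The first option is a minor change to your CSS selector.
Look at all text descendants, not only the text children (note the space between li and the ::text pseudo-element):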
# this is your CSS selector,
# which only gives the direct child text nodes of the selected LI
>>> response.css("li::text").extract()
[u'\n ', u'\n "for further reference"\n']
# notice the extra space
# here
#                   |
#                   v
>>> response.css("li ::text").extract()
[u'\n ', u'Click Here', u'\n "for further reference"\n']
# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n Click Here\n "for further reference"\n'
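The joined string above still carries the newlines from the markup, and the literal double quotes that are part of the text node. As a minimal sketch of one way to tidy that up in plain Python, the helper below collapses runs of whitespace and can optionally drop the quotes. Note that `join_text_nodes` and its `strip_quotes` flag are hypothetical names introduced here for illustration and are not part of Scrapy; the sample list is copied from the output shown above.

```python
def join_text_nodes(parts, strip_quotes=False):
    """Join extracted text nodes into one whitespace-normalized string.

    `parts` is a list of strings, such as the one returned by
    response.css("li ::text").extract().
    `strip_quotes` is a hypothetical convenience flag (not a Scrapy option)
    that removes literal double quotes from the text.
    """
    text = "".join(parts)
    if strip_quotes:
        text = text.replace('"', "")
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


# Sample list copied from the session above
parts = ['\n ', 'Click Here', '\n "for further reference"\n']
print(join_text_nodes(parts))
print(join_text_nodes(parts, strip_quotes=True))
```

With `strip_quotes=True` the result should match the single string asked for in the question, assuming the quotes in the markup are not wanted in the output.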
Another option is to chain your .css() call with a subsequent .xpath() call that uses XPath 1.0 string() or normalize-space():
>>> response.css("li").xpath('string()').extract()
[u'\n Click Here\n "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']
# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'
Answer 1: (score: 0)
If the selectors look like this, I would use XPath:
response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()  # text of the hyperlink: "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()  # href of the hyperlink: "abc.html"
response.xpath('//article/div[@id="section-2"]/li/text()').extract()  # text nodes directly under the li, including "for further reference"