Question

I want to extract the text from the following HTML:

<li>
    <a test="test" href="abc.html" id="11">Click Here</a>
    "for further reference"
</li>

I am trying the following extraction command:

response.css("article div#section-2 li::text").extract()

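But it only returns the "for further reference" line, whereas the expected output is "Click Here for further reference" as a single string. How can I do this? And how should it be modified so that it also works for the following patterns:

Text Hyperlink Text
Hyperlink Text
Text Hyperlink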

Answer 1

There are at least a couple of ways to do this.

Let's first build a test selector that mimics your response:

>>> response = scrapy.Selector(text="""<li>
...     <a test="test" href="abc.html" id="11">Click Here</a>
...     "for further reference"
... </li>""")

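First option: a minor change to your CSS selector. Select all text descendants, not just the text children (note the space between li and the ::text pseudo-element):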

# this is your CSS select,
# which only gives direct children text of your selected LI
>>> response.css("li::text").extract()    
[u'\n    ', u'\n    "for further reference"\n']

# notice the extra space
#                 here
#                   |
#                   v
>>> response.css("li ::text").extract()
[u'\n    ', u'Click Here', u'\n    "for further reference"\n']

# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n    Click Here\n    "for further reference"\n'
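
The joined string above still carries the newlines and indentation from the HTML source. Here is a minimal sketch of one way to tidy it using only plain Python string methods. The helper name `join_fragments` is hypothetical, and the fragment list is copied from the output shown above:

```python
def join_fragments(fragments):
    """Concatenate text fragments and collapse whitespace runs into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace
    return " ".join("".join(fragments).split())


# fragments as returned by response.css("li ::text").extract() above
fragments = ['\n    ', 'Click Here', '\n    "for further reference"\n']

print(join_fragments(fragments))
# prints: Click Here "for further reference"
```

This gives the same result as normalize-space() for this input, but stays entirely on the Python side.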

Another option is to chain your .css() call with a subsequent .xpath() call that uses the XPath 1.0 string() or normalize-space() function:

>>> response.css("li").xpath('string()').extract()
[u'\n    Click Here\n    "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']

# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'

Answer 2

I would use XPath; in that case the selectors would be:

# text of the hyperlink: "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()
# href of the hyperlink: "abc.html"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()
# text directly inside the li: "for further reference"
response.xpath('//article/div[@id="section-2"]/li/text()').extract()

How to extract text along with hyperlink text in Scrapy?