I want to get the "id" of each li element together with the corresponding text of the anchor tag inside it.
<li id="1" class="list">
<a class="tim">This is Link1</a>
<li id="2" class="list">
<a class="tim">This is Link2</a>
<li id="3" class="list">
<a class="tim">This is Link3</a>
I tried the following code:
from scrapy.http import HtmlResponse
response = HtmlResponse(url="some url", body=htmltext, encoding='utf8')
for x in response.css('li::attr(id)').extract():
    item = {}
    item['id'] = x
    item['value'] = x.css('a.tim::text').extract()
But the last line raises AttributeError: 'unicode' object has no attribute 'css'.
Answer 0 (score: 1)
extract() returns the attribute values as plain strings, so what you are looping over is a list of strings, not selectors:
>>> response.css('li::attr(id)').extract()
['1', '2', '3']
Rather than extracting and then looping, select the li elements themselves (not the attribute) and loop over the resulting Selector instances:
for x in response.css('li[id]'): # li elements that have an id attribute
    item = {
        'id': x.css('::attr(id)').extract_first(),
        'value': x.css('a.tim::text').extract_first(),
    }
This produces, for each li element, a dictionary with the desired id and value keys:
>>> for x in response.css('li[id]'): # li elements that have an id attribute
...     item = {
...         'id': x.css('::attr(id)').extract_first(),
...         'value': x.css('a.tim::text').extract_first(),
...     }
...     print(item)
...
{'id': '1', 'value': 'This is Link1'}
{'id': '2', 'value': 'This is Link2'}
{'id': '3', 'value': 'This is Link3'}