Question

I'm a newbie trying to use the Scrapy framework to scrape some data, but I've run into trouble:

HTML A:

<ul class="tip" id="tip1">
    <li id="tip1_0">
        <a href="http://***" title="***" target="_self">***
        </a>
    </li>
    <li id="tip1_1">
        <a href="http://***" title="***" target="_self">***
        </a>
    </li>
    <li id="tip1_2">
        <a href="http://***" title="***" target="_self">***
        </a>
    </li>
</ul>

I use:

f = response.xpath("//*[@id='tip1']//li/a/@href | //*[@id='tip1']//li/a/@title").extract()

This gives me f as a list, which I then convert into a dict (name0 = f[0], value0 = f[1], name1 = f[2], value1 = f[3], and so on). Is there an easier way?

HTML B:

<div class="info">
    <a target="_blank" href="***" title="***">
    </a>
</div>
<div class="info">
    <a target="_blank" href="***" title="***">
    </a>
</div>
<div class="info">
    <a target="_blank" href="***" title="***">
    </a>
</div>

In this case:

file = response.xpath('//div[@class="info"]')
for line in file:
    f = line.xpath('/a/@href').extract()
    d = line.xpath('/a/@title').extract()

But it doesn't work; it only returns 'f = []' and 'd = []', so I'm confused. How can I fix this? Thanks a lot.

Answer 1

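An XPath that starts with / is absolute: it is evaluated from the document root, not from the current div, which is why your loop returned empty lists. You can make the inner expressions relative to the current node by prepending a dot: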

f = line.xpath('./a/@href').extract()
d = line.xpath('./a/@title').extract()

Alternatively, point your outer expression at the a elements and read @href and @title from each one:

file = response.xpath('//div[@class="info"]/a')
for line in file:
    f = line.xpath('@href').extract_first()
    d = line.xpath('@title').extract_first()

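Also note the use of the extract_first() method, which returns the first match as a string (or None if nothing matches) instead of a list.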
