Receiving an empty list when trying to use a webscraper to parse a website for links

Asked: 2016-09-12 04:57:09

Tags: python html web-scraping lxml

I'm reading this website and learning how to make a webscraper using lxml and requests. Here is the webscraper code:

from lxml import html
import requests

web_page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(web_page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')
print "These are the buyers: ", buyers
print "And these are the prices: ", prices

It works as expected, but when I try to scrape https://www.reddit.com/r/cringe/ for all the links, I just get:

[]

What's wrong with the xpath I'm using? I can't figure out what to put inside the square brackets of the xpath.

1 Answer:

Answer 0 (score: 2):

First of all, your xpath is wrong: there is no class equal to data-url, it is an attribute, so you need div[@data-url], and to extract the attribute value you would use /@data-url:

from lxml import html
import requests

headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36"}
web_page = requests.get("https://www.reddit.com/r/cringe/", headers=headers)

tree = html.fromstring(web_page.content)

links = tree.xpath('//div[@data-url]/@data-url')

print links
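Since that snippet needs network access, the same pattern can be checked offline against an inline snippet (hypothetical markup standing in for reddit's listing page, written in Python 3 syntax):

```python
from lxml import html

# Hypothetical markup standing in for reddit's listing page
snippet = """
<html><body>
  <div class="thing" data-url="https://www.youtube.com/watch?v=abc">post 1</div>
  <div class="thing" data-url="https://i.imgur.com/xyz.jpg">post 2</div>
  <div class="other">no data-url here</div>
</body></html>
"""

tree = html.fromstring(snippet)

# //div[@data-url] matches only the divs that carry the attribute;
# the trailing /@data-url pulls out the attribute's value as a string
links = tree.xpath('//div[@data-url]/@data-url')
print(links)  # ['https://www.youtube.com/watch?v=abc', 'https://i.imgur.com/xyz.jpg']
```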

If you query too often or don't use a user-agent, you may see html like the following, so respect what they recommend:

<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 6 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>
</html>
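The message above asks for no more than one request every two seconds. A minimal way to honour that is to sleep between requests; this is only a sketch, with a hypothetical fetch_all helper and Python 3 syntax:

```python
import time

def fetch_all(urls, fetch, min_interval=2.0):
    # Call fetch(url) for each url, waiting at least min_interval seconds
    # between calls, per reddit's one-request-every-two-seconds guideline.
    results = []
    last = None
    for url in urls:
        if last is not None:
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        results.append(fetch(url))
    return results

# Stand-in fetch function so the sketch runs without hitting the network
pages = fetch_all(["a", "b", "c"], fetch=lambda u: u.upper(), min_interval=0.1)
print(pages)  # ['A', 'B', 'C']
```

In real use you would pass something like `lambda u: requests.get(u, headers=headers)` as the fetch function.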

If you plan on scraping a lot of reddit, you may want to look at PRAW, and w3schools has a good introduction to xpath expressions.

To break it down:

//div[@data-url]

searches the document for divs that have the attribute data-url; we don't care what the attribute value is, we just want any div that has the attribute.

That alone just finds the divs; if you removed the /@data-url, you would end up with a list of elements like:

[<Element div at 0x7fbb27a9a940>, <Element div at 0x7fbb27a9a8e8>,..

/@data-url actually extracts the attribute values, i.e. the hrefs.
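Concretely, without the trailing /@data-url an xpath query returns Element objects, and reading the attribute off each element with .get() yields the same strings that /@data-url returns directly (hypothetical markup, Python 3 syntax):

```python
from lxml import html

snippet = '<html><body><div data-url="https://example.com/a"></div></body></html>'
tree = html.fromstring(snippet)

elements = tree.xpath('//div[@data-url]')           # list of Element objects
values = tree.xpath('//div[@data-url]/@data-url')   # list of strings

# Each Element still carries the attribute; .get() reads it directly
print([e.get('data-url') for e in elements])  # ['https://example.com/a']
print(values)                                 # ['https://example.com/a']
```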

Also, if you only want specific links, say the youtube links, you can filter using contains:

'//div[contains(@data-url, "www.youtube.com")]/@data-url'

contains(@data-url, "www.youtube.com") checks whether the data-url attribute value contains www.youtube.com, so the output will be a list of youtube links.
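As a quick offline check of that filter (again with hypothetical markup and Python 3 syntax):

```python
from lxml import html

snippet = """
<html><body>
  <div data-url="https://www.youtube.com/watch?v=abc"></div>
  <div data-url="https://i.imgur.com/xyz.jpg"></div>
  <div data-url="https://www.youtube.com/watch?v=def"></div>
</body></html>
"""
tree = html.fromstring(snippet)

# contains() keeps only divs whose data-url value has "www.youtube.com" in it
yt_links = tree.xpath('//div[contains(@data-url, "www.youtube.com")]/@data-url')
print(yt_links)  # ['https://www.youtube.com/watch?v=abc', 'https://www.youtube.com/watch?v=def']
```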