Receiving an empty list when trying to use a webscraper to parse a website for links

Asked: 2016-09-12 04:57:09

Tags: python html web-scraping lxml

I'm reading this website and learning how to make a webscraper using lxml and requests. Here is the webscraper code:

from lxml import html
import requests

web_page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(web_page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')
print "These are the buyers: ", buyers
print "And these are the prices: ", prices

It works as expected, but when I try to scrape https://www.reddit.com/r/cringe/ for all the links, I just get:

[]

What's wrong with the xpath I'm using? I can't figure out what to put inside the square brackets of the xpath.

1 Answer:

Answer 0 (score: 2):

First of all, your xpath is wrong: there is no class equal to data-url, it is an attribute, so you need div[@data-url], and to extract the attribute value you would use /@data-url:

from lxml import html
import requests

headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36"}
web_page = requests.get("https://www.reddit.com/r/cringe/", headers=headers)

tree = html.fromstring(web_page.content)

links = tree.xpath('//div[@data-url]/@data-url')

print links
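Since that snippet needs network access, the same pattern can be checked offline against an inline snippet (hypothetical markup standing in for reddit's listing page, written in Python 3 syntax):

```python
from lxml import html

# Hypothetical markup standing in for reddit's listing page
snippet = """
<html><body>
  <div class="thing" data-url="https://www.youtube.com/watch?v=abc">post 1</div>
  <div class="thing" data-url="https://i.imgur.com/xyz.jpg">post 2</div>
  <div class="other">no data-url here</div>
</body></html>
"""

tree = html.fromstring(snippet)

# //div[@data-url] matches only the divs that carry the attribute;
# the trailing /@data-url pulls out the attribute's value as a string
links = tree.xpath('//div[@data-url]/@data-url')
print(links)  # ['https://www.youtube.com/watch?v=abc', 'https://i.imgur.com/xyz.jpg']
```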

If you query too often or don't use a user-agent, you may see html like the following, so respect what they recommend:

<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 6 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>
</html>
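The message above asks for no more than one request every two seconds. A minimal way to honour that is to sleep between requests; this is only a sketch, with a hypothetical fetch_all helper and Python 3 syntax:

```python
import time

def fetch_all(urls, fetch, min_interval=2.0):
    # Call fetch(url) for each url, waiting at least min_interval seconds
    # between calls, per reddit's one-request-every-two-seconds guideline.
    results = []
    last = None
    for url in urls:
        if last is not None:
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        results.append(fetch(url))
    return results

# Stand-in fetch function so the sketch runs without hitting the network
pages = fetch_all(["a", "b", "c"], fetch=lambda u: u.upper(), min_interval=0.1)
print(pages)  # ['A', 'B', 'C']
```

In real use you would pass something like `lambda u: requests.get(u, headers=headers)` as the fetch function.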

If you plan on scraping a lot of reddit, you may want to look at PRAW, and w3schools has a good introduction to xpath expressions.

To break it down:

//div[@data-url]

searches the document for divs that have the attribute data-url; we don't care what the attribute value is, we just want any div that has the attribute.

That alone just finds the divs; if you removed the /@data-url, you would end up with a list of elements like:

[<Element div at 0x7fbb27a9a940>, <Element div at 0x7fbb27a9a8e8>,..

/@data-url actually extracts the attribute values, i.e. the hrefs.
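Concretely, without the trailing /@data-url an xpath query returns Element objects, and reading the attribute off each element with .get() yields the same strings that /@data-url returns directly (hypothetical markup, Python 3 syntax):

```python
from lxml import html

snippet = '<html><body><div data-url="https://example.com/a"></div></body></html>'
tree = html.fromstring(snippet)

elements = tree.xpath('//div[@data-url]')           # list of Element objects
values = tree.xpath('//div[@data-url]/@data-url')   # list of strings

# Each Element still carries the attribute; .get() reads it directly
print([e.get('data-url') for e in elements])  # ['https://example.com/a']
print(values)                                 # ['https://example.com/a']
```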

Also, if you only want specific links, say the youtube links, you can filter using contains:

'//div[contains(@data-url, "www.youtube.com")]/@data-url'

contains(@data-url, "www.youtube.com") checks whether the data-url attribute value contains www.youtube.com, so the output will be a list of youtube links.
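As a quick offline check of that filter (again with hypothetical markup and Python 3 syntax):

```python
from lxml import html

snippet = """
<html><body>
  <div data-url="https://www.youtube.com/watch?v=abc"></div>
  <div data-url="https://i.imgur.com/xyz.jpg"></div>
  <div data-url="https://www.youtube.com/watch?v=def"></div>
</body></html>
"""
tree = html.fromstring(snippet)

# contains() keeps only divs whose data-url value has "www.youtube.com" in it
yt_links = tree.xpath('//div[contains(@data-url, "www.youtube.com")]/@data-url')
print(yt_links)  # ['https://www.youtube.com/watch?v=abc', 'https://www.youtube.com/watch?v=def']
```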