I'm reading this website and learning how to make a webscraper using lxml and requests. This is the webscraper code:
from lxml import html
import requests
web_page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(web_page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')
print "These are the buyers: ", buyers
print "And these are the prices: ", prices
It works as expected, but when I try to scrape https://www.reddit.com/r/cringe/ for all the links, I just get:
[]
What's wrong with the xpath I'm using? I can't figure out what to put in the square brackets of the xpath.
Answer 0 (score: 2)
First of all, your xpath is wrong: there is no class with the value data-url, it is an attribute, so you need div[@data-url],
and to extract the attribute you would use /@data-url:
from lxml import html
import requests

# send a browser-like user-agent so reddit does not treat the request as a bot
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36"}
web_page = requests.get("https://www.reddit.com/r/cringe/", headers=headers)
tree = html.fromstring(web_page.content)

# select the data-url attribute of every div that has one
links = tree.xpath('//div[@data-url]/@data-url')
print links
If you query too often or don't send a user-agent, you may see html like the following instead, so respect what they recommend:
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
</body>
</html>
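If you do want to keep polling the site with requests, one simple way to stay within that limit is to sleep between requests. This is only a minimal sketch, not part of the original answer; the two-second delay matches the quoted recommendation, and the pages list is a made-up example of URLs you might fetch:
import time
from lxml import html
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36"}

# hypothetical list of pages to fetch one after another
pages = ["https://www.reddit.com/r/cringe/", "https://www.reddit.com/r/cringe/new/"]

all_links = []
for url in pages:
    web_page = requests.get(url, headers=headers)
    tree = html.fromstring(web_page.content)
    all_links.extend(tree.xpath('//div[@data-url]/@data-url'))
    time.sleep(2)  # no more than one request every two seconds, as reddit asks

print(all_links)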
If you plan on doing a lot of reddit scraping, you may want to look at PRAW, and w3schools has a good introduction to xpath expressions.
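For a rough idea of what the PRAW route looks like, here is a minimal sketch assuming a current PRAW release and registered reddit API credentials; the client_id, client_secret and user_agent values are placeholders you would replace with your own:
import praw

# placeholder credentials - register a script app on reddit to get real ones
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="cringe-link-scraper/0.1")

# PRAW handles the API rate limiting for you; print the link of each hot submission
for submission in reddit.subreddit("cringe").hot(limit=25):
    print(submission.url)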
To break it down:
//div[@data-url]
searches the document for divs that have the attribute data-url; we don't care what the attribute value is, we just want divs that have it.
That on its own just finds the divs: if you removed the /@data-url you would end up with a list of elements like:
[<Element div at 0x7fbb27a9a940>, <Element div at 0x7fbb27a9a8e8>,..
/@data-url actually extracts the attribute values, i.e. the hrefs.
Also, if you only wanted specific links, say the youtube links, you could filter using contains:
'//div[contains(@data-url, "www.youtube.com")]/@data-url'
contains(@data-url, "www.youtube.com") checks whether the data-url attribute value contains www.youtube.com, so the output will be a list of youtube links.
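To see the difference between the three expressions without hitting reddit at all, you can run them against a small inline HTML snippet; the snippet below is made up purely for illustration:
from lxml import html

# tiny made-up document with the same structure the answer relies on
snippet = """
<html><body>
  <div class="thing" data-url="https://www.youtube.com/watch?v=abc123"></div>
  <div class="thing" data-url="https://i.imgur.com/example.jpg"></div>
  <div class="other"></div>
</body></html>
"""

tree = html.fromstring(snippet)

print(tree.xpath('//div[@data-url]'))            # list of Element div objects
print(tree.xpath('//div[@data-url]/@data-url'))  # list of the attribute values
print(tree.xpath('//div[contains(@data-url, "www.youtube.com")]/@data-url'))  # only the youtube link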