Question

I am trying to use lxml in Python to scrape a specific element from a website. My code is below, but it produces no output.

    import requests
    from lxml import html

    webpage = 'http://www.funda.nl/koop/heel-nederland/'
    page = requests.get(webpage)
    tree = html.fromstring(page.content)

    content = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()'
    content = str(tree.xpath(content))
    print(content)

Answer 1

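It looks like the website you are trying to scrape does not like being scraped. It uses various techniques to detect whether a request comes from a legitimate user or from a bot, and it blocks access if it thinks the request comes from a bot. That is why your XPath does not find anything, and why you should reconsider what you are doing.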
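One way to confirm this is to look at what the server actually sent back, rather than only at the XPath result. The following is a minimal sketch, not the site's real behavior: the helper name `describe_page` and the `blocked` sample HTML are made up for illustration, and a real block page may look quite different.

```python
from lxml import html


def describe_page(page_bytes, selector):
    """Summarize a response body to help tell a block page from the real one."""
    tree = html.fromstring(page_bytes)
    return {
        # Title of the returned document, or None if it has no <title>.
        "title": tree.findtext(".//title"),
        # Whether the element the XPath starts from exists at all.
        "has_content_div": bool(tree.xpath('//*[@id="content"]')),
        # Number of nodes the full selector matched.
        "matches": len(tree.xpath(selector)),
    }


# Hypothetical stand-in for what a blocked request might return.
blocked = (
    b"<html><head><title>Access denied</title></head>"
    b"<body><p>Please verify you are human.</p></body></html>"
)
selector = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()'
summary = describe_page(blocked, selector)
print(summary)
```

With a real response you would pass `page.content` instead of `blocked`, and it is also worth checking `page.status_code`. If `has_content_div` is false, the page you received is probably not the one you see in the browser.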

If you decide to continue anyway, the easiest way to fool this particular website seems to be to add cookies to your request.

First, get the cookie string using a real browser:

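1. Open a new tab.
2. Open the developer tools.
3. Go to the "Network" tab in the developer tools.
4. If the Network tab is empty, refresh the page.
5. Find the request for heel-nederland/ and click on it.
6. In the request headers you will find the cookie string. It is long and contains many seemingly random characters. Copy it.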

Then, modify your program to use these cookies:

    import requests
    from lxml import html

    webpage = 'http://www.funda.nl/koop/heel-nederland/'
    # Send the cookies copied from the browser along with the request.
    headers = {
        'Cookie': '<string copied from browser>',
    }
    page = requests.get(webpage, headers=headers)
    tree = html.fromstring(page.content)

    # xpath() returns a list; it is empty when nothing matches.
    selector = '//*[@id="content"]/form/div[2]/div[5]/div/a[8]/text()'
    content = str(tree.xpath(selector))
    print(content)
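As an alternative to sending the raw `Cookie` header, the copied string can be split into name/value pairs and handed to requests through its `cookies` parameter. This is only a sketch: the helper name `cookie_header_to_dict` and the sample cookie names are invented for illustration, and it assumes the string has the usual `name=value; name=value` shape.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a browser 'Cookie' header value into a {name: value} dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Made-up example string; the real one comes from the browser.
example = "session=abc123; token=x=y; "
parsed = cookie_header_to_dict(example)
print(parsed)
```

The result can then be passed as `requests.get(webpage, cookies=parsed)`. Keeping the cookies in a dict makes it easier to see which ones the site actually needs.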

Python lxml xpath returns no output
