I'm getting an error from Python that I can't make sense of. I've reduced my code to the bare minimum:
response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)
and I still get the error:
Traceback (most recent call last):
  File "ultimate-1.py", line 17, in <module>
    r = tree.xpath('//divass="campaign"]/a/@href')
  File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
  File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Does anyone know where the problem comes from? Could it be a dependency issue? Thanks.
Answer 0: (score: 1)
The expression '//divass="campaign"]/a/@href' is not syntactically valid and does not make much sense. You most likely meant to check the class attribute instead:
//div[@class="campaign"]/a/@href
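To see the difference between the two expressions without hitting the network, here is a minimal sketch that runs both against a small inline document. The SAMPLE markup and the example.com URLs are made up for illustration and are not taken from the real page.

```python
from lxml import etree, html

# Hypothetical markup, only meant to mirror the structure the XPath targets.
SAMPLE = """
<html><body>
  <div class="campaign"><a href="http://example.com/one">One</a></div>
  <div class="campaign"><a href="http://example.com/two">Two</a></div>
  <div class="other"><a href="http://example.com/skip">Skip</a></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The malformed expression from the question is rejected by lxml.
# etree.XPathError is the common base class of lxml's XPath exceptions.
try:
    tree.xpath('//divass="campaign"]/a/@href')
    malformed_ok = True
except etree.XPathError as exc:
    malformed_ok = False
    print("malformed expression rejected:", exc)

# The corrected expression selects the href of each link inside div.campaign.
links = tree.xpath('//div[@class="campaign"]/a/@href')
print(links)
```

Only the two links inside the `campaign` divs are returned; the link in the `other` div is skipped.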
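The corrected expression avoids the Invalid expression error, but it will not match anything, because the response that requests receives does not contain the data. You need to mimic what the browser does to obtain it and make an additional request for the JavaScript file that contains the campaigns.
This worked for me: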
import ast
import re

import requests
from lxml import html

with requests.Session() as session:
    # extract the script URL from the archive page
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]

    # get the script; use .text (str) so the str pattern also works on Python 3
    response = session.get(script_url)
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))

    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
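The unwrapping step in the code above (pulling the HTML string out of a document.write(...) call) can be tried offline. This is only a sketch, assuming the script body is a single document.write call with a double-quoted string argument. SAMPLE_JS is a made-up stand-in, since the real response is not shown here. The answer also strips leftover backslashes from each href, which suggests the real payload carries extra escaping that this sample does not model.

```python
import ast
import re

# Hypothetical script body; the real one comes from the 'generate-js' URL.
SAMPLE_JS = r'document.write("<div class=\"campaign\"><a href=\"http://example.com/?id=1\">Issue 1</a></div>");'

# Capture everything between "document.write(" and the closing ");".
match = re.match(r'document\.write\((.*?)\);$', SAMPLE_JS)

# The captured text is a quoted string literal; literal_eval turns it into
# a plain str without executing any code.
payload = ast.literal_eval(match.group(1))
print(payload)
```

The resulting string is ordinary HTML, so it can be handed to html.fromstring and queried with the corrected XPath.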
Answer 1: (score: 0)
There is a mistake in your XPath. If you want to get all the hrefs, select the links and read the attribute from each one:
hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))
Or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]