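I'm trying to use lxml to find all links to images (.png, .bmp, .jpg) and executables (.exe) among a page's anchor tags. The accepted answer in a similar thread suggests doing something like this: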
png = tree.xpath("//div/ul/li//a[ends-with(@href, '.png')]")
bmp = tree.xpath("//div/ul/li//a[ends-with(@href, '.bmp')]")
jpg = tree.xpath("//div/ul/li//a[ends-with(@href, '.jpg')]")
exe = tree.xpath("//div/ul/li//a[ends-with(@href, '.exe')]")
However, I keep getting this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2095, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:53597)
File "xpath.pxi", line 373, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:134052)
File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:132625)
File "xpath.pxi", line 226, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:132453)
lxml.etree.XPathEvalError: Unregistered function
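I'm running lxml 3.2.4, installed via pip.
Also, is there a way to specify all four file extensions in a single XPath expression, instead of defining a separate XPath for each extension?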
Answer 0 (score: 3)
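ends-with is a function defined in XPath 2.0, XQuery 1.0 and XSLT 2.0, while lxml only supports XPath 1.0, XSLT 1.0 and the EXSLT extensions, so you can't use this function. The documentation is here and here.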
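If you want to stay within plain XPath 1.0, ends-with can be emulated with substring() and string-length(), both of which XPath 1.0 does define. The following is a minimal sketch rather than a definitive implementation: the sample markup and the helper name ends_with_xpath are invented for illustration, and it assumes the suffixes contain no single quotes.

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<div><ul>
  <li><a href="/img/logo.png">logo</a></li>
  <li><a href="/img/photo.jpg">photo</a></li>
  <li><a href="/docs/readme.txt">readme</a></li>
  <li><a href="/bin/setup.exe">setup</a></li>
</ul></div>
"""


def ends_with_xpath(attr, suffix):
    """Build an XPath 1.0 test that emulates ends-with(attr, suffix).

    substring(s, n) returns s from its n-th character (1-based) to the end,
    so starting at string-length(s) - (len(suffix) - 1) yields the last
    len(suffix) characters of s.
    """
    offset = len(suffix) - 1
    return "substring({0}, string-length({0}) - {1}) = '{2}'".format(
        attr, offset, suffix
    )


tree = html.fromstring(SAMPLE)

# Join one test per extension with "or" so a single query covers all four.
predicate = " or ".join(
    ends_with_xpath("@href", ext) for ext in (".png", ".bmp", ".jpg", ".exe")
)
matches = tree.xpath("//a[{0}]".format(predicate))
hrefs = [a.get("href") for a in matches]
print(hrefs)
```

The comparison is case-sensitive, so a link ending in .PNG would not match.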
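You can use regular expressions in XPath through the EXSLT extensions. Here is sample code that returns the nodes matching a regular expression: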
regexpNS = 'http://exslt.org/regular-expressions'
# The leading \. keeps the pattern from matching hrefs that merely end in "png" etc.
tree.xpath(r"//a[re:test(@href, '\.(png|bmp|jpg|exe)$')]", namespaces={'re': regexpNS})
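For a version that can be run as-is, here is a small self-contained sketch of the same EXSLT approach. The sample markup is made up for illustration, and the third argument 'i' (case-insensitive matching) is an addition that the snippet above does not use.

```python
from lxml import html

REGEXP_NS = "http://exslt.org/regular-expressions"

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<ul>
  <li><a href="/img/logo.png">logo</a></li>
  <li><a href="/img/BANNER.PNG">banner</a></li>
  <li><a href="/notes/jpg-tips.html">tips</a></li>
  <li><a href="/files/archivepng">no dot</a></li>
  <li><a href="/bin/setup.exe">setup</a></li>
</ul>
"""

tree = html.fromstring(SAMPLE)

# re:test(string, pattern, flags): the 'i' flag ignores case, and the
# escaped dot requires a real ".ext" suffix at the end of the href.
file_links = tree.xpath(
    r"//a[re:test(@href, '\.(png|bmp|jpg|exe)$', 'i')]/@href",
    namespaces={"re": REGEXP_NS},
)
print(file_links)
```

Selecting /@href returns the attribute values directly; drop it to get the a elements instead.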
Similar questions: Python, XPath: Find all links to images and regular-expressions-in-xpath
Answer 1 (score: 0)
I think the problem is that the underlying library doesn't recognize the ends-with function. The documentation discusses working with links. I think a better solution is:
from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

# make_links_absolute() and iterlinks() are available on lxml.html elements
tree.make_links_absolute('http://example.com/')
links = []
for element, attribute, link, pos in tree.iterlinks():
    url = urlparse(link)  # ensures you are getting the remote file path
    if url.path.endswith(('.png', '.bmp', '.jpg', '.exe')):
        # there are other ways you could filter the links here
        links.append(link)
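Here is a self-contained sketch of that loop, written for Python 3. The sample markup, the EXTENSIONS tuple, and the restriction to a tags are assumptions made for illustration. It also shows one reason to parse the URL first: a query string after the file name does not defeat the extension check.

```python
from urllib.parse import urlparse

from lxml import html

EXTENSIONS = (".png", ".bmp", ".jpg", ".exe")

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<div>
  <a href="img/logo.png">logo</a>
  <a href="/downloads/setup.exe?mirror=2">setup</a>
  <a href="about.html">about</a>
  <img src="/img/inline.jpg">
</div>
"""

tree = html.fromstring(SAMPLE)
tree.make_links_absolute("http://example.com/")

links = []
# iterlinks() yields (element, attribute, link, pos) for every link in the
# tree, including src attributes, so non-anchor elements are skipped here.
for element, attribute, link, pos in tree.iterlinks():
    if element.tag != "a":
        continue
    # Compare only the path component, ignoring any query string or fragment.
    if urlparse(link).path.lower().endswith(EXTENSIONS):
        links.append(link)

print(links)
```

str.endswith accepts a tuple of suffixes, which avoids chaining several or conditions.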