Consider the following code:
<div class="tag1">
<div>
<a class="tag11 tag12" href="http://www.example.com/file1" title="file1"><img class="tag2" src="http://www.example.com/img1.jpg" alt="textalt">linktext</a>
<span class="tag3">.</span>
</div>
<div>
<a class="tag11 tag12" href="http://www.example.com/file2" title="file2"><img class="tag2" src="http://www.example.com/img1.jpg" alt="textalt">linktext</a>
<span class="tag3">.</span>
</div>
This is part of a larger HTML page that contains other a elements with other tags. However, I want to select only the a elements whose class is tag11 tag12, and build a list of all their href values. Note that there is a space between tag11 and tag12.
Using Python 3.5, lxml, and XPath, this is a first attempt:
from lxml import html
import requests
page = requests.get('http://www.example.com/page.html')
tree = html.fromstring(page.content)
atest = tree.xpath('//a[contains(@class='tag11 tag12')]')
But it does not work. With single quotes:
File "<stdin>", line 1
buyers = tree.xpath('//a[contains(@class='tag11 tag12')]')
^
SyntaxError: invalid syntax
With double quotes:
tree.xpath('//a[contains(@class="tag11 tag12")]')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:61854)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:178516)
File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:177421)
lxml.etree.XPathEvalError: Invalid number of arguments
Also (from this answer):
atest = tree.xpath('//a[contains(@class, "tag11") and contains(@class, "tag12")]')
but this returns an empty atest list.
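For reference, the two tracebacks above have different causes: the SyntaxError is a pure Python quoting problem (the single quotes around tag11 tag12 terminate the outer string early), while the XPathEvalError comes from XPath itself, since contains() requires two arguments. A minimal sketch combining both fixes, run on a made-up reduction of the sample fragment (not the real page):

```python
from lxml import html

# Minimal stand-in for the question's fragment (an assumption; the real
# page may differ, which could explain the empty result there)
content = ('<div><a class="tag11 tag12" href="http://www.example.com/file1">x</a>'
           '<a class="tag11 tag12" href="http://www.example.com/file2">y</a></div>')
tree = html.fromstring(content)

# Double quotes inside the single-quoted Python string avoid the SyntaxError,
# and contains() gets both of its arguments, avoiding the XPathEvalError.
links = tree.xpath('//a[contains(@class, "tag11 tag12")]/@href')
print(links)  # ['http://www.example.com/file1', 'http://www.example.com/file2']
```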
How do I correctly handle a elements whose class attribute contains a space?
I am using Python 3.5 and lxml because I am trying to learn these tools, so there is no particular reason not to use BeautifulSoup; I am just looking for a solution specific to the tools listed above, if one exists.
Answer 0 (score: 2)
Is there a reason not to use BeautifulSoup4? Here is a code snippet from one of my projects:
import urllib.request  # You could use the requests library as well
from bs4 import BeautifulSoup

url = 'http://www.example.com/page.html'
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/67.0.3396.87 Safari/537.36"}
soup = BeautifulSoup(urllib.request.urlopen(
                         urllib.request.Request(url, headers=header)),
                     'lxml')
links = list()
for link in soup.find_all('a', class_='tag11 tag12'):
    links.append(link.get('href'))
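One caveat with class_='tag11 tag12': that is an exact match against the attribute string, so it depends on the tokens appearing in that order. BeautifulSoup's select() takes a CSS selector, which tests each class independently. A small sketch on a trimmed, hypothetical version of the question's fragment, using the stdlib html.parser so no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the question's fragment (hypothetical)
content = ('<div class="tag1">'
           '<div><a class="tag11 tag12" href="http://www.example.com/file1">linktext</a></div>'
           '<div><a class="tag11 tag12" href="http://www.example.com/file2">linktext</a></div>'
           '</div>')
soup = BeautifulSoup(content, 'html.parser')

# Each .class in the CSS selector is tested independently, so this would
# also match class="tag12 tag11".
links = [a.get('href') for a in soup.select('a.tag11.tag12')]
print(links)  # ['http://www.example.com/file1', 'http://www.example.com/file2']
```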
Answer 1 (score: 2)
Check this XPath: '//a[@class="tag11 tag12"]/@href'
from lxml import html

page = "<div class=\"tag1\"> <div> <a class=\"tag11 tag12\" href=\"http://www.example.com/file1\" title=\"file1\"><img class=\"tag2\" src=\"http://www.example.com/img1.jpg\" alt=\"textalt\">linktext</a> <span class=\"tag3\">.</span> </div> <div> <a class=\"tag11 tag12\" href=\"http://www.example.com/file2\" title=\"file2\"><img class=\"tag2\" src=\"http://www.example.com/img1.jpg\" alt=\"textalt\">linktext</a> <span class=\"tag3\">.</span> </div>"
tree = html.fromstring(page)
links = tree.xpath('//a[@class="tag11 tag12"]/@href')
for link in links:
    print(link)
Output:
http://www.example.com/file1
http://www.example.com/file2
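Worth noting: @class="tag11 tag12" is an exact, order-sensitive string comparison against the whole attribute value. A quick sketch with a hypothetical one-line variation of the sample in which the class tokens are reversed:

```python
from lxml import html

# Hypothetical variation: same anchor, but with the class tokens reversed
page = '<div><a class="tag12 tag11" href="http://www.example.com/file1">linktext</a></div>'
tree = html.fromstring(page)

# The exact comparison only matches the literal string "tag11 tag12",
# so the reversed value is not found.
result = tree.xpath('//a[@class="tag11 tag12"]/@href')
print(result)  # []
```

If the page always writes the classes in the same order, as in the question, the exact match is fine and is the simplest expression.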
Answer 2 (score: 0)
Try the following approach to work with multiple classes. It returns the desired output when both classes are present at the same time:
from lxml.html import fromstring
content = """
<div class="tag1">
<div>
<a class="tag11 tag12" href="http://www.example.com/file1" title="file1"><img class="tag2" src="http://www.example.com/img1.jpg" alt="textalt">linktext</a>
<span class="tag3">.</span>
</div>
<div>
<a class="tag11 tag12" href="http://www.example.com/file2" title="file2"><img class="tag2" src="http://www.example.com/img1.jpg" alt="textalt">linktext</a>
<span class="tag3">.</span>
</div>
"""
tree = fromstring(content)
for atest in tree.xpath('//a[contains(@class, "tag11") and contains(@class, "tag12")]'):
    print(atest.attrib['href'])
Output:
http://www.example.com/file1
http://www.example.com/file2
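One possible refinement to this approach: contains(@class, "tag11") is a plain substring test, so it would also match a hypothetical longer class name such as "tag110". The usual XPath 1.0 idiom pads the attribute with spaces and tests for the whole token. A sketch on a made-up variant of the sample with reversed order and extra whitespace in the class value:

```python
from lxml import html

# Hypothetical variant: reversed order and extra whitespace in the class value
content = '<div><a class=" tag12  tag11 " href="http://www.example.com/file1">linktext</a></div>'
tree = html.fromstring(content)

# Pad and normalize the attribute, then look for the whole token " tag11 ":
# this cannot accidentally match a longer name such as "tag110".
expr = ('//a[contains(concat(" ", normalize-space(@class), " "), " tag11 ") '
        'and contains(concat(" ", normalize-space(@class), " "), " tag12 ")]/@href')
links = tree.xpath(expr)
print(links)  # ['http://www.example.com/file1']
```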