Question

I am using BeautifulSoup to scrape a URL, and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

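In the code above we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could anyone provide some example code to make this easier to follow?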

Answer 1

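Nope, BeautifulSoup by itself does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.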

import urllib2
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
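As a minimal offline sketch of the same idea (an illustration, not the answerer's code), the snippet below parses a small inline HTML string with lxml's HTML parser and selects the td.empformbody cells with XPath. The sample markup and variable names are made up for illustration; with a live page you would pass the downloaded bytes instead.

```python
from lxml import etree

# Hypothetical sample markup standing in for the downloaded page.
sample_html = b"""
<html><body>
  <table>
    <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
    <tr><td class="empformbody">Bob</td></tr>
  </table>
</body></html>
"""

# etree.HTML() parses with the lenient HTML parser and returns the root element.
root = etree.HTML(sample_html)

# Select every <td> whose class attribute is exactly "empformbody".
cells = root.xpath('//td[@class="empformbody"]')
cell_texts = [cell.text for cell in cells]
print(cell_texts)
```

Note that `@class="empformbody"` compares the whole attribute string, so a cell with several classes would need something like `contains(@class, "empformbody")` instead.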

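You may also be interested in the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier: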

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass

Coming full circle: BeautifulSoup itself has very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
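A small runnable sketch of that select() call, assuming bs4 is installed and using Python's built-in "html.parser" backend. The markup and the table ids are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables, only one of which has id="foobar".
sample_html = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody wide">Bob</td></tr>
</table>
<table id="unrelated">
  <tr><td class="empformbody">Carol</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# td.empformbody matches any <td> that has "empformbody" among its classes,
# and table#foobar restricts the search to cells inside that table.
selected = [cell.get_text() for cell in soup.select("table#foobar td.empformbody")]
print(selected)
```

Unlike an exact XPath attribute comparison, the CSS class selector matches individual class tokens, which is why the cell with two classes is included.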

Answer 2

I can confirm that there is no XPath support within Beautiful Soup.

Answer 3

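Martijn's code no longer works properly (it is 4+ years old by now...): the etree.parse() line prints to the console and does not assign the value to the tree variable. Following this reference, I was able to get it working using requests and lxml.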
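A minimal sketch of one way this can look with lxml.html, not necessarily the answerer's exact code. The page bytes are inlined so the sketch runs offline; the markup, the title and class values, and the variable names are assumptions for illustration.

```python
from lxml import html

# With a live page you would typically obtain the bytes via the third-party
# requests package, e.g. page_content = requests.get(url).content
# Hypothetical inline markup is used here so the sketch runs offline.
page_content = b"""
<html><body>
  <div title="name">Alice</div>
  <div title="name">Bob</div>
  <span class="price">$10</span>
</body></html>
"""

# html.fromstring() returns the parsed element, so it can be assigned
# to a variable and queried with XPath.
tree = html.fromstring(page_content)

names = tree.xpath('//div[@title="name"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
print(names, prices)
```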

Answer 4

BeautifulSoup has a function named findNext that searches forward from the current element, starting with its children, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]//a
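A runnable sketch of that chain, using bs4's current method names (find_next and find_all) on invented markup. One caveat worth illustrating: find_next walks forward through the document and is not limited to the starting element's descendants, so a CSS selector called on the starting element is shown alongside as a more tightly scoped alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids and class names mirror the answer's placeholders.
sample_html = """
<div id="father">
  <div class="class_value">
    <div id="id_value"><a href="/one">one</a><a href="/two">two</a></div>
  </div>
</div>
<div class="class_value"><div id="other"><a href="/three">three</a></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
father = soup.find(id="father")

# Chain of forward searches, as in the answer.
chained = (
    father.find_next("div", {"class": "class_value"})
    .find_next("div", {"id": "id_value"})
    .find_all("a")
)
chained_hrefs = [a["href"] for a in chained]

# select() called on father only returns matches inside father.
selected_hrefs = [a["href"] for a in father.select("div.class_value > div#id_value a")]
print(chained_hrefs, selected_hrefs)
```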

Answer 5

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above uses a combination of the Soup object and lxml, so values can be extracted with XPath.
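The same round trip as a self-contained sketch, with an inline string in place of the local file. The markup is invented to match the XPath used in the answer; both bs4 and lxml are assumed to be installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup shaped to fit the answer's XPath expression.
sample_html = """
<div id="BGINP01_S1">
  <section><div><font>first value</font></div></section>
  <section><div><font>second value</font></div></section>
</div>
"""

# Parse (and possibly clean up) the document with BeautifulSoup first...
soup = BeautifulSoup(sample_html, "html.parser")

# ...then serialize it back to a string and hand it to lxml for XPath queries.
dom = etree.HTML(str(soup))
values = dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
print(values)
```

The document is parsed twice this way, so for large pages it may be simpler to give the raw HTML to lxml directly.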

Answer 6

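I have searched through their docs and there seems to be no XPath option. Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion is: no, there is no XPath parsing available.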

Answer 7

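This is a pretty old thread, but there is now a workaround that may not have existed in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". I then run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It is not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.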

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
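A self-contained sketch of the same approach with an inline feed in place of the downloaded text. The feed content is invented, and bs4's "xml" parser is assumed to be available (it relies on lxml being installed).

```python
from bs4 import BeautifulSoup

# Hypothetical RSS text standing in for the content fetched with requests.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title></item>
  </channel>
</rss>"""

rss_obj = BeautifulSoup(rss_text, "xml")

# Attribute-style navigation follows the path rss -> channel -> title.
feed_title = rss_obj.rss.channel.title.get_text()

# find_all() covers the "every matching element" case.
item_titles = [item.title.get_text() for item in rss_obj.find_all("item")]
print(feed_title, item_titles)
```

Each attribute step returns the first matching descendant rather than a strict child, so this is looser than the XPath /rss/channel/title when the same tag name appears at several depths.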

Answer 8

With lxml it is all simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

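But with BeautifulSoup (BS4) it is also very simple:

First, remove "//" and "@".
Second, add a star before "=".

Try this magic:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, CSS selectors cannot address an attribute node the way XPath does, so I removed the "/@href" part.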
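A short sketch of how the href values can still be recovered after select(), by reading the attribute from each matched tag. It also shows that `*=` is a substring match, which is broader than the exact comparison in the XPath version. The markup is invented, and "html.parser" is used here instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one exact and one partial class match.
html_doc = """
<a class="shared-components" href="/exact">exact</a>
<a class="shared-components-extra" href="/substring">substring</a>
<a class="other" href="/other">other</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# [class*="..."] matches any class attribute containing the text.
substring_hrefs = [a["href"] for a in soup.select('a[class*="shared-components"]')]

# [class="..."] requires the whole attribute value to be equal.
exact_hrefs = [a["href"] for a in soup.select('a[class="shared-components"]')]
print(substring_hrefs, exact_hrefs)
```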

Answer 9

Maybe you can try the following without XPath:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))

Answer 10

Use soup.find(class_='myclass')

Can we use XPath with BeautifulSoup?

10 answers: