Question

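I would like to know how to use Python to get the value between certain HTML tags in some HTML code.

Say I want to get the price of a product on an Amazon page.

So far I have: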

import urllib

url = raw_input("Enter the url:\n")
sock = urllib.urlopen(url)
htmlsource = sock.read()
sock.close()

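So now I have the HTML source as a string, but I don't know how to extract the price. I have played around with re.search but could not get the expression right.

Say the price is enclosed in <span class="price">£79.98</span>.

What would be the best way to get var1 = 79.98?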

Answer 1

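You need to use an HTML parsing library. It gives you much better functionality than plain regular expressions, which are easy to get wrong and hard to maintain. The Python standard library ships html.parser in Python 3 and HTMLParser in the Python 2.x series; either can help you parse the HTML and get the contents of tags.

You can also use the BeautifulSoup library, which many people find easy to use.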
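As a rough illustration of the standard-library route, here is a minimal sketch using html.parser on Python 3. The names PriceParser and extract_prices are hypothetical, chosen for this example. It assumes the class attribute is exactly "price" and that the span contains no nested tags, so treat it as a starting point rather than a robust scraper.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text found inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False  # True while we are inside a matching <span>
        self.prices = []       # text collected from matching spans

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def extract_prices(html):
    """Hypothetical helper: return the text of all price spans in html."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(extract_prices('<span class="price">£79.98</span>'))
```

The parser only reports events (start tag, data, end tag), so the small in_price flag is what ties a piece of text to the tag it belongs to.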

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<span class="price">79.98</span>')
t = soup.find('span', attrs={"class":"price"})
print t.renderContents()

Answer 2

Parsing HTML with regular expressions is nasty, error-prone, and generally evil.

import lxml.html

url = raw_input("Enter the url:\n")
root = lxml.html.parse(url).getroot()
res = root.xpath('//span[@class="price"]/text()') or []

print res

which returns something like

['\xc2\xa379.98', '\xc2\xa389.98', '\xc2\xa399.98']

Now that we are dealing with plain strings, a regular expression is the right tool:

import re

def getPrice(s):
    res =  re.search(r'\d+\.\d+', s)
    if res is None:
        return 0.
    else:
        return float(res.group(0))

prices = map(getPrice, res)
print prices

which results in

[79.98, 89.98, 99.98]
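For readers on Python 3, a variant of the same idea might look like the sketch below. It is not the answer's original code: parse_price and PRICE_RE are hypothetical names, the pattern additionally assumes that a comma is a thousands separator (as in UK prices), and it returns None instead of 0. when no number is found, so that a missing price cannot be mistaken for a free item.

```python
import re

# Digits with optional thousands separators and an optional decimal part,
# e.g. "79.98", "1,089.98" or "80". Assumes "," is a thousands separator.
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a float, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


raw = ["£79.98", "£1,089.98", "n/a"]
prices = [parse_price(s) for s in raw]
print(prices)  # [79.98, 1089.98, None]
```

If exact amounts matter, for example when summing prices, decimal.Decimal may be a better fit than float.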

Answer 3

As an alternative to BeautifulSoup, you could try lxml. The lxml website has a comparison of the two.
