Question

I wrote code like this:

print re.findall(r'(<td width="[0-9]+[%]?" align="(.+)">|<td align="(.+)"> width="[0-9]+[%]?")([ \n\t\r]*)([0-9,]+\.[0-9]+)([ \n\t\r]*)([&]?[a-zA-Z]+[;]?)([ \n\t\r]*)<span class="(.+)">',r.text,re.MULTILINE)

to match this line:

<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">

I want the value 556.348. How can I get it using regular expressions?

Answer 1

A straight cut and paste from the HTMLParser documentation gets you the data inside the tags, though it does not use regular expressions.

# Python 3 module path; on Python 2 this was: from HTMLParser import HTMLParser
from html.parser import HTMLParser

# Create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

# Instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">')
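
To go from printing every parser event to pulling out just the number, the same handler approach can collect the text found inside each td cell. The following is only a minimal sketch for Python 3 and is not part of the documentation example; the class name TdTextCollector and its attributes in_td and values are made up for illustration.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collect the text found inside <td> cells (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.in_td = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # strip() also removes the non-breaking space left by &nbsp;
        text = data.strip()
        if self.in_td and text:
            self.values.append(text)


collector = TdTextCollector()
collector.feed('<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">')
collector.close()  # process any text still buffered by the parser
print(collector.values)
```

Each entry in collector.values is a string, so it can be passed to float() if a number is needed.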

Answer 2

Here is a solution that should show how to get the matched groups. You should also read the documentation.

import re

text_to_parse = '<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">'
pattern = r'(<td width="[0-9]+[%]?" align="(.+)">|<td align="(.+)"> width="[0-9]+[%]?")([ \n\t\r]*)([0-9,]+\.[0-9]+)([ \n\t\r]*)([&]?[a-zA-Z]+[;]?)([ \n\t\r]*)<span class="(.+)">'
m = re.search(pattern, text_to_parse)
print(m.group(5))  # group 5 is the ([0-9,]+\.[0-9]+) group: 556.348
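
The long pattern contains nine capturing groups, which is why re.findall returns a tuple per match and why the number has to be picked out by its group index. If only the number is needed, a pattern with a single capturing group may be easier to read. The sketch below is one possible simplification, assuming the number is the first thing inside the td cell; simple_pattern and the other variable names are illustrative and not from the original answer.

```python
import re

text_to_parse = '<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">'

# Illustrative pattern with one capturing group: the number right after the
# opening <td ...> tag. [^>]* skips the attributes regardless of their order.
simple_pattern = r'<td[^>]*>\s*([0-9,]+\.[0-9]+)'

match = re.search(simple_pattern, text_to_parse)
if match:
    value = match.group(1)                  # the captured text, as a string
    number = float(value.replace(',', ''))  # drop thousands separators first
    print(value, number)

# With exactly one group, findall returns a flat list of the captured strings
values = re.findall(simple_pattern, text_to_parse)
print(values)
```

If the page contains several such cells, re.findall returns every captured number in document order.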

However, you do not need regular expressions to parse HTML. Use an HTML parser instead, such as Beautiful Soup:

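from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(text_to_parse, 'html.parser')
# Text content of the cell; strip() drops the non-breaking space from &nbsp;
print(soup.text.strip())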
