Question

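I have a very large string: an HTML page. I need to find all the flash drive names, that is, I need the content between the double quotes in data-name="USB Flash-drive Leef Fuse 32Gb">. So I need the string between data-name=" and ">. Please don't suggest BeautifulSoup; I need to do this without it. I would prefer to avoid regular expressions as well, but a regex solution is acceptable.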

I tried this:

import re

p = re.compile('(?<=")[^,]+(?=")')
result = p.match(html_str)
print(result)

But the result is None. On regex101.com, however, the same pattern works.

Answer 1

Python 2: https://docs.python.org/2/library/htmlparser.html

Python 3: https://docs.python.org/3/library/html.parser.html

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # tag = 'sometag'
        for attr in attrs:
            # attr = ('data-name', 'USB Flash-drive Leef Fuse 32Gb')
            if attr[0] == 'data-name':
                print(attr[1])

parser = MyHTMLParser()
parser.feed('<sometag data-name="USB Flash-drive Leef Fuse 32Gb">hello  world</sometag>')

Output:

USB Flash-drive Leef Fuse 32Gb

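I added a few comments to the code to show you the shape of the data the parser passes to the handler.

It should be easy to build on from here.

Just feed it the HTML and it will parse it. See the documentation and keep experimenting.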
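To get all the names back as a list instead of printing them, the same parser can collect the values as it goes. This is only a minimal sketch: the class name `DataNameCollector`, the helper `extract_data_names`, and the second product name in the sample are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class DataNameCollector(HTMLParser):
    """Collects every data-name attribute value found in start tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "data-name" and value is not None:
                self.names.append(value)


def extract_data_names(html_str):
    """Hypothetical helper: parse html_str and return all data-name values."""
    parser = DataNameCollector()
    parser.feed(html_str)
    parser.close()
    return parser.names


# Made-up sample input; the second product name is invented for the example.
sample = (
    '<div data-name="USB Flash-drive Leef Fuse 32Gb">a</div>'
    '<div data-name="USB Flash-drive Kingston 16Gb">b</div>'
)
print(extract_data_names(sample))
```

Because the parser understands tags and attributes, this keeps working when data-name is not the last attribute in the tag.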

Answer 2

If you want to do this with basic Python string operations, here is one way:

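s = "html string"
marker = 'data-name="'
start = s.find(marker) + len(marker)  # skip past the marker itself
end = s.find('">', start)             # look for the closing '">' after the marker
output = s[start:end]

Here is what happens in my Python shell:

>>> s = 'junk...data-name="USB Flash-drive Leef Fuse 32Gb">...junk'
>>> marker = 'data-name="'
>>> start = s.find(marker) + len(marker)
>>> end = s.find('">', start)
>>> output = s[start:end]
>>> output
'USB Flash-drive Leef Fuse 32Gb'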

Let me know whether this part of the script works on its own.
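The find-based approach above returns only the first occurrence, while the question asks for all of them. One way to extend it, sketched here under the assumption that each value ends at the next double quote, is to keep searching from where the previous match ended. A regex equivalent is included for comparison; note that `re.match` only matches at the very start of the string, which is why the attempt in the question returns None, whereas `re.findall` scans the whole string. The function names `find_all_names` and `find_all_names_re` are invented for this example.

```python
import re


def find_all_names(s, marker='data-name="', closer='"'):
    """Return every substring between marker and the next closer."""
    names = []
    pos = 0
    while True:
        start = s.find(marker, pos)
        if start == -1:
            break  # no more markers
        start += len(marker)  # move past the marker to the value itself
        end = s.find(closer, start)
        if end == -1:
            break  # unterminated attribute; stop scanning
        names.append(s[start:end])
        pos = end + len(closer)  # resume after this match
    return names


def find_all_names_re(s):
    """Regex variant: capture everything up to the closing quote."""
    return re.findall(r'data-name="([^"]*)"', s)


# Made-up sample input for illustration.
html_str = 'junk...data-name="first drive">...data-name="second drive">...junk'
print(find_all_names(html_str))
print(find_all_names_re(html_str))
```

Both functions treat the page as plain text, so they will also pick up data-name=" sequences that appear inside comments or scripts; the HTMLParser approach avoids that.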

Python: how to find all matching substrings?
