Question

I have been trying to extract the words (the non-digit characters) from the following:

<div class="text">hello there 234 44</div>

This is what I am doing:

regex_name = re.compile(r'<div class="text">([^\d].+)</div>')

Answer 1

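As a starting point, I would use the BeautifulSoup HTML parser to find the desired element in the HTML input and extract its text.

Then I would use itertools.takewhile() to take characters from the string until the first digit is encountered: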

In [1]: from itertools import takewhile

In [2]: from bs4 import BeautifulSoup

In [3]: data = """<div class="text">hello there 234 44</div>"""

In [4]: soup = BeautifulSoup(data, "html.parser")

In [5]: text = soup.find("div", class_="text").get_text()

In [6]: ''.join(takewhile(lambda x: not x.isdigit(), text))
Out[6]: 'hello there '
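
The takewhile() step above can be wrapped in a small helper. This is only a minimal sketch: the function name text_before_first_digit is hypothetical (it is not part of the original answer), and it assumes the element text has already been extracted, for example with get_text() as shown above.

```python
from itertools import takewhile


def text_before_first_digit(text):
    """Return the characters of `text` that come before its first digit."""
    # takewhile() stops at the first character for which the predicate is False.
    return ''.join(takewhile(lambda ch: not ch.isdigit(), text))


prefix = text_before_first_digit("hello there 234 44")
print(repr(prefix))           # 'hello there '
print(repr(prefix.rstrip()))  # 'hello there'
```

Note that the result keeps the space before the first digit; calling rstrip() on it is one way to drop that, if it is unwanted.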

Answer 2

You may want to use a positive look-behind assertion:

import re

s = """<div class="text">A hawking party 64 x 48 1/2in (163 x 123.3cm)</div>"""
r = r"(?<=\">)[^\d]+"
o = re.findall(r, s)
print(o)
# ['A hawking party ']
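
As a variation on the look-behind idea, the assertion can be anchored to the specific opening tag rather than to any `">`. This is a sketch under assumptions, not the answer's own pattern: it assumes the opening tag is written exactly as `<div class="text">` (Python's re module requires look-behind patterns to be fixed-width), and the names html, pattern, and leading_text are illustrative.

```python
import re

html = '<div class="text">A hawking party 64 x 48 1/2in (163 x 123.3cm)</div>'

# The look-behind is a fixed-width literal, as Python's re module requires.
# [^\d<]+ then matches up to the first digit, or up to the closing tag
# if the text contains no digits at all.
pattern = re.compile(r'(?<=<div class="text">)[^\d<]+')

match = pattern.search(html)
leading_text = match.group(0).rstrip() if match else ''
print(leading_text)  # A hawking party
```

Excluding `<` from the character class keeps the match from running into `</div>` when the element contains no digits.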

Answer 3

data = '<div class="text">A hawking party 64 x 48 1/2in (163 x 123.3cm)</div>'
final = ''
# Strip the surrounding tags, then copy characters until the first digit.
for i in data.replace('<div class="text">', '').replace('</div>', ''):
    if not i.isdigit():
        final += i
    else:
        break
print(final)

Result:

A hawking party

Extract non-digit characters between HTML tags