Question

I want to split a string like this:

string = '[[he (∇((comesΦf→chem,'

on whitespace and punctuation, where the punctuation can also be Unicode characters. That is, I expect output of the following form:

out= ['[', '[', 'he',' ', '(','∇' , '(', '(', 'comes','Φ', 'f','→', 'chem',',']

I am using

re.findall(r"[\w\s\]+|[^\w\s]",String,re.unicode)

for this, but it returns the following output:

output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']

Please tell me how to solve this problem.

Answer 1

Without using regular expressions, and assuming that words contain only ASCII letters:

from string import ascii_letters
from itertools import groupby

LETTERS = frozenset(ascii_letters)

def is_alpha(char):
    return char in LETTERS

def split_string(text):
    for key, tokens in groupby(text, key=is_alpha):
        if key: # Found letters, join them and yield a word
            yield ''.join(tokens)
        else:  # not letters, just yield the single tokens
            yield from tokens

Example result:

In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']

If you are using a Python version older than 3.3, you can replace yield from tokens with:

for token in tokens: yield token

If you are on Python 2, keep in mind that split_string expects a unicode string, not a byte string.
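
To illustrate why the string type matters, here is a small Python 3 sketch. It assumes the input in the question was UTF-8 encoded, which the byte values in the reported output suggest. The names raw, nabla_bytes, phi_bytes, arrow_bytes and text are only illustrative.

```python
# Assumption: the original input was a UTF-8 encoded byte string.
original = '[[he (∇((comesΦf→chem,'

# Each non-ASCII character becomes several bytes once encoded.
nabla_bytes = '∇'.encode('utf-8')   # b'\xe2\x88\x87'
phi_bytes = 'Φ'.encode('utf-8')     # b'\xce\xa6'
arrow_bytes = '→'.encode('utf-8')   # b'\xe2\x86\x92'

# A byte string is what a plain Python 2 str would hold.
raw = original.encode('utf-8')

# Decoding first gives one item per character instead of one per byte.
text = raw.decode('utf-8')
print(len(raw), len(text))
```

Splitting raw item by item would break '∇' into three pieces, which matches the '\xe2', '\x88', '\x87' entries in the question; splitting text does not.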

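Note that by changing the is_alpha function you can define different kinds of grouping. For example, if you want to treat all Unicode letters as letters, you can set is_alpha = str.isalpha (or unicode.isalpha in Python 2):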

In [3]: is_alpha = str.isalpha

In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']

Note that 'comesΦf', which was split before, is now kept together as a single token.
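
As a variation on the code above, the predicate could be passed as an argument instead of rebinding the global is_alpha. This is only a sketch for Python 3.3+; split_by and is_word_char are hypothetical names, not part of the answer's code.

```python
from itertools import groupby
from string import ascii_letters

ASCII_LETTERS = frozenset(ascii_letters)

def split_by(text, is_word_char=str.isalpha):
    """Group consecutive word characters; yield every other character alone."""
    for is_word, chars in groupby(text, key=is_word_char):
        if is_word:
            # A run of word characters becomes one token.
            yield ''.join(chars)
        else:
            # Everything else is yielded one character at a time.
            yield from chars

sample = '[[he (∇((comesΦf→chem,'

# All Unicode letters count as word characters (the default).
unicode_words = list(split_by(sample))

# Only ASCII letters count as word characters.
ascii_words = list(split_by(sample, ASCII_LETTERS.__contains__))
```

Passing the predicate explicitly keeps both behaviours available side by side, without changing any module-level state.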

Answer 2

Hope this helps.

In [33]: string = '[[he (∇((comesΦf→chem,'

In [34]: re.split(r'\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']
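
Note that re.split discards the separators, so the punctuation is lost. If the tokens from the question are wanted, re.findall with an alternation may be closer. The following is a sketch for Python 3, where str patterns are Unicode-aware by default; the variable names are illustrative.

```python
import re

text = '[[he (∇((comesΦf→chem,'

# Runs of ASCII letters form words; any other character is its own token.
ascii_tokens = re.findall(r'[A-Za-z]+|[^A-Za-z]', text)

# Runs of \w form words. In Python 3, \w also matches non-ASCII letters
# such as 'Φ', as well as digits and the underscore.
unicode_tokens = re.findall(r'\w+|\W', text)

print(ascii_tokens)
print(unicode_tokens)
```

The first pattern treats 'Φ' as a separate token, as in the expected output of the question; the second keeps 'comesΦf' together.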
