Question

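I am trying to split a sentence into words, punctuation marks, and numbers. However, my code does not produce the expected output. How can I fix it?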

This is my input text (in a text file):

 "I 2changed to ask then, said that mildes't of men2,

My code outputs:

['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men2']

However, the expected output is:

 ['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men','2']

This is my code:

import re
newlist = []
f = open("Inputfile2.txt",'r')
out = f.readlines()
for line in out:
    word = line.strip('\n')
    f.close()
    lst = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]").findall(word)
print(lst)

Answer 1

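In a regular expression, '\w' matches any word character, i.e. [a-zA-Z0-9_] (plus Unicode letters and digits for str patterns in Python 3). Because digits count as word characters, 'men2' is matched as a single token.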

Also, the first part of the regular expression should be '\d+' so that it matches a run of several digits rather than one digit at a time.

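The second and third parts of the regular expression, '\w+[\w']+|\w', can be merged into one part by changing the second '+' to '*'. Using [a-zA-Z] instead of '\w' in that part keeps digits out of word tokens, so 'men2' is split into 'men' and '2'.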
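The points above can be checked one at a time. The following is a minimal sketch; the sample strings ("men2 a_b", "room 42", "I mildes't") and the variable names are made up for illustration and are not taken from the question's input file.

```python
import re

# \w covers letters, digits and the underscore, so "men2" stays one token.
word_tokens = re.findall(r"\w+", "men2 a_b")

# \d matches a single digit; \d+ keeps a run of digits together.
single_digits = re.findall(r"\d", "room 42")
digit_runs = re.findall(r"\d+", "room 42")

# Merged form of \w+[\w']+|\w : one or more word characters,
# optionally followed by more word characters or apostrophes.
merged = re.findall(r"\w+[\w']*", "I mildes't")

print(word_tokens)
print(single_digits)
print(digit_runs)
print(merged)
```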

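import re

# Compile once: a run of digits, a word (letters with optional apostrophes),
# or a single punctuation character.
token_re = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

with open('Inputfile2.txt', 'r') as f:
    for line in f:
        word = line.strip('\n')
        lst = token_re.findall(word)
        print(lst)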

This gives:

['"', 'I', '2', 'changed', 'to', 'ask', 'then', ',', 'said', 'that', "mildes't", 'of', 'men', '2', ',']

Note that your expected output is incorrect: it is missing the final ','.
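To compare the two patterns without the input file, the sample sentence can be put in a string. This is only a sketch: the names sample, original_re, fixed_re, original_tokens and fixed_tokens are hypothetical, and the sentence is copied from the question.

```python
import re

# The sentence from the question, held in memory instead of read from a file.
sample = '"I 2changed to ask then, said that mildes\'t of men2,'

# Pattern from the question: \w also matches digits, so "men2" is not split.
original_re = re.compile(r"\d|\w+[\w']+|\w|[^\w\s]")

# Pattern from the answer: words are letters and apostrophes only.
fixed_re = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

original_tokens = original_re.findall(sample)
fixed_tokens = fixed_re.findall(sample)

print(original_tokens[-2:])  # tail of the question's tokenization
print(fixed_tokens[-3:])     # tail of the answer's tokenization
```

The two token lists should differ only at the end of the sentence, where the answer's pattern separates the trailing digit from the word.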

Splitting a sentence in Python using regular expressions
