Question

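I'm in a bit of a bind with Python. I want to take a .txt file containing a lot of comments and split it into a list. I want to split on all punctuation, whitespace, and \n. When I run the following Python code, it splits my text file in odd places. Note: below I'm only trying to split on periods and newlines to test it, but it still frequently drops the last letter of words.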

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split('. | \n, nf)

print(wList)

Answer 1

You need to fix the quotes and tweak the regular expression slightly:

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\W+', nf)

print(wList)
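
To illustrate what `\W+` does, here is a minimal sketch on a made-up sample string. It uses the standard-library `re` module rather than the third-party `regex` package imported above (an assumption made so the snippet has no extra dependency; for this pattern the two should behave the same).

```python
import re

# Hypothetical sample text standing in for the contents of the file.
sample = "Hello, world! It's fine.\n"

# \W+ matches one or more non-word characters (punctuation, spaces, newlines).
pieces = re.split(r'\W+', sample)

# A delimiter at the very end leaves a trailing empty string; filter it out.
words = [w for w in pieces if w]

print(pieces)  # ['Hello', 'world', 'It', 's', 'fine', '']
print(words)   # ['Hello', 'world', 'It', 's', 'fine']
```

Note that the apostrophe counts as a non-word character, so "It's" is split into two pieces, and the trailing ".\n" produces an empty string that usually needs to be filtered out.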

Answer 2

You forgot to close the string, and you need a \ before the period.

import regex as re
with open('G:/My Documents/AHRQUnstructuredComments2.txt','r') as infile:
    nf = infile.read()
    wList = re.split(r'\. |\n |\s', nf)

print(wList)

For more details, see Split Strings with Multiple Delimiters?.

Also, RichieHindle answered your question perfectly:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Answer 3

In a regular expression, the character . matches any character. You have to escape it as \. to match a literal period.
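
The effect is easy to reproduce on a short made-up string. The sketch below (sample text and variable names are illustrative, and it uses the standard-library `re` module) compares the unescaped pattern with the escaped one. It shows why the last letter of each word disappears: in `. ` the dot consumes whatever character precedes the space.

```python
import re

# Hypothetical sample text; not taken from the original file.
text = "The cat sat. The dog ran.\nDone"

# Unescaped: '.' matches ANY character (except newline), so '. ' swallows
# the character before every space, including the last letter of a word.
unescaped = re.split(r'. |\n', text)

# Escaped: '\.' matches only a literal period.
escaped = re.split(r'\. |\n', text)

print(unescaped)  # ['Th', 'ca', 'sat', 'Th', 'do', 'ran.', 'Done']
print(escaped)    # ['The cat sat', 'The dog ran.', 'Done']
```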

`re.split()` in Python behaves strangely

3 answers: