I have the following code, and I want to use a regular expression to tokenize the text of a file in my directory:
def tokenize():
    infile = codecs.open('test_test.txt', 'r', encoding='utf-8')
    text = infile.read()
    infile.close()
    words = []
    with io.open('test_test.txt', 'r', encoding='utf-8') as csvfile:
        text = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
        for item in text:
            for word in item:
                words.append(word)
    tregex = re.compile(ur'[?&/\'\r\n]', re.IGNORECASE)
    newtext1 = tregex.sub(' ', text)
    newtext = re.sub(' +', ' ', newtext1)
    words = re.split(r' ', newtext)
    print words
But I get this error:
Traceback (most recent call last):
  File "D:\KKSC\KKSC.py", line 150, in OnCheckSpell
    tokenize()
  File "D:\KKSC\KKSC.py", line 32, in tokenize
    newtext1 = tregex.sub(' ', text)
TypeError: expected string or buffer
Answer 0 (score: 0)
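The problem is this line:
newtext1 = tregex.sub(' ', text)
By this point text no longer holds the file contents: it was rebound to the result of unicode_csv_reader, which is effectively a two-dimensional array of strings (rows of fields), whereas sub expects a single string. Did you mean to apply it to each word inside the inner loop?
newtext1 = tregex.sub(' ', word)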
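
For what it's worth, here is a minimal sketch of that per-word approach. It is written for Python 3 with the standard csv module, whereas the question's code is Python 2 and uses its own unicode_csv_reader helper, so treat it as an illustration rather than a drop-in replacement. The names SEPARATORS and tokenize_rows and the sample data are hypothetical. It also uses str.split() to break on any whitespace, which is slightly broader than the question's re.split(r' ', ...).

```python
import csv
import io
import re

# Same character class as in the question: ?, &, /, ', CR and LF.
SEPARATORS = re.compile(r"[?&/'\r\n]")


def tokenize_rows(rows):
    """Return a flat list of tokens from an iterable of CSV rows."""
    words = []
    for row in rows:
        for field in row:
            # sub() receives one string at a time, never the whole reader.
            cleaned = SEPARATORS.sub(' ', field)
            # split() with no argument collapses runs of whitespace
            # and drops empty strings.
            words.extend(cleaned.split())
    return words


# Hypothetical sample data standing in for test_test.txt.
sample = io.StringIO('hello/world,"it\'s me"\nfoo&bar,baz?\n')
tokens = tokenize_rows(csv.reader(sample, delimiter=',', quotechar='"'))
print(tokens)
```

To run it against the real file, you could pass csv.reader(open('test_test.txt', encoding='utf-8', newline='')) instead of the in-memory sample.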