Question

I have the following code, and I want to use a regular expression to tokenize a text file located in my directory:

def tokenize():
    infile = codecs.open('test_test.txt', 'r', encoding='utf-8')
    text = infile.read()
    infile.close()
    words = []
    with io.open('test_test.txt', 'r', encoding='utf-8') as csvfile:
        text = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
        for item in text:
            for word in item:
                words.append(word)
                tregex = re.compile(ur'[?&/\'\r\n]', re.IGNORECASE)
                newtext1 = tregex.sub(' ', text)
                newtext = re.sub(' +', ' ', newtext1)
                words = re.split(r' ', newtext)
                print words

But I get this error:

Traceback (most recent call last):
  File "D:\KKSC\KKSC.py", line 150, in OnCheckSpell
    tokenize()
  File "D:\KKSC\KKSC.py", line 32, in tokenize
    newtext1 = tregex.sub(' ', text)
TypeError: expected string or buffer

Answer 1

newtext1 = tregex.sub(' ', text)

Here text is a two-dimensional array of strings (rows of CSV cells), whereas sub expects a single string. Did you mean:

newtext1 = tregex.sub(' ', word) ?
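For reference, here is a minimal sketch of how the whole loop might look once the substitution is applied to each cell rather than to the reader object. It is written for Python 3 (an assumption; the question uses Python 2), uses the standard csv.reader in place of the question's unicode_csv_reader helper, and reads from a hypothetical in-memory sample instead of test_test.txt. The character class is copied from the question; re.IGNORECASE is dropped because the pattern contains no letters.

```python
import csv
import io
import re

# Characters treated as separators, copied from the question's pattern.
SEPARATORS = re.compile(r"[?&/'\r\n]")
# Runs of spaces are collapsed into a single space.
SPACES = re.compile(r" +")


def tokenize(csvfile):
    """Return a flat list of tokens from every cell of a CSV stream."""
    words = []
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:        # each row is a list of strings
        for cell in row:      # each cell is a single string, which sub() accepts
            cleaned = SEPARATORS.sub(' ', cell)
            cleaned = SPACES.sub(' ', cleaned).strip()
            if cleaned:
                words.extend(cleaned.split(' '))
    return words


# Hypothetical sample data standing in for the contents of test_test.txt.
sample = io.StringIO('hello/world,"a&b c"\nwhat?,it\'s\n')
tokens = tokenize(sample)
print(tokens)
```

To run this against the real file, something like open('test_test.txt', encoding='utf-8', newline='') could be passed to tokenize in place of the sample. Note that the sketch accumulates tokens with extend instead of reassigning words on every iteration, so earlier cells are not overwritten.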
