Tokenizing with regular expressions in Python

Time: 2015-05-29 18:48:01

Tags: python regex python-2.7 tokenize

I have the following code, and I want to use a regular expression to tokenize the text of a file in my directory:
def tokenize():
    infile = codecs.open('test_test.txt', 'r', encoding='utf-8')
    text = infile.read()
    infile.close()
    words = []
    with io.open('test_test.txt', 'r', encoding='utf-8') as csvfile:
        text = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
        for item in text:
            for word in item:
                words.append(word)
                tregex = re.compile(ur'[?&/\'\r\n]', re.IGNORECASE)
                newtext1 = tregex.sub(' ', text)
                newtext = re.sub(' +', ' ', newtext1)
                words = re.split(r' ', newtext)
                print words

But I get this error:

Traceback (most recent call last):
  File "D:\KKSC\KKSC.py", line 150, in OnCheckSpell
    tokenize()
  File "D:\KKSC\KKSC.py", line 32, in tokenize
    newtext1 = tregex.sub(' ', text)
TypeError: expected string or buffer

1 answer:

Answer 0 (score: 0)

newtext1 = tregex.sub(' ', text)

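By that point text has been rebound to the CSV reader, which yields rows of strings, so it is effectively a two-dimensional array of strings, whereas sub expects a single string. Did you mean: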

newtext1 = tregex.sub(' ', word) ?
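
If that is the intent, the whole loop could be sketched roughly as follows. This is only an illustration under some assumptions: it is written for Python 3 and uses the standard csv.reader (in Python 2.7 the unicode_csv_reader helper from the question would take its place), the function name tokenize_rows and the sample data are made up for the example, and the character class is the one from the question. The point is that sub is called on one cell at a time, and the tokens are accumulated instead of overwriting words on every iteration.

```python
import csv
import io
import re

# Same separator characters as the character class in the question.
SEPARATORS = re.compile(r"[?&/'\r\n]")
SPACES = re.compile(r" +")


def tokenize_rows(rows):
    """Return a flat list of tokens from an iterable of rows (lists of strings).

    tokenize_rows is a hypothetical name used only for this sketch.
    """
    words = []
    for row in rows:
        for cell in row:
            # sub() receives a single string here, not the whole reader.
            cleaned = SEPARATORS.sub(" ", cell)
            cleaned = SPACES.sub(" ", cleaned).strip()
            if cleaned:
                # Accumulate tokens rather than replacing the list.
                words.extend(cleaned.split(" "))
    return words


# Made-up sample data standing in for test_test.txt.
sample = io.StringIO('hello/world,"what?now"\nfoo&bar,baz\n')
tokens = tokenize_rows(csv.reader(sample, delimiter=",", quotechar='"'))
print(tokens)
# ['hello', 'world', 'what', 'now', 'foo', 'bar', 'baz']
```

With a real file, the io.StringIO object would be replaced by the file opened with io.open('test_test.txt', 'r', encoding='utf-8'), as in the question.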