Splitting text by word count in Python

Date: 2018-11-20 12:02:08

Tags: python-3.x

Can anyone tell me what is wrong with my code? I want to split a large text into smaller texts by word count, so that each segment contains, for example, 60 words.

file = r'C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\economy2.txt'

openFile = open(file, 'r', encoding='utf-8-sig')
words = openFile.read().split()
#print (words)

i = 0
for idx, w in enumerate(words, start=0):
    textNum = 1
    while textNum <= 20:
        wordAsText = []
        print("word list before:", wordAsText)
        while i < idx + 60:
            wordAsText.append(words[i])
            i += 1
        print("word list after:", wordAsText)
        textSeg = ' '.join(wordAsText)
        print(textNum, textSeg)
        files = open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(textNum) + ".txt", "w", encoding='utf-8-sig')
        files.write(textSeg)
        files.close()
        idx += 60
        if textNum != 20:
            continue
        textNum += 1

My large file (economy2) contains over 12K words.
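(For anyone reading later: the loop above has two control-flow problems. First, `textNum` is only incremented when it already equals 20, so the `continue` keeps it stuck at 1, every segment overwrites Eco1.txt, and `while textNum <= 20` never terminates. Second, nothing stops the inner `while` once `i` reaches the end of `words`, so it eventually raises an IndexError. A minimal sketch that keeps the same variables and output paths but fixes both, assuming one output file per 60-word segment as in the edit below:)

i = 0
textNum = 1
while i < len(words):
    wordAsText = words[i:i + 60]  # slicing stops safely at the end of the list
    textSeg = ' '.join(wordAsText)
    files = open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(textNum) + ".txt", "w", encoding='utf-8-sig')
    files.write(textSeg)
    files.close()
    i += 60       # advance by one full segment
    textNum += 1  # next output file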

Edit: Thanks for all the replies. I tried what I found here, and it achieved what I needed.

The modified code:

file = r'C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\economy2.txt'

openFile = open(file, 'r', encoding='utf-8-sig')
words = openFile.read().split()
#print (words)
n = 60
segments = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]  # from link
i = 1
for s in segments:
    seg = open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(i) + ".txt", "w", encoding='utf-8-sig')
    seg.write(s)
    seg.close()
    i += 1
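One small cleanup worth noting: if the I/O goes through `with` blocks, neither `openFile` nor the per-segment handles need manual `close()` calls (the `openFile` handle above is in fact never closed), and `enumerate` can replace the manual counter. A sketch of the same logic, using the same paths:

file = r'C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\economy2.txt'

with open(file, 'r', encoding='utf-8-sig') as f:  # closed automatically, even on error
    words = f.read().split()

n = 60
segments = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]

for i, s in enumerate(segments, start=1):  # start=1 matches Eco1.txt, Eco2.txt, ...
    with open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(i) + ".txt", "w", encoding='utf-8-sig') as seg:
        seg.write(s)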
