Suppose I have read a file (a .txt file) in Python 3. Next, I need to parse the contents of that single list into multiple lists, based on the whitespace and the \n sentence breaks of the original file.
I then need to write and save a separate new file containing the list of lists.
Once this is done, I should have 2 files in the directory: one containing the entire text in a single list, and the other containing only the list of lists.
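In other words, the end result should look roughly like this (the file names here are made up, and json is only one possible way to serialize the lists):

import json
from nltk.tokenize import sent_tokenize

# hypothetical file names, for illustration only
with open("b1.txt", encoding="utf-8") as f:
    text = f.read()

single_list = text.split()                                 # the entire text as one flat list of tokens
list_of_lists = [s.split() for s in sent_tokenize(text)]   # one sub-list per sentence

with open("b1_single.txt", "w", encoding="utf-8") as f:
    json.dump(single_list, f)        # file 1: the entire text in a single list
with open("b1_lists.txt", "w", encoding="utf-8") as f:
    json.dump(list_of_lists, f)      # file 2: only the list of lists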
I have tried this but have not succeeded so far, and any help is appreciated.
The code I have been hammering away at is below.
import nltk, re
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer

def remove_punctuation(from_text):
    # strip all punctuation characters from each token
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in from_text]
    return stripped

def preprocessing():
    with open("I:\\(2018 - 2019)\\College Desktop\\Pen Drive 8 GB\\PDF\\Code\\Books Handbooks\\Books Handbooks Text\\b1.txt", encoding="utf-8") as f:
        tokens_sentences = sent_tokenize(f.read())
    # split each sentence on whitespace and lowercase the words
    tokens = [[word.lower() for word in line.split()] for line in tokens_sentences]
    global stripped_tokens
    stripped_tokens = [remove_punctuation(i) for i in tokens]
    sw = stopwords.words('english')
    # keep only alphabetic, non-stopword tokens that also satisfy the regex
    filter_set = [[token for token in sentence
                   if (token.lower() not in sw
                       and token.isalnum() and token.isalpha()
                       and re.findall(r"[^_ .'\"-[A-Za-z]]+", token))]
                  for sentence in stripped_tokens]
    lemma = WordNetLemmatizer()
    lem = []
    for w in filter_set:
        lem.append([wi for wi in map(lemma.lemmatize, w)])
    return lem

result = preprocessing()
with open('I:\\(2018 - 2019)\\College Desktop\\Pen Drive 8 GB\\PDF\\Code\\Books Handbooks\\Books Handbooks Text\\b1_list.txt', "w", encoding="utf-8") as f1:
    # write the first 3 sentences to the new file
    for e in result[:3]:
        f1.write(str(e))

preprocessing()
I am frustrated because the program executes without errors, yet the output is not what I want. For example, with the code above I expect the first 3 sentences to be written to the new file.
But when I open the new file, it shows 3 empty lists, something like [] [] []. Why is this happening?
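For what it's worth, even a plain alphabetic word seems to fail the re.findall part of the filter when I test it on its own in the REPL:

>>> import re
>>> re.findall(r"[^_ .'\"-[A-Za-z]]+", "hello")
[]

If that is right, the empty list returned there is falsy, which would make the whole if condition False for every token (though I am not sure this is the actual cause).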