I'm trying to take every file in a directory, run it through a series of functions (Python 3), and write each processed file out to some directory. Here is my code:
import re
import glob
import sys
import string
#Create Stop_word Corpora
file1=open("/home/file/corps/stopwords.txt", 'rt', encoding='latin-1')
line= file1.read()
theWords=line.split()
stop_words=sorted(set(theWords)) # Stop Word Corpora
#Gather txt files to be processed
folder_path = "/home/file"
file_pattern = "/*txt"
folder_contents = glob.glob(folder_path + file_pattern)
#Read in the Txt Files
for file in folder_contents:
    print("Checking", file)
words = []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
    return new_words
def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    new_words = []
    for word in words:
        new_word = re.sub(r" \d+", " ", word)
    if new_word != '':
        new_words.append(new_word)
    return new_words
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words
def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words
words = normalize(words)
# Write the new processed file to a different location
append_file=open("/home/file/Processed_Files",'a')
append_file.write("\n".join(words))
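This is the error I keep getting:
I want the new text files, once they have been run through the functions, to be sent to the directory above. So there should be 5 new files in the Processed_Files directory above.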
Answer 0 (score: 2)
The traceback you provided is different from the error reported in the question title.
But your code does this twice:
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
    new_words.append(new_word)
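If `words` is empty, the `for word in words` loop never executes, not even once. And if it never executes, no value is ever assigned to `new_word`. So in that case, when your code reaches `if new_word != '':`, you get the error `new_word referenced before assignment`. That is because your code asks what is in `new_word`, but nothing has been assigned to it.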
If you code it like this, the problem goes away:
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
Which, I suspect, is what you meant anyway.
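To make the difference concrete, here is a minimal sketch that runs both indentations in isolation. The names `buggy_remove_punctuation` and `fixed_remove_punctuation` are made up for this illustration; the bodies follow the question's `remove_punctuation`.

```python
import re


def buggy_remove_punctuation(words):
    # Hypothetical name: the `if` sits outside the loop, as in the question.
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':  # new_word is unbound when words is empty
        new_words.append(new_word)
    return new_words


def fixed_remove_punctuation(words):
    # Hypothetical name: the `if` is indented so it runs once per word.
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words


try:
    buggy_remove_punctuation([])
except UnboundLocalError as exc:
    print("buggy version failed:", exc)

print(fixed_remove_punctuation([]))
print(fixed_remove_punctuation(["Hello,", "world!", "..."]))
```

Note that the misplaced `if` is a problem even when `words` is not empty: the check then runs only once, after the loop, so at most the last word is kept.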
Answer 1 (score: 1)
I suggest 3 changes:
Create an empty list and add all the words to it:
words = []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())
Convert the list to a str correctly:
append_file.write("\n".join(words))
Fix the incorrect indentation of
words = normalize(words)
and
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
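These changes still gather every word into one list and write a single file, whereas the question asks for one processed file per input. A rough sketch of such a loop follows, assuming `Processed_Files` is meant to be a directory. The helper name `process_folder`, its `out_dir` parameter, the `*.txt` pattern, and the choice to reuse each input's file name are assumptions made for illustration; `normalize` is passed in as whatever function does the cleaning.

```python
import glob
import os


def process_folder(folder_path, out_dir, normalize):
    """Normalize each .txt file in folder_path and write it to out_dir.

    Hypothetical helper: `normalize` is any function mapping a list of
    words to a list of words, such as the one in the question.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(os.path.join(folder_path, "*.txt"))):
        with open(path, "rt", encoding="latin-1") as src:
            words = src.read().split()
        # Assumption: the output keeps the input's file name.
        out_path = os.path.join(out_dir, os.path.basename(path))
        with open(out_path, "w", encoding="latin-1") as dst:
            dst.write("\n".join(normalize(words)))
        written.append(out_path)
    return written
```

With the question's paths this would be called as `process_folder("/home/file", "/home/file/Processed_Files", normalize)`, which writes one output file for each input file found.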