Question

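I am new to Python, so I could use a lot of help here! My goal is to take an article, filter out all of the junk words, and eventually import the result into Excel so I can do some text analysis. As it stands, the articles are too long to copy into a single cell because of size limitations. I have the following code: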

article = open(filename, 'w')

letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(article))

stop_words = set(stopwords.words('english')) 

# Tokenize the article: tokens
tokens = word_tokenize(letters_only)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

filtered_sentence = [w for w in alpha_only if not w in stop_words] 

filtered_sentence = [] 

for w in alpha_only: 
    if w not in stop_words: 
        filtered_sentence.append(w)

article.write(str(filtered_sentence))

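The problem I am having is that when I try to write the file, the code basically deletes all of the text and writes nothing in its place. If there is an easier way to prepare a file for machine learning, and/or to strip the stop words from a file and save it, I would greatly appreciate it.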

Answer 1

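You have not provided all of your code: nothing in it reads the article, so we need more context to help you properly. I will still try to help with what you have provided.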

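If you load the article from the web, I suggest keeping it as a plain string (that is, not saving it to a file first), removing whatever you do not want from it, and only then saving it.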
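A minimal sketch of that in-memory approach, using only the standard library. The tiny `STOP_WORDS` set and the `clean_text` name are placeholders made up for illustration; in real use you would likely pass `set(stopwords.words('english'))` from NLTK instead.

```python
import re

# Hypothetical, deliberately tiny stop-word set; swap in NLTK's list in real use.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Return the lowercase words of text that are not stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # non-letters become spaces
    words = letters_only.lower().split()           # split on any whitespace
    return [w for w in words if w not in stop_words]

cleaned = clean_text("The price of a stock is 42 dollars, and rising!")
print(" ".join(cleaned))  # price stock dollars rising
```

Nothing touches the disk until the cleaned text is ready, so a bug in the cleaning step cannot destroy the source.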

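Otherwise, if you load it from a file, you may prefer to save the cleaned article to a different file and delete the original afterwards. That way you cannot lose data.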
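One way to sketch the "write to a different file" idea is shown here. The helper name `save_cleaned_copy`, the file names, and the use of `str.lower` as a stand-in cleaning step are all assumptions for the demo, not part of the original code.

```python
import tempfile
from pathlib import Path

def save_cleaned_copy(src, dst, transform):
    """Read src, apply transform to its text, and write the result to dst."""
    src, dst = Path(src), Path(dst)
    if src.resolve() == dst.resolve():
        raise ValueError("dst must differ from src so the original is preserved")
    text = src.read_text(encoding="utf-8")
    dst.write_text(transform(text), encoding="utf-8")
    return dst

# Demo in a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "article.txt"
    original.write_text("Hello World", encoding="utf-8")
    cleaned = save_cleaned_copy(original, Path(tmp) / "article_clean.txt", str.lower)
    cleaned_text = cleaned.read_text(encoding="utf-8")
    original_text = original.read_text(encoding="utf-8")  # still intact
```

Delete the original only after you have checked that the cleaned copy looks right.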

Here, your code erases the file's contents because of the 'w' flag, and then writes nothing useful back to the file.

'w' -> Truncate the file to zero length, or create a text file for writing. The stream is positioned at the beginning of the file.
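The truncation happens at the moment the file is opened, before anything is written. A small demonstration on a throwaway temporary file (the path is generated by `tempfile`, not taken from the question):

```python
import os
import tempfile

# Create a throwaway file with some content.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("some article text")

size_before = os.path.getsize(path)   # 17 bytes

f = open(path, "w")                   # opening with 'w' truncates immediately
f.close()                             # nothing was written
size_after = os.path.getsize(path)    # 0 bytes

os.remove(path)
print(size_before, size_after)        # 17 0
```

To get the article's text you would need to open the file with 'r' and call read() first; str(article) only returns a description of the file object, not its contents.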

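Also, filtered_sentence is a list of strings, so the following writes the list's printed form (brackets, quotes, and commas) rather than the words themselves: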

article.write(str(filtered_sentence))

You should do the following instead:

article.write(" ".join(filtered_sentence))
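To see the difference between the two calls, compare their results on a short made-up word list:

```python
filtered_sentence = ["price", "stock", "rising"]  # stand-in for the real word list

as_repr = str(filtered_sentence)        # the list's printable form
as_text = " ".join(filtered_sentence)   # the words, separated by spaces

print(as_repr)  # ['price', 'stock', 'rising']
print(as_text)  # price stock rising
```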

Consider using the with statement, which closes the file automatically; your code never seems to close it.

with open(filename, 'w') as article:
    article.write(" ".join(filtered_sentence))
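A quick check that the file really is closed once the with block ends, and that the text landed on disk. The word list and the temporary path are stand-ins for illustration:

```python
import os
import tempfile

filtered_sentence = ["price", "stock", "rising"]  # stand-in for the real word list
path = os.path.join(tempfile.mkdtemp(), "article_clean.txt")  # throwaway location

with open(path, "w") as article:
    article.write(" ".join(filtered_sentence))

closed_after_block = article.closed   # True: the with statement closed the file

with open(path) as article:           # default mode 'r' reads without truncating
    contents = article.read()

print(closed_after_block, contents)   # True price stock rising
```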

Answer 2

Since you added more context in the comments on the previous answer, I preferred to rewrite everything.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textract import process
from urllib.request import urlopen  # Python 3 replacement for urllib2
import os, re

response = urlopen('http://www.website.com/file.pdf') # Request the file from the internet
tmp_path = 'tmp_path/file.pdf'

with open(tmp_path, 'wb') as tmp_file: # Open a temporary PDF file for binary writing
    tmp_file.write(response.read()) # Write the response content (the PDF file data)

text = process(tmp_path).decode('utf-8') # Extract text from the PDF (textract returns bytes)
text = re.sub("[^a-zA-Z]", " ", text) # Replace every non-letter character with a space

os.remove(tmp_path) # Remove the temporary PDF file

# Lowercase the tokens so they match the lowercase stop-word list
words = [t.lower() for t in word_tokenize(text)]

# No isalpha() filter is needed: the re.sub call above already removed every non-letter character
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in words if w not in stop_words]

with open("path/to/your/article.txt", 'w') as article: # Open the destination file
    article.write(" ".join(filtered_sentence)) # Write all the words separated by a space

/!\ I do not have a Python environment to test this with (I am on a smartphone...), but it should work. If any error occurs, please report it and I will correct it.
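Regarding the Excel size limit mentioned in the question: a single Excel cell holds at most 32,767 characters, so one possible workaround is to write one word per row to a CSV file instead of one long string. This is only a sketch; the word list is a stand-in and the "word" header is made up.

```python
import csv
import io

filtered_sentence = ["price", "stock", "rising"]  # stand-in for the real word list

# An in-memory buffer keeps the demo self-contained;
# for a real file use open("words.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word"])                         # header row
writer.writerows([w] for w in filtered_sentence)  # one word per row

csv_text = buffer.getvalue()
print(csv_text)
```

Excel can open the resulting CSV directly, with each word in its own cell in the first column.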

Read, edit, then save a text (.txt) file as a list
