Question

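I am new to Python programming. I am currently doing natural language processing on text files. The problem is that I have about 200 text files, so it is hard to load each file individually and apply the same method.

Here is my program: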

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
with open("c:/users/user/desktop/datascience/sotu/a41.txt", 'r') as a411:
    a41 = a411.read()
    a41c=word_tokenize(str(a41))
    a41c = [w for w in a41c if not w in sw]

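So I want to apply this method to multiple files. Is there a way to load all the files in one step and apply the same method? I tried this, but it does not work: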

import os
import glob
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist
with open("c:/users/user/desktop/datascience/sotu/stopwords.txt", 'r') as sww:
    sw = sww.read()
for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename=word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]
xqc=FreqDist(filename)

Please help.

Answer 1

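First, the second approach does not work because you never actually load the files you want to examine. In the first (presumably working) example you call word_tokenize on a string holding the file's contents; in the second you call it on the file name itself. Also, your code here is really unclear: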

for filename in glob.glob(os.path.join("c:/users/user/desktop/DataScience/sotu/",'*.txt')):
    filename=word_tokenize(str(filename))
    filename = [w for w in filename if not w in sw]

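Don't use filename for three different things in three lines! The first time it holds the file's path, the second time a list of tokenized words, and the third time the same list of words with the stop words filtered out.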

As another tip, try to give your variables more descriptive names. I'm not familiar with NLP, but someone reading the code may wonder what xqc means.
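To illustrate the naming advice, here is a minimal sketch in which each stage (path, file contents, tokens, filtered tokens) gets its own name. The helper names `tokenize` and `filter_tokens` are hypothetical, and the regex tokenizer is only a stand-in for `word_tokenize` so that the sketch runs without NLTK installed.

```python
import re


def tokenize(text):
    # Stand-in for nltk.tokenize.word_tokenize (an assumption made so the
    # sketch has no NLTK dependency): lowercase runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def filter_tokens(file_path, stop_words):
    """Return the tokens of one file with the stop words removed."""
    with open(file_path, "r", encoding="utf-8") as input_file:
        file_text = input_file.read()      # the file's contents, not its name
    tokens = tokenize(file_text)           # list of words
    filtered_tokens = [t for t in tokens if t not in stop_words]
    return filtered_tokens
```

Because no name is reused, it is clear at every line whether you are holding a path, raw text, or a list of words.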

Here is a snippet from which I hope you can work out how to apply this to your own code.

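import os
from nltk.tokenize import word_tokenize

stopwords_filename = "words.txt"
with open(stopwords_filename, "r") as stopwords_file:
    # Assumes whitespace-separated stop words. A set tests whole words;
    # `w in some_string` would match substrings instead.
    stop_words = set(stopwords_file.read().split())

words_input_dir = "c:/users/user/desktop/DataScience/sotu/"

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        # os.listdir returns bare names, so join them with the directory.
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            input_tokens = word_tokenize(input_file.read())
            # Do everything else.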
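For the "do everything else" part, one possible shape is sketched below: read every .txt file in a directory, drop the stop words, and accumulate a single frequency count. This is only a sketch under assumptions. The function names `load_stop_words` and `count_words`, the `skip` parameter, and the default regex tokenizer are all hypothetical, and `collections.Counter` stands in for `FreqDist` so the block runs without NLTK.

```python
import glob
import os
import re
from collections import Counter


def load_stop_words(path):
    # Assumes whitespace-separated stop words. Returning a set means
    # `word in stop_words` tests whole words rather than substrings.
    with open(path, "r", encoding="utf-8") as stop_file:
        return set(stop_file.read().split())


def count_words(input_dir, stop_words, tokenize=None, skip=()):
    """Count non-stop-word tokens across every .txt file in input_dir."""
    if tokenize is None:
        # Simple stand-in tokenizer (an assumption, not NLTK's behaviour).
        def tokenize(text):
            return re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for path in sorted(glob.glob(os.path.join(input_dir, "*.txt"))):
        if os.path.basename(path) in skip:
            continue  # e.g. a stop-word list stored next to the data files
        with open(path, "r", encoding="utf-8") as input_file:
            tokens = tokenize(input_file.read())
        counts.update(t for t in tokens if t not in stop_words)
    return counts
```

With NLTK installed, you could pass `tokenize=word_tokenize` and wrap the result in `FreqDist(...)`. The `skip` argument is there because in the question stopwords.txt sits in the same folder as the data files, so a `*.txt` pattern would otherwise pick it up as input.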

How to load multiple text files and apply the same algorithm to them using Python
