Question

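I want to remove stop words from multiple files in a local folder. I know how to do it for a single file, but I can't wrap my head around doing it for every file in the folder.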

My awkward attempt:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs


stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = open(afile)
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = ["" if word in stop_words else word for word in words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w')
    appendFile.write(new_words)
    appendFile.close()

I don't even know how far I got, because I get:

Traceback (most recent call last):
  File "C:\Desktop\neg\sw.py", line 14, in <module>
    line = file1.read()
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1757: character maps to <undefined>

I tried using glob, but I couldn't find good documentation for it. Maybe it isn't even necessary?

Answer 1

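It looks like the file is being read with the wrong encoding. You need to call open() with the correct encoding kwarg (probably "utf-8"), and use mode 'a' when you want to append to a file. In fact, I would open the output file before processing any input files and close it only after all of them have been written.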

When filtering out stop words, don't put empty strings into the list; just leave those words out:

words_without_stop_words = [word for word in words if word not in stop_words]
new_words = " ".join(words_without_stop_words).strip()
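
Putting the advice above together, a minimal sketch might look like the following. To stay self-contained it uses a tiny hard-coded stop-word set and a plain whitespace split as stand-ins for NLTK's stopwords.words('english') and word_tokenize. The function name clean_folder and its parameters are hypothetical, "utf-8" is only a guess at the real encoding of the files, and the lower-casing before the stop-word check is an addition that the original code does not have.

```python
import glob
import os

# Stand-in for set(stopwords.words('english')); deliberately tiny (assumption).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def clean_folder(src_dir, out_path, stop_words=STOP_WORDS, encoding="utf-8"):
    """Append the stop-word-free text of every .txt file in src_dir to out_path."""
    # Collect the input paths first so the output file is never read as input.
    out_abs = os.path.abspath(out_path)
    paths = [p for p in sorted(glob.glob(os.path.join(src_dir, "*.txt")))
             if os.path.abspath(p) != out_abs]
    # Open the output once, in append mode, before processing any input file.
    with open(out_path, "a", encoding=encoding) as out:
        for path in paths:
            with open(path, encoding=encoding) as src:
                words = src.read().split()  # stand-in for word_tokenize
            # Skip stop words instead of replacing them with empty strings.
            kept = [w for w in words if w.lower() not in stop_words]
            out.write(" ".join(kept) + "\n")
    return paths
```

Each input file becomes one line of the output file. The directory that holds out_path must already exist, because open() does not create missing folders.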

Answer 2

When writing to a file you must specify the encoding; usually you can use

utf-8

Instead of writing the data to the file, you have to append it, so that the output of all the input files ends up in a single file.

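You can also specify the codec when opening the file, for example:

# 'a' appends; 'w' would truncate the file on every loop iteration
appendFile = open('subfolder/file1.txt','a', encoding='utf-8')
appendFile.write(new_words)
appendFile.close()

and then write the data.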
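
To see why the mode matters inside a loop, here is a small sketch that reopens the same file once per chunk, first with 'w' and then with 'a'. The helper names write_chunks and read_text are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def write_chunks(path, chunks, mode):
    """Reopen path once per chunk in the given mode and write that chunk."""
    for chunk in chunks:
        with open(path, mode, encoding="utf-8") as handle:
            handle.write(chunk + "\n")


def read_text(path):
    """Return the whole file as a str, decoded as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    w_path = os.path.join(folder, "w.txt")
    a_path = os.path.join(folder, "a.txt")
    write_chunks(w_path, ["one", "two"], "w")  # each open truncates the file
    write_chunks(a_path, ["one", "two"], "a")  # each open adds to the end
    print(repr(read_text(w_path)))  # 'two\n'
    print(repr(read_text(a_path)))  # 'one\ntwo\n'
```

With 'w' only the last chunk survives, which is what happens to the question's code when every input file is written to the same output path.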

Answer 3

From the full stack trace, you are on a Windows system with a Western European locale, where the default ANSI code page is 1252.

One of your files contains a 0x9d byte. On read, Python tries to decode the file's bytes into a unicode string and fails, because 0x9d is not a valid CP1252 byte.
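
The failure can be reproduced without any file. One common source of a stray 0x9d byte is a right double quotation mark (U+201D) stored as UTF-8, whose three bytes end in 0x9d. That this is what happened in the asker's file is an assumption, not something the traceback proves. The helper try_decode is hypothetical.

```python
# U+201D (right double quotation mark) encoded as UTF-8 ends in byte 0x9d.
data = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(try_decode(data, "cp1252"))        # None: 0x9d is undefined in cp1252
    print(try_decode(data, "utf-8") == "\u201d")  # True
```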

What to do?

The proper way is to find the offending file and then try to work out its actual encoding. A simple way to find it is to display its name:

for afile in glob.glob("*.txt"):
    with open(afile) as file1:
        try:
            line = file1.read()
        except UnicodeDecodeError as e:
            print("Wrong encoding file", afile, e)       # display file name and error
            continue                                     # skip to next file
    ...

Alternatively, if the error only affects a few words in a few files, you can simply ignore or replace the offending bytes:

for afile in glob.glob("*.txt"):
    with open(afile, errors = "replace") as file1:
        line = file1.read()
    ...
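
The two approaches can also be combined: try a short list of candidate encodings first and fall back to replacing bad bytes only if none of them decodes cleanly. This is only a sketch. The function name read_text_lenient and the candidate list ("utf-8", "cp1252") are assumptions, and a clean decode does not prove that the guessed encoding is the right one.

```python
def read_text_lenient(path, encodings=("utf-8", "cp1252")):
    """Return (text, label), where label names how the bytes were decoded."""
    # Read raw bytes so that no implicit platform-default decoding happens.
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Nothing decoded cleanly: keep going, marking bad bytes with U+FFFD.
    first = encodings[0]
    return raw.decode(first, errors="replace"), first + "+replace"
```

Printing the returned label next to each file name shows which files needed a fallback, so they can be inspected by hand later.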
