Question

I'm a Python beginner, and I'm using this line:

reader = CategorizedPlaintextCorpusReader('~/CorpusMain/',
                                      r'.*\.txt', cat_pattern=r'(\w+)/*')

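Inside my CorpusMain folder I have three more folders, one per category. I need to access the content of every text file in each category separately and build, for each category, a list whose elements are the contents of its text files, e.g. category1 = ['textfile1 content', 'textfile2 content', ... etc]. I want to do this with my reader, meaning going through each file (fileids()) and getting its reader.raw result...

I then need to feed these lists to my CountVectorizer to build a vector for each category.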

Answer 1

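I suggest something like os.listdir, which returns a list of the contents of the path given as its argument.

An example:

For a directory structure like: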

CorpusMain
├ text1.txt
└ text2.txt

text1.txt:

Text 1 content

text2.txt:

Text 2 content

The following code:

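import os

def get_txt_content(path, txt):
    # os.path.join picks the right separator on every platform
    with open(os.path.join(path, txt), 'r') as text_file:
        return text_file.read()

def list_txt_content(path):
    # os.listdir does not expand '~', so do it explicitly
    path = os.path.expanduser(path)
    # os.listdir returns names in arbitrary order; sort for a stable result
    textfiles = sorted(_file for _file in os.listdir(path) if _file.endswith('.txt'))
    return [get_txt_content(path, txt) for txt in textfiles]

print(list_txt_content('~/CorpusMain'))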

will produce a list like this:

['Text 1 content', 'Text 2 content']
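
The question has one subfolder per category rather than files sitting directly under CorpusMain. Here is a minimal sketch of how the same os.listdir idea could be extended to that layout. The function name list_txt_content_by_category and the dict it returns are assumptions made for illustration, and so is reading the files as UTF-8.

```python
import os

def list_txt_content_by_category(root):
    """Map each category subfolder of root to a list of its .txt file contents."""
    root = os.path.expanduser(root)  # expand '~' to the home directory
    contents = {}
    for category in sorted(os.listdir(root)):
        category_dir = os.path.join(root, category)
        if not os.path.isdir(category_dir):
            continue  # skip files that sit directly under root
        texts = []
        for name in sorted(os.listdir(category_dir)):
            if name.endswith('.txt'):
                # UTF-8 is an assumption; change it if the corpus uses another encoding
                with open(os.path.join(category_dir, name), 'r', encoding='utf-8') as f:
                    texts.append(f.read())
        contents[category] = texts
    return contents
```

Each value, for example contents['category1'] (a hypothetical folder name), is then a plain list of strings, which is the kind of input CountVectorizer's fit_transform accepts.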

Hope it helps.
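
Since the question asks about doing this through the reader itself, here is a hedged sketch of that route as well. It relies only on the categories(), fileids(categories=...) and raw(fileids=...) methods of NLTK's categorized corpus readers. The helper name raw_texts_by_category is made up for this example, and the function takes the reader as an argument instead of building it.

```python
def raw_texts_by_category(reader):
    """Return {category: [raw text of each file in that category]}.

    `reader` is expected to be an NLTK categorized corpus reader such as
    CategorizedPlaintextCorpusReader; nothing here imports NLTK directly.
    """
    texts = {}
    for category in reader.categories():
        # fileids(categories=...) limits the listing to one category
        fileids = reader.fileids(categories=[category])
        # raw(fileids=...) returns the content of the given files as one string,
        # so it is called once per file to keep the texts separate
        texts[category] = [reader.raw(fileids=[fid]) for fid in fileids]
    return texts
```

With the reader from the question, raw_texts_by_category(reader)['category1'] would be the list asked for, assuming a folder named category1 exists.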

How do I use a loop to access each text file in my own categorized corpus?