Trying to load a corpus for CountVectorizer in the sklearn package

Asked: 2020-04-07 16:57:03

Tags: corpus countvectorizer

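I'm trying to use a for loop to load a corpus from my local drive into Python, reading each text file one at a time and saving it for analysis with CountVectorizer. However, I only get the last file. How can I keep the text from all of the files so that I can analyze it with CountVectorizer?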

This code only picks up the text from the last file in the folder.

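import glob
import os

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

folder_path = "folder"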

#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f: 
        txt = f.read()
        print(txt)
MyList= [txt]

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyList)
## get col names
ColNames=MyCV1.get_feature_names()
print(ColNames)

## convert DTM to DF

MyDF1 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF1)

This code works, but it isn't practical for the large corpus I'm preparing it for.

#import and read text files 
f1 = open("folder/animal_1.txt",'r')
f1r = f1.read()
f2 = open("folder/animal_2.txt",'r')
f2r = f2.read()
f3 = open("folder/animal_3.txt",'r')
f3r = f3.read()

#reassemble corpus in python
MyCorpus=[f1r, f2r, f3r]

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyCorpus)
## get col names
ColNames=MyCV1.get_feature_names()
print(ColNames)

## convert DTM to DF

MyDF2 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF2)

1 Answer:

Answer 0 (score: 0)

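I figured it out. I just had to keep grinding at it. The list has to be created before the loop, and each file's text has to be appended to it inside the loop; otherwise txt is overwritten on every pass and only the last file survives.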

MyCorpus=[]
#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f: 
        txt = f.read()
        MyCorpus.append(txt)
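
For anyone who wants to reuse this, here is a minimal sketch of the same loop wrapped in a function. The name load_corpus and its pattern and encoding parameters are my own, not part of sklearn, and the UTF-8 default is an assumption about how the files were saved. It also sorts the file names, since glob does not promise any particular order.

```python
import glob
import os


def load_corpus(folder_path, pattern="*.txt", encoding="utf-8"):
    """Return (filenames, texts) for every file in folder_path matching pattern.

    Hypothetical helper for illustration; the encoding default is an assumption.
    """
    # sorted() keeps the document order reproducible across runs
    filenames = sorted(glob.glob(os.path.join(folder_path, pattern)))
    texts = []
    for filename in filenames:
        with open(filename, "r", encoding=encoding) as f:
            # append inside the loop so every file is kept, not only the last one
            texts.append(f.read())
    return filenames, texts
```

The returned texts list can be passed to MyCV1.fit_transform(...) in place of MyCorpus, and the file names can serve as row labels for the DataFrame. Note that in scikit-learn 1.2 and later get_feature_names() is no longer available; get_feature_names_out() is the replacement.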