How to use scikit-learn's load_files and process .txt files?

Asked: 2014-12-20 02:43:27

Tags: python python-2.7 machine-learning scikit-learn nltk

Suppose I have a number of .txt files in a folder on my desktop. They look like this:

File_1:

('this', 'is'), ('a', 'very'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_1'

...

File_N:

('this', 'is'), ('a', 'another'),....., ('large', '.txt'), ('file', 'with'), ('lots', 'of'), ('words', 'like'), ('this', 'i'), ('would', 'like'), ('to', 'create'), ('a', 'matrix'),'LABEL_N'

According to the documentation, scikit-learn provides load_files, and I can vectorize the data using the hashing trick as follows:

from sklearn.feature_extraction.text import FeatureHasher
from sklearn.svm import SVC

training_data = [[('string1', 'string2'), ('string3', 'string4'),
                  ('string5', 'string6'), 'POS'],
                 [('string1', 'string2'), ('string3', 'string4'), 'NEG']]

feature_hasher_vect = FeatureHasher(input_type='string')

X = feature_hasher_vect.transform(((' '.join(x) for x in sample)
                                        for sample in training_data))

print X.toarray()

Output:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

How can I use load_files() (or any other method) to vectorize the whole folder of .txt files, applying the same steps as above?

1 Answer:

Answer 0 (score: 1):

I'm not familiar with scikit-learn, and it may well have something better built in, but if the files are in the format shown, you can do what you describe with something relatively simple, like the function below:

import ast
import glob
import os

def my_load_files(folder, pattern):
    """Yield the parsed contents of every file in folder matching pattern."""
    pathname = os.path.join(folder, pattern)
    for filename in glob.glob(pathname):
        with open(filename) as file:
            # Each file holds a literal Python sequence of ('word', 'word')
            # tuples followed by a label string, so literal_eval parses it
            # safely into a single tuple.
            yield ast.literal_eval(file.read())

text_folder = 'C:/Users/username/Desktop/Samples'
print [[' '.join(x) for x in sample]
                        for sample in my_load_files(text_folder, 'File_*')]

Note: Since there is a label at the end of each file (and of your training_data as well), you may want to strip it off rather than pass it to the feature_hasher_vect.transform() method.
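
A minimal sketch of how that could look, assuming the label is always the last element of each parsed sample; the sample[:-1] slicing and the separate y list of labels are illustrative additions, not part of the original answer:

from sklearn.feature_extraction.text import FeatureHasher

feature_hasher_vect = FeatureHasher(input_type='string')

# Parse every file once so the word pairs and labels stay aligned.
samples = list(my_load_files(text_folder, 'File_*'))

# Hash only the word-pair tuples; sample[:-1] drops the trailing label.
X = feature_hasher_vect.transform((' '.join(x) for x in sample[:-1])
                                  for sample in samples)

# Keep the labels separately, e.g. for training a classifier later.
y = [sample[-1] for sample in samples]

print X.toarray()
print y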