I'm new to Python programming, so figuring out how to write more advanced operations has been a challenge for me.
My assignment is to compute TF-IDF for a corpus of 10 documents. I'm stuck, though, on how to tokenize the corpus and print out the number of tokens and the number of unique tokens.
If anyone can help, or even point me in the right direction, it would be much appreciated!
Answer 0 (score: 0)
This might help.
I had a collection of individual text files that I wanted to extract and fit/transform with a TfidfVectorizer. This walks through the process of extracting the files and using the TfidfVectorizer.
I went to kaggle to get some sample data about movie reviews.
I used the negative reviews. For my purposes it doesn't matter what the data is, I just needed some text data.
Import the required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How will these packages be used?
We'll use pandas to stage the data for the TfidfVectorizer.
glob will be used to collect the file locations.
TfidfVectorizer is the star of the show.
Collect the file locations using glob
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
    ls_documents.append(name)
This produces a list of file locations.
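Since glob.glob already returns a list, the loop above could also be collapsed into a single line (sorted is optional, added here only to make the ordering deterministic):

ls_documents = sorted(glob.glob('/location/to/folder/with/document/files/*'))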
Read the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of texts.
Load the text into pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by replacing any null values with an empty string
df_text['clean_text'] = df_text['raw_text'].fillna('')
You may choose to do some other cleaning as well. It's useful to keep the raw data and create a separate 'clean' column.
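As an example, here is a minimal sketch of extra cleaning you might add at this point (lowercasing and stripping non-word characters; this is purely illustrative and reuses the clean_text column created above):

df_text['clean_text'] = (df_text['clean_text']
                         .str.lower()
                         .str.replace(r'[^\w\s]', ' ', regex=True))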
Create a tfidf object, and give it English stop words
tfidf = TfidfVectorizer(stop_words='english')
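If you want more control over the vocabulary, TfidfVectorizer accepts further parameters. The ones below all exist in scikit-learn; the values shown are just the defaults, spelled out for illustration:

tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1, 1),  # unigrams only
                        min_df=1,            # keep terms appearing in at least 1 document
                        max_df=1.0)          # drop terms appearing in more than this fraction of documents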
Fit and transform the clean_text created above by passing the clean_text series to tfidf
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf
tfidf.get_feature_names()
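(Depending on your scikit-learn version, this method may be tfidf.get_feature_names_out() instead; get_feature_names() was removed in newer releases.)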
You will see something like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my example, the shape is
(10, 1733)
which roughly means that 1733 words (i.e. tokens) describe the 10 documents
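Since the question also asks for the number of tokens and the number of unique tokens, here is one possible way to get them from the same vectorizer (a sketch that reuses the tfidf and df_text objects above; build_analyzer() returns the same tokenizer the vectorizer uses internally, so these counts are after stop-word removal):

analyzer = tfidf.build_analyzer()
all_tokens = [token for text in df_text['clean_text'] for token in analyzer(text)]
print('Total tokens:', len(all_tokens))
print('Unique tokens:', len(tfidf.vocabulary_))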
If you are not sure what to do from here, you may find these two articles useful.
Answer 1 (score: 0)
I took a fun approach to this. I'm using the same data that @the_good_pony provided, so I'll use the same path.
We'll use the os and re modules, because regular expressions are fun and challenging!
import os
import re
# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'

# Instantiate an empty dictionary
ddict = {}

# We're going to walk our directory
for root, subdirs, filenames in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}
        # Create a path to the subdirectory
        subroot = os.path.join(root, d)
        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]
        # For each file in the file list, open and read it into the
        # sub-dictionary
        for f in file_list:
            # basename = the last component of the path, i.e. the filename
            fkey = os.path.basename(f)
            # Read the file and store it as the sub-dictionary value
            with open(f, 'r') as infile:
                ddict[d][fkey] = infile.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if the element is in the dictionary:
    # add 1 if yes, or set it to 1 if no
    for i in iterable:
        if i in output_dict.keys():
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
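A quick usage example for val_counter (the input list is made up purely for illustration); note that collections.Counter from the standard library does the same job if you prefer:

print(val_counter(['a', 'b', 'a', 'c', 'a']))   # {'a': 3, 'b': 1, 'c': 1}

from collections import Counter
print(Counter(['a', 'b', 'a', 'c', 'a']))       # Counter({'a': 3, 'b': 1, 'c': 1})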
Using regular expressions (which I'm admittedly over-relying on here), we can clean the text in each corpus and capture the alphanumeric items into a list. I added an option to keep small words (1 character, in this case), but getting stop words wouldn't be hard.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()
    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with single-space
        clear_ws_pat = r'\s+'
        # Find nonalphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Respace whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleand = clean_corpus(corpus[dirname][k])
        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleand)
        else:
            # limit to results > 1 char in length
            tokens = [i for i in p.findall(cleand) if len(i) > 1]
        for i in tokens:
            if i in count_dict.keys():
                count_dict[i] += 1
            else:
                count_dict[i] = 1
    # Return dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
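Since the question asks for the number of tokens and the number of unique tokens, both fall straight out of this dictionary (using the pos_result_dict built above):

total_tokens = sum(pos_result_dict.values())   # every token occurrence
unique_tokens = len(pos_result_dict)           # distinct tokens
print('Total tokens:', total_tokens)
print('Unique tokens:', unique_tokens)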
Final processing and printing:
# Create dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)
# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)
# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)
# Top half of results. We could shrink this even further, if necessary
top_dict = {k: v for k, v in pos_result_dict.items() if v >= mean_dict}
# This is probably your TF-IDF part
tot_count = sum([v for v in top_dict.values()])
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
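To be clear, the percentages printed above are essentially the term-frequency (TF) side. For the IDF side you would also need, for each term, the number of documents that contain it. A rough sketch (it reuses ddict and top_dict from above and relies on a plain substring check, so it only approximates a real token match):

import math

n_docs = len(ddict['pos'])
for word in list(top_dict)[:10]:   # first 10 of the top words, just as a demo
    doc_freq = sum(1 for text in ddict['pos'].values() if word in text)
    idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
    print(word, 'tf:', top_dict[word], 'idf:', round(idf, 4), 'tf-idf:', round(top_dict[word] * idf, 4))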