I'm new to Python programming, so figuring out how to write more advanced operations has been a challenge for me.
My assignment is to compute TF-IDF for a corpus of 10 documents. I'm stuck, though, on how to tokenize the corpus and print out the number of tokens and the number of unique tokens.
If anyone can help, or even point me in the right direction, it would be much appreciated!
Answer 0 (score: 0)
This might help.
I had a collection of individual text files that I wanted to extract and fit/transform with a TfidfVectorizer. This walks through the process of extracting the files and using the TfidfVectorizer.
I went to kaggle to get some sample data about movie reviews.
I used the negative reviews. For my purposes it doesn't matter what the data is, I just needed some text data.
Import the required packages
import pandas as pd
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
How will these packages be used?
We'll use pandas to stage the data for the TfidfVectorizer.
glob will be used to collect the file locations.
TfidfVectorizer is the star of the show.
Collect the file locations using glob
ls_documents = []
for name in glob.glob('/location/to/folder/with/document/files/*'):
    ls_documents.append(name)
This produces a list of file locations.
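Since glob.glob already returns a list, the loop above could also be collapsed into a single line (sorted is optional, added here only to make the ordering deterministic):

ls_documents = sorted(glob.glob('/location/to/folder/with/document/files/*'))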
Read the data from the first 10 files
ls_text = []
for document in ls_documents[:10]:
    with open(document, "r") as f:
        ls_text.append(f.read())
We now have a list of texts.
Load the text into pandas
df_text = pd.DataFrame(ls_text)
Rename the column to make it easier to work with
df_text.columns = ['raw_text']
Clean the data by replacing any null values with an empty string
df_text['clean_text'] = df_text['raw_text'].fillna('')
You may choose to do some other cleaning as well. It's useful to keep the raw data and create a separate 'clean' column.
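As an example, here is a minimal sketch of extra cleaning you might add at this point (lowercasing and stripping non-word characters; this is purely illustrative and reuses the clean_text column created above):

df_text['clean_text'] = (df_text['clean_text']
                         .str.lower()
                         .str.replace(r'[^\w\s]', ' ', regex=True))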
Create a tfidf object, and give it English stop words
tfidf = TfidfVectorizer(stop_words='english')
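If you want more control over the vocabulary, TfidfVectorizer accepts further parameters. The ones below all exist in scikit-learn; the values shown are just the defaults, spelled out for illustration:

tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1, 1),  # unigrams only
                        min_df=1,            # keep terms appearing in at least 1 document
                        max_df=1.0)          # drop terms appearing in more than this fraction of documents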
Fit and transform the clean_text created above by passing the clean_text series to tfidf
tfidf_matrix = tfidf.fit_transform(df_text['clean_text'])
You can see the feature names from tfidf
tfidf.get_feature_names()
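(Depending on your scikit-learn version, this method may be tfidf.get_feature_names_out() instead; get_feature_names() was removed in newer releases.)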
You will see something like this
['10',
'13',
'14',
'175',
'1960',
'1990s',
'1997',
'20',
'2001',
'20th',
'2176',
'60',
'80',
'8mm',
'90',
'90s',
'_huge_',
'aberdeen',
'able',
'abo',
'accent',
'accentuate',
'accident',
'accidentally',
'accompany',
'accurate',
'accused',
'acting',
'action',
'actor',
....
]
You can look at the shape of the matrix
tfidf_matrix.shape
In my example, the shape is
(10, 1733)
which roughly means that 1733 words (i.e. tokens) describe the 10 documents
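Since the question also asks for the number of tokens and the number of unique tokens, here is one possible way to get them from the same vectorizer (a sketch that reuses the tfidf and df_text objects above; build_analyzer() returns the same tokenizer the vectorizer uses internally, so these counts are after stop-word removal):

analyzer = tfidf.build_analyzer()
all_tokens = [token for text in df_text['clean_text'] for token in analyzer(text)]
print('Total tokens:', len(all_tokens))
print('Unique tokens:', len(tfidf.vocabulary_))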
If you are not sure what to do from here, you may find these two articles useful.
Answer 1 (score: 0)
I took a fun approach to this. I'm using the same data that @the_good_pony provided, so I'll use the same path.
We'll use the os and re modules, because regular expressions are fun and challenging!
import os
import re
# Path to where our data is located
base_path = r'C:\location\to\folder\with\document\files'

# Instantiate an empty dictionary
ddict = {}

# We're going to walk our directory
for root, subdirs, filenames in os.walk(base_path):
    # For each subdirectory ('neg' and 'pos', in this case)
    for d in subdirs:
        # Create a NEW dictionary with the subdirectory name as key
        ddict[d] = {}
        # Create a path to the subdirectory
        subroot = os.path.join(root, d)
        # Get a list of files for the directory
        # Save time by creating a new path for each file
        file_list = [os.path.join(subroot, i) for i in os.listdir(subroot) if i.endswith('txt')]
        # For each file in the file list, open and read it into the
        # sub-dictionary
        for f in file_list:
            # basename = the last component of the path, i.e. the filename
            fkey = os.path.basename(f)
            # Read the file and store it as the sub-dictionary value
            with open(f, 'r') as infile:
                ddict[d][fkey] = infile.read()
Sample counts:
len(ddict.keys()) # 2 top-level subdirectories
len(ddict['neg'].keys()) # 1000 files in our 'neg' subdirectory
len(ddict['pos'].keys()) # 1000 files in our 'pos' subdirectory
# sample file content
# use two keys (subdirectory name and filename)
dirkey = 'pos'
filekey = 'cv000_29590.txt'
test1 = ddict[dirkey][filekey]
Output:
'films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , o [...]'
### Simple counter dictionary function
def val_counter(iterable, output_dict=None):
    # Instantiate a new dictionary
    if output_dict is None:
        output_dict = dict()
    # Check if the element is in the dictionary:
    # add 1 if yes, or set it to 1 if no
    for i in iterable:
        if i in output_dict.keys():
            output_dict[i] += 1
        else:
            output_dict[i] = 1
    return output_dict
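A quick usage example for val_counter (the input list is made up purely for illustration); note that collections.Counter from the standard library does the same job if you prefer:

print(val_counter(['a', 'b', 'a', 'c', 'a']))   # {'a': 3, 'b': 1, 'c': 1}

from collections import Counter
print(Counter(['a', 'b', 'a', 'c', 'a']))       # Counter({'a': 3, 'b': 1, 'c': 1})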
Using regular expressions (which I'm admittedly over-relying on here), we can clean the text in each corpus and capture the alphanumeric items into a list. I added an option to keep small words (1 character, in this case), but getting stop words wouldn't be hard.
def wordcounts(corpus, dirname='pos', keep_small_words=False, count_dict=None):
    if count_dict is None:
        count_dict = dict()
    get_words_pat = r'(?:\s*|\n*|\t*)?([\w]+)(?:\s*|\n*|\t*)?'
    p = re.compile(get_words_pat)

    def clean_corpus(x):
        # Replace all whitespace with single-space
        clear_ws_pat = r'\s+'
        # Find nonalphanumeric characters
        remove_punc_pat = r'[^\w+]'
        tmp1 = re.sub(remove_punc_pat, ' ', x)
        # Respace whitespace and return
        return re.sub(clear_ws_pat, ' ', tmp1)

    # List of our files from the subdirectory
    keylist = list(corpus[dirname])
    for k in keylist:
        cleand = clean_corpus(corpus[dirname][k])
        # Tokenize based on size
        if keep_small_words:
            tokens = p.findall(cleand)
        else:
            # limit to results > 1 char in length
            tokens = [i for i in p.findall(cleand) if len(i) > 1]
        for i in tokens:
            if i in count_dict.keys():
                count_dict[i] += 1
            else:
                count_dict[i] = 1
    # Return dictionary once complete
    return count_dict
### Dictionary sorted lambda function
dict_sort = lambda d, descending=True: dict(sorted(d.items(), key=lambda x: x[1], reverse=descending))
# Run our function for positive corpus values
pos_result_dict = wordcounts(ddict, 'pos')
pos_result_dict = dict_sort(pos_result_dict)
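Since the question asks for the number of tokens and the number of unique tokens, both fall straight out of this dictionary (using the pos_result_dict built above):

total_tokens = sum(pos_result_dict.values())   # every token occurrence
unique_tokens = len(pos_result_dict)           # distinct tokens
print('Total tokens:', total_tokens)
print('Unique tokens:', unique_tokens)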
Final processing and printing:
# Create dictionary of how frequent each count value is
freq_dist = val_counter(pos_result_dict.values())
freq_dist = dict_sort(freq_dist)
# Stats functions
k_count = lambda x: len(x.keys())
sum_vals = lambda x: sum([v for k, v in x.items()])
calc_avg = lambda x: sum_vals(x) / k_count(x)
# Get mean (arithmetic average) of word counts
mean_dict = calc_avg(pos_result_dict)
# Top half of results. We could shrink this even further, if necessary
top_dict = {k: v for k, v in pos_result_dict.items() if v >= mean_dict}
# This is probably your TF-IDF part
tot_count = sum([v for v in top_dict.values()])
for k, v in top_dict.items():
    pct_ = round(v / tot_count, 4)
    print('Word: ', k, ', count: ', v, ', %-age: ', pct_)
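To be clear, the percentages printed above are essentially the term-frequency (TF) side. For the IDF side you would also need, for each term, the number of documents that contain it. A rough sketch (it reuses ddict and top_dict from above and relies on a plain substring check, so it only approximates a real token match):

import math

n_docs = len(ddict['pos'])
for word in list(top_dict)[:10]:   # first 10 of the top words, just as a demo
    doc_freq = sum(1 for text in ddict['pos'].values() if word in text)
    idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
    print(word, 'tf:', top_dict[word], 'idf:', round(idf, 4), 'tf-idf:', round(top_dict[word] * idf, 4))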