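I need to create a dictionary in Python that consists of all the words in a text, with no spaces or symbols other than letters. Later I will need to create an M x N matrix, where M is the number of lines in the original text and N is the number of words in the dictionary. My code is as follows:

import re
import numpy as np

L = []  # one list of tokens per line of the file
with open('cat_sentences.txt') as f:
    for line in f:
        L.append(re.split('[^a-z]', line.lower()))
lines = len(L)  # M: number of lines in the text

# Flatten the tokens, dropping the empty strings that re.split leaves behind
L1 = []
for element in L:
    for simbol in element:
        if simbol != '':
            L1.append(simbol)

# Map each distinct word to a column index; N = len(D)
wcount = 0
D = dict()
for element in L1:
    if element not in D:
        D[element] = wcount
        wcount += 1
print(D)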
Answer 0 (score: 0)
Maybe this is what you need (if I understand your requirements correctly):
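import re
from collections import Counter

import pandas as pd  # needed for pd.DataFrame below

text = 'Hello this is text - yes it is'
# Split on runs of non-letters and drop any empty tokens
text_list = [w for w in re.split('[^a-z]+', text.lower()) if w]
count = Counter(text_list)
# One row of counts, with the columns sorted alphabetically
df = pd.DataFrame(count, index=[0]).sort_index(axis=1)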
In this case you will get the following DataFrame:
   hello  is  it  text  this  yes
0      1   2   1     1     1    1
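If what you are after is the M x N matrix from the question, the same tokenization can be applied line by line. The following is only a minimal sketch, assuming raw word counts are the values you want; the sample `lines` list and the helper names `tokenize` and `build_count_matrix` are made up for illustration and use the standard library only:

```python
import re


def tokenize(line):
    """Return the lowercase runs of letters a-z found in a line."""
    return re.findall(r'[a-z]+', line.lower())


def build_count_matrix(lines):
    """Return (vocab, matrix): vocab maps word -> column, matrix is M x N counts."""
    # First pass: give every distinct word a column index, in order of appearance.
    vocab = {}
    for line in lines:
        for word in tokenize(line):
            if word not in vocab:
                vocab[word] = len(vocab)
    # Second pass: one row per line, one column per word.
    matrix = [[0] * len(vocab) for _ in lines]
    for i, line in enumerate(lines):
        for word in tokenize(line):
            matrix[i][vocab[word]] += 1
    return vocab, matrix


# Illustrative input; in the question these would be the lines of the file.
lines = ['Hello this is text - yes it is', 'Is this text?']
vocab, matrix = build_count_matrix(lines)
print(vocab)
print(matrix)
```

If you need a NumPy array rather than a nested list, np.array(matrix) gives one of shape (M, N).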
Or perhaps you need the following vectorization (but which values do you need?):
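import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

text_list = []
with open('cat_sentences.txt') as f:
    for line in f:
        # str.replace does not take a regex, so use re.sub to strip punctuation
        text_list.append(re.sub(r'[^\w\s]', ' ', line.lower()).strip())
print(text_list)

tfidf_v = TfidfVectorizer(min_df=1, stop_words=None)
X = tfidf_v.fit_transform(text_list)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
data = pd.DataFrame(data=X.toarray(), columns=tfidf_v.get_feature_names_out(), index=text_list)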
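In this case you will get a DataFrame in which each row is a line of the text, each column name is a word, and the values are TF-IDF weights, i.e. word frequencies scaled down for words that appear in many lines (you can read more here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)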
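To make those values less opaque, here is a rough standard-library sketch of the weighting, assuming the documented defaults of TfidfVectorizer (smooth_idf=True, norm='l2'): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_rows` and the sample `docs` are made up for illustration, and the tokenizer is a simplification (scikit-learn's default ignores one-character tokens), so the numbers may differ from the library's on real text:

```python
import math
import re
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows) of L2-normalized, smoothed TF-IDF weights."""
    tokenized = [re.findall(r'[a-z]+', d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (assumed to match smooth_idf=True).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[w] * idf[w] for w in vocab]
        # Scale the row to unit Euclidean length; leave all-zero rows as they are.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Illustrative documents; each one plays the role of a line of the file.
docs = ['the cat sat', 'the dog sat down', 'cat and dog']
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

A word that appears in every line still gets a non-zero weight here, because of the "+ 1" in the idf term.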