I have raw data in a string, basically several groups of keywords, in this form:
Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System
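How can I load this into a data frame so that I can count repetitions and return the count for each word?
Edit: I have already converted it to the following format: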
Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System
I want to convert it into a data frame because I ultimately want to predict which word is most likely to be repeated.
Answer 0: (score: 0)
Use a dictionary:
word_count_dict = {}
with open("Yourfile.txt") as file_stream:
    for line in file_stream:
        # Split on commas; a line without commas yields a single item
        for item in line.split(","):
            # Drop surrounding spaces and the trailing newline
            item = item.strip()
            if not item:
                continue
            word_count_dict[item] = word_count_dict.get(item, 0) + 1
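You now have a count for every word. To rank the words by probability, divide each count by the total number of occurrences:
total = float(sum(word_count_dict.values()))
probability_words = {k: v / total for k, v in word_count_dict.items()}
probability_words now maps each word to its relative frequency of occurrence.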
Sort by probability in descending order (sorting the items, not just the keys, so each entry is a (word, probability) pair):
sorted_probability_words = sorted(probability_words.items(), key=lambda x: x[1], reverse=True)
The first element is the one with the highest probability:
print(sorted_probability_words[0])     # the (word, probability) pair
print(sorted_probability_words[0][0])  # the word itself
print(sorted_probability_words[0][1])  # the probability of that word
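The same counting and ranking can also be sketched with collections.Counter from the standard library. This is only an illustration, not the answerer's code: it assumes the data is already in the single comma-separated string from the question's edit, and the variable names (raw, words, counts, probabilities) are made up for the example.

```python
from collections import Counter

# Assumed input: the comma-separated string from the question's edit
raw = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System"

# Split on commas and ignore stray whitespace or empty pieces
words = [w.strip() for w in raw.split(",") if w.strip()]

# Count occurrences of each word
counts = Counter(words)
total = sum(counts.values())

# Relative frequency of each word, most frequent first
probabilities = {word: n / total for word, n in counts.most_common()}

# The most frequent word, its count, and its relative frequency
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count, probabilities[top_word])
```

Note that a relative frequency computed this way only describes the data already seen; treating it as the chance of a future repetition is a modelling assumption.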
Answer 1: (score: 0)
import pandas as pd

# One row per keyword
df = pd.DataFrame({
    'name': ['Law', 'of', 'three', 'stages', 'Alienation', 'Social', 'Facts', 'Theory', 'of', 'Social', 'System']
})
print(df)

# Split any entry that still contains spaces or commas into separate words,
# give each word its own row, then count occurrences (most frequent first)
word_freq = df['name'].str.split(r'[ ,]+').explode().value_counts()
print(word_freq)
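Since the question starts from a single comma-separated string and ultimately asks about likelihood, here is a rough sketch that goes from that string to a data frame with both counts and relative frequencies. The names raw, words and freq_table and the column labels are assumptions chosen for illustration; they do not come from the original answer.

```python
import pandas as pd

# Assumed input: the comma-separated string from the question's edit
raw = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System"

# One keyword per row, with surrounding whitespace removed
words = pd.Series(raw.split(","), name="word").str.strip()

# Occurrences of each distinct word, most frequent first
counts = words.value_counts()

# One row per distinct word: its count and its share of all occurrences
freq_table = pd.DataFrame({
    "count": counts,
    "probability": counts / counts.sum(),
})
print(freq_table)
```

The first rows of freq_table are the most frequent words, so freq_table.head(1) gives the top candidate under the assumption that past frequency is a reasonable stand-in for future likelihood.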