Question

I have raw data in a string, essentially multiple keywords, in the following form:

Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System

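How can I import this into a DataFrame so that I can count repetitions and return a count for each word?

Edit: I have converted it to the following format: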

 Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System

I want to convert it to a DataFrame because I ultimately want to predict which word is most likely to be repeated.

Answer 1

Use a dictionary:

word_count_dict = {}
with open("Yourfile.txt") as file_stream:
    for line in file_stream:
        # split(",") returns a one-element list when the line has no comma
        for item in line.split(","):
            item = item.strip()  # drop surrounding spaces and the trailing newline
            if not item:
                continue  # skip blank lines and empty fields
            word_count_dict[item] = word_count_dict.get(item, 0) + 1
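
As an alternative to maintaining the dictionary by hand, the same counting can be sketched with collections.Counter from the standard library. This is only an illustration: the raw string is the sample data from the question rather than a file, and count_words is a hypothetical helper name that does not appear in the original answer.

```python
from collections import Counter

# Sample data copied from the question (an assumption: real input would come from a file).
raw = """Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System"""


def count_words(text):
    """Count comma-separated keywords across all lines of text (hypothetical helper)."""
    words = []
    for line in text.splitlines():
        for item in line.split(","):
            item = item.strip()  # remove spaces around each keyword
            if item:             # ignore empty fields
                words.append(item)
    return Counter(words)


word_counts = count_words(raw)
print(word_counts.most_common(3))
```

Counter behaves like a dictionary, so the probability step that follows works on it unchanged.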

You now have the count for every word. If you want an ordering based on probability, divide each count by the total number of occurrences:

# float() keeps the division below a true division on Python 2 as well
total = float(sum(word_count_dict.values()))
probability_words = {k: v / total for k, v in word_count_dict.items()}

probability_words now maps each word to its relative frequency, that is, the chance of that particular word occurring.

Sort by probability in descending order:

sorted_probability_words = sorted(probability_words.items(), key=lambda x: x[1], reverse=True)

Get the first element, which has the highest probability:

print(sorted_probability_words[0])     # the (word, probability) pair
print(sorted_probability_words[0][0])  # the most probable word
print(sorted_probability_words[0][1])  # that word's probability
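
For reference, the whole count, normalise, and rank sequence can be sketched end to end on the single-line format from the question's edit. This is a minimal sketch under the assumption that the data is already in memory as one comma-separated string; the names line, counts, ranked, top_word, and top_prob are introduced here for illustration only.

```python
from collections import Counter

# The single-line format shown in the question's edit.
line = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System"

# Count each keyword, ignoring surrounding whitespace and empty fields.
counts = Counter(w.strip() for w in line.split(",") if w.strip())

# Convert counts to relative frequencies, most frequent first.
total = sum(counts.values())
ranked = [(word, n / total) for word, n in counts.most_common()]

top_word, top_prob = ranked[0]
print(top_word, round(top_prob, 3))
```

Note that in this sample "of" and "Social" both occur twice, so the top position is a tie, and which of the two comes first depends on how ties are ordered.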

Answer 2

import pandas as pd

# One keyword per row
df = pd.DataFrame({
    'name': ['Law', 'of', 'three', 'stages', 'Alienation', 'Social', 'Facts', 'Theory', 'of', 'Social', 'System']
})

# Each row already holds a single keyword, so only stray whitespace needs removing
df['name'] = df['name'].str.strip()

print(df)

# Count how often each keyword occurs, most frequent first
word_freq = df['name'].value_counts()
print(word_freq)
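
If the starting point is the raw multi-line string from the question rather than a pre-split list, one possible way to reach a DataFrame of counts is sketched below. It assumes pandas 0.25 or later (for Series.explode); the column names word, count, and probability are choices made here and are not part of the original answer.

```python
import pandas as pd

# Raw data as given in the question: several comma-separated keywords per line.
raw = """Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System"""

# One row per line, split each line on commas, then one row per keyword.
words = pd.Series(raw.splitlines()).str.split(",").explode().str.strip()
words = words[words != ""]  # drop empty fields, if any

# Build a DataFrame with one row per distinct word and its count.
df = words.value_counts().rename_axis("word").reset_index(name="count")

# Relative frequency of each word.
df["probability"] = df["count"] / df["count"].sum()
print(df)
```

The resulting frame is sorted by count in descending order, so its first rows hold the words most likely to repeat.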

Convert raw data to a pandas DataFrame?