Converting raw data to a pandas DataFrame?

Time: 2020-05-29 12:11:15

Tags: python pandas dataframe

I have raw data in a string, basically multiple keywords, in this form:

Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System

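How can I load this into a DataFrame so that I can count repetitions and return the count for each word?

Edit: I have already converted it to the following format: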

 Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System

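I want to convert this to a DataFrame because I ultimately want to predict which word is most likely to be repeated.

2 answers:

Answer 0 (score: 0)

Use a dictionary: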

word_count_dict = {}
with open("Yourfile.txt") as file_stream:
    for line in file_stream:
        # Split on commas; a line without commas yields a single item
        for item in line.split(","):
            item = item.strip()  # drop surrounding spaces and the trailing newline
            if not item:
                continue  # skip empty entries such as blank lines
            word_count_dict[item] = word_count_dict.get(item, 0) + 1

You now have the count for every word. If you want to rank the words by probability, divide each count by the total number of occurrences:

# Python 3: use values()/items() (itervalues()/iteritems() exist only in Python 2)
total = sum(word_count_dict.values())
probability_words = {k: v / total for k, v in word_count_dict.items()}

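probability_words now maps each word to its relative frequency, which serves as the estimated chance of that word occurring.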

Sort by probability in descending order:

# Sort the (word, probability) pairs, not just the keys
sorted_probability_words = sorted(probability_words.items(), key=lambda x: x[1], reverse=True)

The first element is the word with the highest probability:

print(sorted_probability_words[0])     # the (word, probability) pair
print(sorted_probability_words[0][0])  # the most probable word
print(sorted_probability_words[0][1])  # its probability
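
As an alternative sketch of the same counting idea, the standard library's collections.Counter can replace the hand-written dictionary. The raw string below is copied from the question; the variable names (raw, words, counts, probabilities) are illustrative, and treating relative frequency as a "probability" is a simplifying assumption rather than a real predictive model.

```python
from collections import Counter

# Raw keywords as given in the question (one comma-separated group per line).
raw = """Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System"""

# Split each line on commas and strip surrounding whitespace from every word.
words = [w.strip() for line in raw.splitlines() for w in line.split(",") if w.strip()]

counts = Counter(words)
total = sum(counts.values())

# Relative frequency of each word, used here as a naive probability estimate.
probabilities = {word: n / total for word, n in counts.items()}

# most_common(1) returns a one-element list holding the (word, count) pair
# with the highest count.
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count, probabilities[top_word])
```

In this sample "of" and "Social" both occur twice, so which of the two is reported first depends on how ties are broken.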

Answer 1 (score: 0)

import pandas as pd

df = pd.DataFrame({
    'name': ['Law', 'of', 'three', 'stages', 'Alienation', 'Social', 'Facts', 'Theory', 'of', 'Social', 'System']
})

# Strip stray whitespace so identical words compare equal
df['name'] = df['name'].str.strip()

print(df)

# Count how many times each word occurs, most frequent first
word_freq = df['name'].value_counts()
print(word_freq)
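
For the original multi-line raw string from the question, a minimal sketch along the same lines might look as follows. The column names word, count and probability are illustrative choices, and the "probability" here is only the relative frequency within this sample.

```python
import pandas as pd

# Raw keywords as given in the question.
raw = """Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System"""

# Treat both commas and newlines as separators, then strip stray spaces.
words = pd.Series(raw.replace("\n", ",").split(",")).str.strip()
words = words[words != ""]  # drop empty entries, if any

# Build a DataFrame with one row per distinct word and its count.
freq = (
    words.value_counts()
    .rename_axis("word")
    .reset_index(name="count")
)

# Relative frequency of each word within the sample.
freq["probability"] = freq["count"] / freq["count"].sum()
print(freq)
```

If only the fractions are needed, words.value_counts(normalize=True) returns them directly as a Series.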

