How do I remove duplicates from a frequency table?

Time: 2019-10-16 02:49:17

Tags: python pandas duplicates

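I'm a beginner at coding and have written some code that counts how often each word occurs and then puts the result into a table using the pandas package, but I need to remove the duplicate rows it generates.

I followed an online tutorial on removing duplicates, but my current code still doesn't work, as shown in the second input. Any feedback would be appreciated.

Input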

txt = "chilli mango chilli mango grape"
words = txt.split()
for word in words:
    print(word + " " + str(txt.count(word)))
import pandas as pd
mytable = pd.DataFrame()
for word in words:
    tempdf = pd.DataFrame({"word" : [word], "frequency" : [txt.count(word)]})
    mytable = mytable.append(tempdf)
    print(mytable)

Output

 chilli 2
 mango 2
 chilli 2
 mango 2
 grape 1

 word  frequency
 0  chilli          2
 word  frequency
 0  chilli          2
 0   mango          2
 word  frequency
 0  chilli          2
 0   mango          2
 0  chilli          2
 word  frequency
 0  chilli          2
 0   mango          2
 0  chilli          2
 0   mango          2
 word  frequency
 0  chilli          2
 0   mango          2
 0  chilli          2
 0   mango          2
 0   grape          1

Input

data = mytable
data.sort_values("First name", inplace = True)
data.drop_duplicates(subset = "First name", 
                 keep = False, inplace = True)
print(data)

2 answers:

Answer 0: (score: 1)

You can use a dict:

dct = {}
for word in txt.split():
    if word not in dct:
        dct[word] = 1
    else:
        dct[word] += 1

frequency = pd.Series(dct)

The pandas way:

frequency = pd.Series(txt.split()).value_counts()
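
If a two-column table is wanted rather than a Series, the `value_counts` result can be reshaped. This is a minimal sketch, assuming the column names `word` and `frequency` from the question; the variable name `table` is only illustrative.

```python
import pandas as pd

txt = "chilli mango chilli mango grape"

# Count each distinct word once; value_counts sorts by count, descending.
frequency = pd.Series(txt.split()).value_counts()

# Name the index "word", then move it into a regular column next to the counts.
table = frequency.rename_axis("word").reset_index(name="frequency")
print(table)
```

Each word appears in exactly one row, so there is nothing left to de-duplicate afterwards.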

Answer 1: (score: 0)

collections.Counter is also designed for exactly this kind of task and can easily be converted to a pandas DataFrame.

from collections import Counter
import pandas as pd
txt = "chilli mango chilli mango grape"
words = txt.split()
counts = Counter(words)  # Counter({'chilli': 2, 'mango': 2, 'grape': 1})
df = pd.DataFrame(counts.items(), columns=["Word", "Frequency"])  # same data as a dataframe

You can also build the DataFrame like this, which avoids creating duplicates in the first place:

mytable = pd.DataFrame(columns=["word", "frequency"]).set_index("word")
for word in words:
    if word in mytable.index:
        mytable.loc[word] += 1
    else:
        mytable.loc[word] = 1

That said, your existing code should work if you remove keep = False (which tells it to drop all duplicates, including the first copy) and change "First name" to "word".
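
Put together, a corrected version of the question's approach might look like the sketch below. It assumes a recent pandas, where `DataFrame.append` is no longer available (it was removed in pandas 2.0), so the rows are collected in a list first. It also swaps `txt.count(word)` for `words.count(word)`, which is my own substitution: `str.count` counts substrings, so it would also match a word inside a longer word.

```python
import pandas as pd

txt = "chilli mango chilli mango grape"
words = txt.split()

# One row per word occurrence, as in the question (duplicates included).
rows = [{"word": word, "frequency": words.count(word)} for word in words]
mytable = pd.DataFrame(rows)

# Sort by the real column name and keep the first copy of each word.
# The default keep="first" is used here; keep=False would drop every repeated word.
data = (
    mytable.sort_values("word")
    .drop_duplicates(subset="word")
    .reset_index(drop=True)
)
print(data)
```

With `keep=False`, only the grape row would survive for this input, because every chilli and mango row has a duplicate.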