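I am trying to find out how many duplicated sentences there are in my DataFrame. The duplicates are exact matches that occur several times. I am using DataFrame.duplicated, but it leaves out the first occurrence of each sentence. Instead of printing every repeated row, I want to print each duplicated sentence once, together with the number of times it occurs.
The code I am trying is: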
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
data=wdata[wdata.duplicated()]
print(data)
#dataframe example
#hi how are you
#hello sam how are you doing
#hello sam how are you doing
#helll Alex how are you doing
#hello sam how are you doing
#let us go eat
#where is the dog
#let us go eat
I want the output to be:
#hello sam how are you doing 3
#let us go eat 2
With duplicated I get this output:
#hello sam how are you doing
#hello sam how are you doing
#let us go eat
This is what I get when I try the second answer:
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
data=wdata.groupby(['sentences']).size().reset_index(name='counts')
# sentences counts
#0 hello Alex how are you doing 1
#1 hello sam how are you doing 3
#2 hi how are you 1
#3 let us go eat 1
#4 let us go eat 1
#5 where is the dog 1
I want the output to be:
#hello sam how are you doing 3
#let us go eat 2
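Answer 0: (score: 2)
Because some of the sentences contain stray whitespace, the solution is to remove it with Series.str.strip and then count the groups with GroupBy.size: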
data=wdata.groupby(wdata['sentences'].str.strip()).size().reset_index(name='counts')
Then filter with boolean indexing to keep only the sentences that occur more than once:
data = data[data['counts'].gt(1)]
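The two steps above can be tried end to end with a small sketch. The sample DataFrame below is an assumption that stands in for the asker's CSV file; the trailing space on the last row is also assumed, added only to reproduce the split "let us go eat" groups shown in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's file.
# The trailing space on the last row is an assumption that mimics
# the stray whitespace the answer refers to.
wdata = pd.DataFrame({'sentences': [
    'hi how are you',
    'hello sam how are you doing',
    'hello sam how are you doing',
    'hello Alex how are you doing',
    'hello sam how are you doing',
    'let us go eat',
    'where is the dog',
    'let us go eat ',
]})

# Group by the stripped sentence so that 'let us go eat' and
# 'let us go eat ' fall into the same group, then count group sizes.
counts = (wdata.groupby(wdata['sentences'].str.strip())
               .size()
               .reset_index(name='counts'))

# Boolean indexing: keep only sentences that occur more than once.
data = counts[counts['counts'].gt(1)]
print(data)
```

Without the str.strip call, the two variants of "let us go eat" would be counted as separate groups of size 1 and both would be filtered out.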
Another idea is to use Series.value_counts, filter the resulting Series, and finally convert it to a two-column DataFrame:
s = wdata['sentences'].str.strip().value_counts()
data = s[s.gt(1)].rename_axis('sentences').reset_index(name='counts')
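A minimal sketch of the value_counts variant, again on assumed sample data (the rows and the trailing space are illustrative, not taken from the asker's actual file):

```python
import pandas as pd

# Hypothetical sample data; the trailing space on the last row is assumed.
wdata = pd.DataFrame({'sentences': [
    'hi how are you',
    'hello sam how are you doing',
    'hello sam how are you doing',
    'hello Alex how are you doing',
    'hello sam how are you doing',
    'let us go eat',
    'where is the dog',
    'let us go eat ',
]})

# Count occurrences of each stripped sentence.
s = wdata['sentences'].str.strip().value_counts()

# Keep counts greater than 1, name the index, and turn the Series
# into a DataFrame with the columns 'sentences' and 'counts'.
data = s[s.gt(1)].rename_axis('sentences').reset_index(name='counts')
print(data)
```

value_counts sorts by frequency in descending order by default, so the most repeated sentence comes first, whereas the groupby version sorts by the sentence text.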