Python: How do I keep all of my data when using .value_counts()?

Asked: 2016-06-08 20:26:57

Tags: python python-3.x pandas

.value_counts() drops the rest of my data. How can I analyze my data without losing the other information? Or is there another way to count that I can use, one that won't drop the rest of my data columns?

Here is my code:

import pandas as pd

# Read both files and give them a common join column.
f1 = pd.read_csv('lastlogonuser.txt', sep='\t', encoding='latin1')
f2 = pd.read_csv('UserAccounts.csv', sep=',', encoding='latin1')
f2 = f2.rename(columns={'Shortname': 'User Name'})
f = pd.concat([f1, f2])

# Count occurrences of each user name and keep the ones that appear once.
counts = f['User Name'].value_counts()
f = counts[counts == 1]
f

When I run my code, I get something like:

sample534         1
sample987         1
sample342         1
sample321         1
sample123         1

I would like something like:

   User Name    Description                    CN Account
1  sample534    Journal Mailbox managed by         
1  sample987    Journal Mailbox managed by    
1  sample342    Journal Mailbox managed by   
1  sample321    Journal Mailbox managed by 
1  sample123    Journal Mailbox managed by 

A sample of the data I am using:

Account User Name User CN                       Description
ENABLED MBJ29     CN=MBJ29,CN=Users             Journal Mailbox managed by  
ENABLED MBJ14     CN=MBJ14,CN=Users             Journal Mailbox managed by
ENABLED MBJ08     CN=MBJ30,CN=Users             Journal Mailbox managed by   
ENABLED MBJ07     CN=MBJ07,CN=Users             Journal Mailbox managed by 

1 Answer:

Answer 0 (score: 1)

You can use DataFrame.duplicated to determine which rows are duplicates, and then filter with loc:

f = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]

The subset argument restricts the duplicate search to the 'User Name' column, and keep=False marks every occurrence of a duplicate (not just the later ones). Since duplicated returns True for duplicated rows, I negate it with ~ to keep only the rows whose user name appears exactly once.
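
To make the behavior concrete, here is a minimal sketch on a small made-up frame (the data below is hypothetical, shaped like the question's columns):

import pandas as pd

# Hypothetical data: 'MBJ29' appears twice, the samples appear once.
f = pd.DataFrame({
    'User Name': ['sample534', 'MBJ29', 'MBJ29', 'sample987'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

# keep=False flags every occurrence of a duplicated 'User Name', so
# negating with ~ keeps only names that occur exactly once -- and the
# full rows survive, unlike value_counts(), which returns only counts.
print(f.loc[~f.duplicated(subset=['User Name'], keep=False), :])
#    User Name                 Description
# 0  sample534  Journal Mailbox managed by
# 3  sample987  Journal Mailbox managed by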

When tested on a fairly large DataFrame with lots of duplicates, this appears to be much more efficient than groupby:

%timeit f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
100 loops, best of 3: 17.4 ms per loop

%timeit f.groupby('User Name').filter(lambda x: len(x) == 1)
1 loop, best of 3: 6.78 s per loop