.value_counts() drops the rest of my data. How can I analyze my data without losing the other information? Or is there another counting approach I can use that won't drop my other data columns?
Here is my code:
import pandas as pd

# Read both sources and align the user-name column so they can be concatenated
f1 = pd.read_csv('lastlogonuser.txt', sep='\t', encoding='latin1')
f2 = pd.read_csv('UserAccounts.csv', sep=',', encoding='latin1')
f2 = f2.rename(columns={'Shortname': 'User Name'})
f = pd.concat([f1, f2])

# value_counts() returns only the counts per user name, discarding every other column
counts = f['User Name'].value_counts()
f = counts[counts == 1]
f
When I run the code, I get something like:
sample534 1
sample987 1
sample342 1
sample321 1
sample123 1
I would like something like:
User Name Description CN Account
1 sample534 Journal Mailbox managed by
1 sample987 Journal Mailbox managed by
1 sample342 Journal Mailbox managed by
1 sample321 Journal Mailbox managed by
1 sample123 Journal Mailbox managed by
A sample of the data I am using:
Account User Name User CN Description
ENABLED MBJ29 CN=MBJ29,CN=Users Journal Mailbox managed by
ENABLED MBJ14 CN=MBJ14,CN=Users Journal Mailbox managed by
ENABLED MBJ08 CN=MBJ30,CN=Users Journal Mailbox managed by
ENABLED MBJ07 CN=MBJ07,CN=Users Journal Mailbox managed by
Answer 0 (score: 1):
You can use DataFrame.duplicated to determine which rows are duplicates, and then filter with loc:
f = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
The subset argument specifies to only look for duplicates in the 'User Name' column. The keep=False argument specifies to mark all duplicates. Since duplicated returns True for duplicates, I negate it with ~.
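For illustration, here is a minimal sketch, assuming a small made-up DataFrame (not the asker's actual files) with the same kind of columns as the sample data; it shows that the filter keeps every column while dropping any user name that occurs more than once:

import pandas as pd

# Hypothetical toy frame standing in for the concatenated f above
f = pd.DataFrame({
    'Account': ['ENABLED'] * 4,
    'User Name': ['MBJ29', 'MBJ14', 'MBJ29', 'MBJ07'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

# Keep only rows whose 'User Name' appears exactly once; all columns survive
unique_users = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
print(unique_users)
# 'MBJ29' appears twice, so both of its rows are dropped; 'MBJ14' and 'MBJ07' remain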
When tested on a fairly large DataFrame with many duplicates, this appears to be more efficient than groupby:
%timeit f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
100 loops, best of 3: 17.4 ms per loop
%timeit f.groupby('User Name').filter(lambda x: len(x) == 1)
1 loop, best of 3: 6.78 s per loop
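As a quick sanity check, a sketch assuming the same toy DataFrame as above (hypothetical, not the asker's data) confirms that both approaches return the same rows:

import pandas as pd

# Hypothetical toy frame to compare the two approaches
f = pd.DataFrame({
    'User Name': ['MBJ29', 'MBJ14', 'MBJ29', 'MBJ07'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

via_duplicated = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
via_groupby = f.groupby('User Name').filter(lambda x: len(x) == 1)

# Both keep only MBJ14 and MBJ07, with every column intact and the original row order
assert via_duplicated.equals(via_groupby)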