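I have a full month of jail census data in the format below, and I want to know how many unique inmates there were during that month. The data is recorded daily, so the same person shows up multiple times.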
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
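My current approach is to group the rows by day and then add any inmates not yet accounted for to a new DataFrame. My problem is how to account for two different people who have identical information. Neither would be added to the new DataFrame, because a matching row already exists. I am trying to work out how many people were in the jail in total over this period.
_id is incremental; for example, here is some data from the second day: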
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
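Link to the dataset: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
Answer 0: (score: 1)
You can use df.drop_duplicates() to return a DataFrame containing only unique rows, and then count the entries.
Something like this should work: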
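import pandas as pd

# Use _id as the index so it is ignored when rows are compared
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=['Date'])
# Drop rows whose remaining columns (Date included) are all identical
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)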
Result:
>> 11845
Pandas drop_duplicates Documentation
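The problem with this approach (and this data) is that there may be many distinct inmates who share the same age, gender, and race, and they will be filtered out as duplicates.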
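To make that limitation concrete, here is a minimal sketch on a small made-up frame. The rows and the names `demo` and `person_cols` are invented for illustration and are not taken from the dataset. It leaves `Date` out of the comparison via `subset`, so the same person seen on two days collapses to one row, but two different inmates with identical attributes collapse as well.

```python
import pandas as pd

# Hypothetical sample rows: _id 1 and 2 are two different people who
# happen to share every attribute; _id 4 and 5 are day-two repeats
demo = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5],
    "Date": ["2016-06-01", "2016-06-01", "2016-06-01",
             "2016-06-02", "2016-06-02"],
    "Gender": ["M", "M", "M", "M", "M"],
    "Race": ["W", "W", "B", "W", "B"],
    "Age at Booking": [32, 32, 25, 32, 25],
    "Current Age": [33, 33, 27, 33, 27],
})

# Compare only the columns that describe a person, not the day
person_cols = ["Gender", "Race", "Age at Booking", "Current Age"]
unique_people = demo.drop_duplicates(subset=person_cols)
n_unique = len(unique_people)
print(n_unique)  # 2, although three people were present on 2016-06-01
```

So a plain de-duplication undercounts whenever look-alike inmates are present on the same day.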
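Answer 1: (score: 1)
I think the trick here is to group by as many columns as possible and then check the day-to-day differences within those (small) groups: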
import pandas as pd

inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count the number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393
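As a sanity check, the same idea can be run on a tiny made-up sample where the answer is known. This is only a sketch: the rows and the names `toy`, `counts`, and `arrivals` are invented for illustration, and it uses `size()` and `clip()` in place of `count()` and the `apply` call above, which should be equivalent here.

```python
import pandas as pd

# Hypothetical two-day sample: two look-alike inmates on day one; on day
# two one of them has left and a new inmate (booked at 40) has arrived
toy = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5],
    "Date": ["2016-06-01", "2016-06-01", "2016-06-01",
             "2016-06-02", "2016-06-02"],
    "Gender": ["M", "M", "M", "M", "M"],
    "Race": ["W", "W", "B", "W", "B"],
    "Age at Booking": [32, 32, 25, 32, 40],
    "Current Age": [33, 33, 27, 33, 41],
})

person_cols = ["Gender", "Race", "Age at Booking", "Current Age"]

# One row per day, one column per attribute combination, values = head count
counts = (toy.groupby(person_cols + ["Date"])
             .size()
             .unstack("Date", fill_value=0)
             .T)

# Day-over-day change; the first day is seeded with its own head counts
arrivals = counts.diff()
arrivals.iloc[0] = counts.iloc[0].to_numpy()

# Keep only increases (arrivals) and add them up
estimate = int(arrivals.clip(lower=0).to_numpy().sum())
print(estimate)  # 4
```

Here the estimate matches the four distinct people in the sample. It will still miss an arrival when someone with identical attributes leaves on the same day, and it will count an inmate twice if their Current Age changes during the month.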