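Pandas - Finding unique entries in daily census data

Time: 2016-09-01 17:22:46

Tags: python pandas dataframe grouping data-cleaning

I have census data like this for an entire month, and I want to find out how many unique inmates there were during that month. The data is recorded daily, so the same inmate appears multiple times.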

_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33

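My current approach is to group the rows by day and then add the ones not yet accounted for to a new DataFrame. My question is how to account for two people who have identical information: they would not both be added to the new DataFrame, because one of them already exists. I am trying to figure out how many people in total were in the jail during this period.

The _id is incremental; for example, here is some data from the second day: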

2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39

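Link to the dataset: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census

2 Answers:

Answer 0: (score: 1)

You can use df.drop_duplicates() to return a DataFrame containing only unique rows, then count the entries.

Something like this should work: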

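import pandas as pd

# _id (column 0) becomes the index, so it is not part of the comparison
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)

# drop_duplicates() compares column values only and keeps the first
# occurrence of each duplicated row
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)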

Result:

>> 11845

Pandas drop_duplicates Documentation

Inmates June 2016 CSV

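The problem with this approach/data is that there may be many distinct inmates who share the same age/gender/race, and they will be filtered out.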
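To make this limitation concrete, here is a small sketch on invented rows (the data and the assumed identities of the people are made up for illustration, not taken from the dataset). Two different inmates with identical attributes collapse into one row per day, while the same inmate seen on several days stays as several rows because `Date` is one of the compared columns.

```python
import pandas as pd

# Hypothetical toy data. Assumption: person A is _id 1, 3 and 5, and
# person B is _id 2 and 4, so the true number of distinct people is 2.
toy = pd.DataFrame(
    {
        "_id": [1, 2, 3, 4, 5],
        "Date": ["2016-06-01", "2016-06-01",
                 "2016-06-02", "2016-06-02",
                 "2016-06-03"],
        "Gender": ["M"] * 5,
        "Race": ["W"] * 5,
        "Age at Booking": [32] * 5,
        "Current Age": [33] * 5,
    }
).set_index("_id")

# All columns (including Date) are compared; the _id index is ignored
with_date = toy.drop_duplicates()

# Leaving Date out of the comparison collapses every row into one
without_date = toy.drop_duplicates(
    subset=["Gender", "Race", "Age at Booking", "Current Age"]
)

print(len(with_date))     # 3, one row per day
print(len(without_date))  # 1, both people merged
```

Neither count matches the two people assumed in the toy data, which is why the attributes alone cannot identify individuals.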

Answer 1: (score: 1)

I think the trick here is to group by as much as possible and check the differences within those (small) groups:

import pandas as pd

inmates = pd.read_csv('inmates.csv')

# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()

# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)

# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()

# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]

# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)

# sum total column
diffed['total'].sum()  # 3393
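The same idea can be checked by hand on a tiny example. The sketch below uses invented rows (not from the dataset) and, as a compact variation on the code above, swaps in `size()` with `unstack(fill_value=0)` for `count()`/`fillna(0)` and `clip(lower=0)` for the `apply` lambda. It is an illustration under the same assumption, namely that a rise in a group's daily count means a new arrival.

```python
import pandas as pd

# Hypothetical toy census covering three days
toy = pd.DataFrame(
    {
        "_id": [1, 2, 3, 4, 5, 6, 7],
        "Date": ["2016-06-01", "2016-06-01",
                 "2016-06-02", "2016-06-02", "2016-06-02",
                 "2016-06-03", "2016-06-03"],
        "Gender": ["M", "M", "M", "M", "F", "M", "F"],
        "Race": ["W", "B", "W", "W", "B", "W", "B"],
        "Age at Booking": [32, 25, 32, 32, 40, 32, 40],
        "Current Age": [33, 27, 33, 33, 41, 33, 41],
    }
)

keys = ["Gender", "Race", "Age at Booking", "Current Age"]

# rows = days, columns = attribute combinations, values = head count
counts = toy.groupby(keys + ["Date"]).size().unstack("Date", fill_value=0).T

# day-to-day change per combination; the first day keeps its raw counts
diffed = counts.diff()
diffed.iloc[0] = counts.iloc[0].to_numpy()

# keep only increases (assumed arrivals) and sum them per day
arrivals = diffed.clip(lower=0).sum(axis=1)
total = int(arrivals.sum())

print(arrivals)
print(total)  # 4
```

Here day one contributes 2 people, day two adds 2 (a second M/W/32/33 and the F/B/40/41 entry), and day three adds none, giving 4. As the comments in the answer already note, a departure and an arrival with identical attributes on the same day cancel out, so the total is an estimate and can undercount.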