Identify and count unique patterns in a pandas dataframe

Date: 2017-08-04 11:59:57

Tags: python python-3.x pandas

You'll find snippets with reproducible input and an example of desired output at the end of the question.

The challenge:

I have a dataframe like this:

[image: the example dataframe]

The dataframe has two columns with patterns of 1 and 0 like this:

[image: one pattern of ones and zeros]

Or this:

[image: another pattern of ones and zeros]

The number of columns will vary, and so will the length of the patterns. However, the only numbers in the dataframe will be 0 or 1.

I would like to identify these patterns, count each occurrence of them, and build a dataframe containing the results. To simplify things, I'd like to focus on the ones and ignore the zeros. The desired output in this particular case would be:

[image: the desired output dataframe]

I'd like the procedure to identify that, as an example, the pattern [1,1,1] occurs two times in column_A, and not at all in column_B. Notice that I've used the sums of the patterns as indexes in the dataframe.

Reproducible input:

import pandas as pd
df = pd.DataFrame({'column_A':[1,1,1,0,0,0,1,0,0,1,1,1],
                   'column_B':[1,1,1,1,1,0,0,0,1,1,0,0]})

colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'), periods=len(df)).tolist()  # pd.Timestamp replaces the deprecated pd.datetime
df['dates'] = datelist
df = df.set_index(['dates'])
print(df)

Desired output:

df2 = pd.DataFrame({'pattern':[5,3,2,1],
               'column_A':[0,2,0,1],
               'column_B':[1,0,1,0]})
df2 = df2.set_index(['pattern'])
print(df2)

My attempts so far:

I've been working on a solution that includes nested for loops where I calculate running sums that are reset each time an observation equals zero. It also includes functions such as df.apply(lambda x: x.value_counts()). But it's messy to say the least, and so far not 100% correct.

Thank you for any other suggestions!
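For reference, the run lengths of consecutive ones can also be counted directly with itertools.groupby. A minimal sketch that reproduces the desired output (the helper name run_lengths is my own, not from the question):

```python
from itertools import groupby

import pandas as pd

df = pd.DataFrame({'column_A': [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1],
                   'column_B': [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})

def run_lengths(col):
    # Length of each consecutive run of ones; runs of zeros are skipped.
    return pd.Series([len(list(g)) for k, g in groupby(col) if k == 1]).value_counts()

df2 = df.apply(run_lengths).fillna(0).astype(int).sort_index(ascending=False)
df2.index.name = 'pattern'
print(df2)
```

This avoids the cumulative-sum trick entirely: groupby splits each column into runs of equal values, and value_counts tallies the lengths of the runs of ones.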

1 个答案:

答案 0 :(得分:2)

Here's my attempt:

def fun(ser):
    ser = ser.dropna()
    ser = ser.diff().fillna(ser)
    return ser.value_counts()


df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
Out: 
     column_A  column_B
1.0       1.0       NaN
2.0       NaN       1.0
3.0       2.0       NaN
5.0       NaN       1.0
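To match the desired df2 exactly, the NaNs can be filled with zeros and the counts and index cast to int, sorted descending. A sketch of this cleanup step (my addition, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'column_A': [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1],
                   'column_B': [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})

def fun(ser):
    ser = ser.dropna()
    ser = ser.diff().fillna(ser)  # turn cumulative sums back into run lengths
    return ser.value_counts()

out = df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
out = out.fillna(0).astype(int).sort_index(ascending=False)  # NaN -> 0, descending
out.index = out.index.astype(int)
out.index.name = 'pattern'
print(out)
```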

The first part (df.cumsum().where((df == 1) & (df != df.shift(-1)))) produces the cumulative sums, kept only at the last position of each run of ones:

            column_A  column_B
dates                         
2017-08-04       NaN       NaN
2017-08-05       NaN       NaN
2017-08-06       3.0       NaN
2017-08-07       NaN       NaN
2017-08-08       NaN       5.0
2017-08-09       NaN       NaN
2017-08-10       4.0       NaN
2017-08-11       NaN       NaN
2017-08-12       NaN       NaN
2017-08-13       NaN       7.0
2017-08-14       NaN       NaN
2017-08-15       7.0       NaN

So if we ignore the NaNs and take the differences, we recover the run lengths. That's what the function does: it drops the NaNs and then takes the differences, so the values are no longer cumulative sums. Finally, it returns the value counts.
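To illustrate the diff step in isolation: for column_A the surviving end-of-run cumulative sums are 3, 4 and 7, and diffing them (with the first NaN filled from the original series) recovers the run lengths 3, 1 and 3:

```python
import pandas as pd

ends = pd.Series([3.0, 4.0, 7.0])        # end-of-run cumulative sums for column_A
lengths = ends.diff().fillna(ends)       # [3.0, 1.0, 3.0]: the run lengths
counts = lengths.value_counts()          # pattern 3 occurs twice, pattern 1 once
print(counts)
```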