This post and this post have gotten me close, but I haven't been able to solve my problem.
I have a df that looks like:
   2017-04-03 2017-04-04 2017-04-05 2017-04-06
id
0         0.0     active        0.0        0.0
1         0.0     active        0.0     active
2         0.0        0.0        0.0        0.0
I want to count the zeros across each row and encode each count into a string, but the count needs to reset whenever a run of zeros is interrupted.
For the above df, the output df would look like:
   2017-04-03 2017-04-04 2017-04-05 2017-04-06
id
0  inactive_1     active inactive_1 inactive_2
1  inactive_1     active inactive_1     active
2  inactive_1 inactive_2 inactive_3 inactive_4
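For reference, a minimal sketch of how the sample frame could be rebuilt to experiment with (the column labels and the id index come from the printout above; the mixed float/string dtypes are an assumption):
import pandas as pd
# rebuild the sample frame shown above (object dtype, since the
# columns mix 0.0 with the string 'active')
df = pd.DataFrame(
    [[0.0, 'active', 0.0, 0.0],
     [0.0, 'active', 0.0, 'active'],
     [0.0, 0.0, 0.0, 0.0]],
    columns=['2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06'])
df.index.name = 'id'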
This function gets me very close, but it doesn't account for resetting the cumsum; it just keeps counting every zero in the row.
def inactive(s):
    return np.where(s == 0, 'inactive_' + s.eq(0).cumsum().astype(str), s)

df.apply(inactive, 1)
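To make the shortfall concrete, this is roughly what the plain cumsum gives for the first row (just an illustration): it keeps counting across the 'active' gap instead of restarting.
import pandas as pd
row = pd.Series([0.0, 'active', 0.0, 0.0])
# every zero seen so far is counted, so the zeros after 'active'
# become 2 and 3 instead of starting over at 1
print(row.eq(0).cumsum().tolist())   # [1, 1, 2, 3]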
Answer 0 (score: 2)
A little roundabout, but this can be done by applying a groupby
operation on each row, and then using np.where
to selectively apply your values to the original.
def f(x):
    return x.groupby(x.ne(x.shift()).cumsum()).cumcount() + 1
i = df.apply(pd.to_numeric, errors='coerce')
j = 'inactive_' + i.apply(f, axis=1).astype(str)
df[:] = np.where(i.ne(0), df.values, j)
df
   2017-04-03 2017-04-04 2017-04-05 2017-04-06
id
0  inactive_1     active inactive_1 inactive_2
1  inactive_1     active inactive_1     active
2  inactive_1 inactive_2 inactive_3 inactive_4
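The grouper inside f is what makes the count restart: x.ne(x.shift()).cumsum() bumps the group id whenever a value differs from its left neighbour, so cumcount starts over in every new run. A rough sketch on row 0 after pd.to_numeric (NaN standing in for 'active'):
import numpy as np
import pandas as pd
x = pd.Series([0.0, np.nan, 0.0, 0.0])               # row 0 after to_numeric
groups = x.ne(x.shift()).cumsum()
print(groups.tolist())                                # [1, 2, 3, 3] - one id per run
print((x.groupby(groups).cumcount() + 1).tolist())    # [1, 1, 1, 2]
The NaN cell also gets a count, but it is discarded by the np.where step, because NaN != 0 is True there and the original 'active' value is kept.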
Answer 1 (score: 1)
You can use:
#convert to numeric, NaN for non-numeric values
df1 = df.apply(pd.to_numeric, errors='coerce')
#count consecutive values with reset
a = df1 == 0
b = a.cumsum(axis=1)
c = b - b.where(~a, axis=1).ffill(axis=1).fillna(0).astype(int)
print (c)
   2017-04-03 2017-04-04 2017-04-05 2017-04-06
id
0           1          0          1          2
1           1          0          1          0
2           1          2          3          4
#replace by mask
df = df.mask(c != 0, 'inactive_' + c.astype(str))
print (df)
   2017-04-03 2017-04-04 2017-04-05 2017-04-06
id
0  inactive_1     active inactive_1 inactive_2
1  inactive_1     active inactive_1     active
2  inactive_1 inactive_2 inactive_3 inactive_4
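In other words, b is a running total of all zeros in the row, and b.where(~a).ffill(axis=1) carries that total forward as it stood just before each run of zeros began; subtracting the two leaves a per-run count that restarts at 1. A small sketch of the intermediates for row 0, written with a single Series so the axis=1 arguments drop out:
import pandas as pd
a = pd.Series([True, False, True, True])    # row 0 of (df1 == 0)
b = a.cumsum()                              # 1, 1, 2, 3
base = b.where(~a).ffill().fillna(0)        # 0, 1, 1, 1  (total before each run)
print((b - base).astype(int).tolist())      # [1, 0, 1, 2]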
Timings:
np.random.seed(425)
df = pd.DataFrame(np.random.choice([0, 'active'], size=(100000, 300)))
In [4]: %timeit (jez(df))
1 loop, best of 3: 1min 40s per loop
In [5]: %timeit col(df)
1 loop, best of 3: 5min 54s per loop
def jez(df):
    df1 = df.apply(pd.to_numeric, errors='coerce')
    #count consecutive values
    a = df1 == 0
    b = a.cumsum(axis=1)
    c = b - b.where(~a, axis=1).ffill(axis=1).fillna(0).astype(int)
    #replace by mask
    return df.mask(c != 0, 'inactive_' + c.astype(str))

def f(x):
    return x.groupby(x.ne(x.shift()).cumsum()).cumcount() + 1

def col(df):
    i = df.apply(pd.to_numeric, errors='coerce')
    j = 'inactive_' + i.apply(f, axis=1).astype(str)
    df[:] = np.where(i.ne(0), df.values, j)
    return df
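If you want to rerun the comparison outside IPython, a rough harness could look like this (timeit with number=1 is only a coarse measurement, and col mutates its input, so each call gets its own copy):
import timeit
import numpy as np
import pandas as pd
np.random.seed(425)
df = pd.DataFrame(np.random.choice([0, 'active'], size=(100000, 300)))
print('jez:', timeit.timeit(lambda: jez(df.copy()), number=1))
print('col:', timeit.timeit(lambda: col(df.copy()), number=1))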
Caveat: performance really depends on the data.