Find the longest run of consecutive zeros for each user in a DataFrame

Time: 2018-07-31 04:53:13

Tags: python binary counting run-length-encoding

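I want to find the longest run of consecutive zeros in a DataFrame, with the result grouped by user. I am interested in using run-length encoding (RLE).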

Sample input:

user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0

Desired output:

user  longest_run
A     2
B     1

mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) #Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$lengths[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 #some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]

This works in R, but I want to do the same thing in Python and I am completely stumped.

4 answers:

Answer 0 (score: 2)

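First use groupby with size, grouping by the user column, the usage column, and a helper Series that labels runs of consecutive values:

print(df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1

df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))

print(df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1

Then keep only the zero rows, take the max per user, and reindex so that users with no zeros are added back with 0:

df2 = (df1.query('val == 0')
          # per-user maximum; older pandas also accepted .max(level=0)
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())

print(df2)
  user  longest_run
0    A            2
1    B            1
2    C            0

Detail (the helper Series increases by one each time the value changes):

print(df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32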
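The same idea can be written more compactly by summing a boolean "is zero" flag per run, which gives 0 for users without any zeros and so avoids the filter-and-reindex step. This is only a sketch of a variant, not the answerer's code; the names is_zero and run_id are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "user": list("AAABBBC"),
    "day": [1, 2, 3, 1, 2, 3, 1],
    "usage": [0, 0, 1, 0, 1, 0, 1],
})

# True where usage is zero; summing it over a run counts the zeros in that run.
is_zero = df["usage"].eq(0)

# Run label: increases by one each time the usage value changes.
run_id = df["usage"].ne(df["usage"].shift()).cumsum()

longest_run = (
    is_zero.groupby([df["user"], run_id]).sum()  # zeros per (user, run)
    .groupby(level=0).max()                      # longest zero run per user
    .rename("longest_run")
    .reset_index()
)
print(longest_run)
```

Because user is part of the grouping key, a run that happens to continue across a user boundary is still split per user.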

Answer 1 (score: 0)

I think the following does what you need. The consecutive_zero function is adapted from the top answer here.

Hope this helps!

import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]], 
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list((len_iter(run) for val, run in groupby(data) if val==0))
    if len(x)==0: return 0 
    else: return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))

Output:

user
A    2
B    1
C    0
dtype: int64

Answer 2 (score: 0)

If your dataset is large and speed matters, you may want to try the high-performance pyrle library.

Setup:

# pip install pyrle
# or 
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
#          User  Number
# 0           0       0
# 1           0       1
# 2           0       1
# 3           0       0
# 4           0       1
# ...       ...     ...
# 9999995     4       1
# 9999996     4       1
# 9999997     4       0
# 9999998     4       0
# 9999999     4       1
# 
# [10000000 rows x 2 columns]

Execution:

for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)


# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
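If installing pyrle is not an option, run-length encoding is also fairly easy to sketch with plain NumPy, much like R's rle. The helper names rle and longest_zero_run below are made up for illustration and are not part of pyrle or NumPy. The sketch returns 0 for input without zeros, a case where calling np.max on an empty selection would raise an error.

```python
import numpy as np

def rle(values):
    """Return (lengths, run_values) for a 1-D sequence, similar to R's rle()."""
    values = np.asarray(values)
    if values.size == 0:
        return np.array([], dtype=int), values
    # Positions where a new run starts (the value differs from the previous one).
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], change))
    # Run lengths are the gaps between consecutive starts (and the end).
    lengths = np.diff(np.concatenate((starts, [values.size])))
    return lengths, values[starts]

def longest_zero_run(values):
    """Length of the longest run of zeros, or 0 if there are none."""
    lengths, run_values = rle(values)
    zero_lengths = lengths[run_values == 0]
    return int(zero_lengths.max()) if zero_lengths.size else 0

print(longest_zero_run([0, 0, 1, 0]))
```

Applied per user, this could be called inside the same df.groupby("User") loop in place of the Rle object.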

Answer 3 (score: 0)

To get the maximum number of consecutive zeros in a Series:

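import pandas as pd

def max0(sr):
    # Each nonzero value starts a new group; the zeros after it share its id.
    counts = (sr != 0).cumsum().value_counts()
    # Every group except id 0 (leading zeros) contains exactly one nonzero value.
    counts[counts.index != 0] -= 1
    return counts.max()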


max0(pd.Series([1,0,0,0,0,2,3]))

4
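To get one value per user, a per-Series function like this can be passed to a groupby. The sketch below is an assumption about how one might wire that up, not part of the answer: longest_zero_run is a hypothetical helper that counts the zeros in each block following a nonzero value, and the rows are sorted by user and day first (as the R code in the question does), since runs depend on row order.

```python
import pandas as pd

def longest_zero_run(sr):
    # Block id: increases at every nonzero value, so zeros share the id
    # of the nonzero value that precedes them (or 0 for leading zeros).
    block = (sr != 0).cumsum()
    # Count the zeros in each block and keep the largest count.
    return int((sr == 0).groupby(block).sum().max())

# Same data as the sample input in the question, deliberately shuffled.
df = pd.DataFrame({
    "user": ["B", "A", "A", "B", "A", "B"],
    "day": [3, 1, 2, 1, 3, 2],
    "usage": [0, 0, 0, 0, 1, 1],
})

ordered = df.sort_values(["user", "day"])
result = ordered.groupby("user")["usage"].agg(longest_zero_run)
print(result)
```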