I have a DataFrame like this:
id_a | date
12 | 2020-01-01
12 | 2020-01-02
13 | 2020-01-01
13 | 2020-01-03
14 | 2020-01-01
14 | 2020-01-02
14 | 2020-01-06
For each id_a group, I would like to compute the difference between the maximum and minimum date, to get something like this:
id_a | date | diff
12 | 2020-01-01 | 1
12 | 2020-01-02 | 1
13 | 2020-01-01 | 2
13 | 2020-01-03 | 2
14 | 2020-01-01 | 5
14 | 2020-01-02 | 5
14 | 2020-01-06 | 5
I am trying to do this with something like:
df['diff'] = df.groupby('id_a').apply(lambda x: max(x['date']) - min(x['date']))
But I am struggling a bit.
Am I on the right track?
Answer 0 (score: 5)
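You want transform rather than apply: transform returns a result aligned with the original rows, whereas apply here returns one value per group. np.ptp (peak to peak, i.e. max minus min) would also do this: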
import numpy as np
import pandas as pd

# convert to datetime; skip this if the column already is datetime
df['date'] = pd.to_datetime(df['date'])
df['date_diff'] = df.groupby('id_a')['date'].transform(np.ptp)
Output:
id_a date date_diff
0 12 2020-01-01 1 days
1 12 2020-01-02 1 days
2 13 2020-01-01 2 days
3 13 2020-01-03 2 days
4 14 2020-01-01 5 days
5 14 2020-01-02 5 days
6 14 2020-01-06 5 days
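The output above holds timedeltas ("1 days"), while the question asks for plain integers. A minimal sketch, assuming the sample data from the question, that uses the built-in 'max' and 'min' aggregations with transform and then extracts whole days with .dt.days:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03",
             "2020-01-01", "2020-01-02", "2020-01-06"],
})
df["date"] = pd.to_datetime(df["date"])

# transform broadcasts each group's max/min back onto that group's rows
grouped = df.groupby("id_a")["date"]
span = grouped.transform("max") - grouped.transform("min")

# .dt.days turns the timedelta column into whole days
df["diff"] = span.dt.days
print(df)
```

Here df['diff'] is an integer column with the same length and index as df.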
Update: if you want to take the max from date_a and the min from date_b:
groups = df.groupby('id_a')
min_dates = groups['date_b'].transform('min')
max_dates = groups['date_a'].transform('max')
df['date_diff'] = max_dates - min_dates
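To make the two-column variant concrete, here is a small self-contained sketch. The values in date_a and date_b are invented for illustration and do not come from the question:

```python
import pandas as pd

# Hypothetical data: two date columns per row (values are made up)
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13],
    "date_a": pd.to_datetime(["2020-01-03", "2020-01-05",
                              "2020-01-02", "2020-01-04"]),
    "date_b": pd.to_datetime(["2020-01-01", "2020-01-02",
                              "2020-01-01", "2020-01-03"]),
})

groups = df.groupby("id_a")
min_dates = groups["date_b"].transform("min")  # earliest date_b per group
max_dates = groups["date_a"].transform("max")  # latest date_a per group
df["date_diff"] = max_dates - min_dates
print(df)
```

Both transforms return Series aligned to the original index, so the subtraction is row by row.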
Answer 1 (score: 3)
We can use map together with groupby, along with a NumPy timedelta (np.timedelta64), to get the difference as a number of days.
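A minimal sketch of one way to combine groupby, map, and np.timedelta64, assuming the sample data from the question. This is one possible reading of the approach, not necessarily the answerer's exact code:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03",
             "2020-01-01", "2020-01-02", "2020-01-06"],
})
df["date"] = pd.to_datetime(df["date"])

# One timedelta per group, indexed by id_a
g = df.groupby("id_a")["date"]
per_group = g.max() - g.min()

# map looks up each row's id_a in per_group; dividing by one day
# converts the timedelta into a float number of days
df["diff"] = df["id_a"].map(per_group) / np.timedelta64(1, "D")
print(df)
```

Dividing by np.timedelta64(1, 'D') yields floats; use .dt.days on the mapped timedeltas instead if integers are preferred.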
Answer 2 (score: 0)
You could try a join, although this may require creating additional DataFrames.
df_min = df.groupby('id_a', as_index=False).agg({'date':'min'})
df_max = df.groupby('id_a', as_index=False).agg({'date':'max'})
df2 = pd.merge(df,df_max,on=["id_a"],how="inner")
df2 = pd.merge(df2,df_min,on=["id_a"],how="inner")
df2.columns = ['id_a','date','max_date','min_date']
df2['diff'] = df2['max_date'] - df2['min_date']
df2.head()
id_a date max_date min_date diff
0 12 2020-01-01 2020-01-02 2020-01-01 1 days
1 12 2020-01-02 2020-01-02 2020-01-01 1 days
2 13 2020-01-01 2020-01-03 2020-01-01 2 days
3 13 2020-01-03 2020-01-03 2020-01-01 2 days
4 14 2020-01-01 2020-01-06 2020-01-01 5 days
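The same join idea can be sketched with a single groupby and one merge by using named aggregation. The column names min_date and max_date follow the answer above; the name bounds is introduced here only for illustration:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03",
             "2020-01-01", "2020-01-02", "2020-01-06"],
})
df["date"] = pd.to_datetime(df["date"])

# One row per id_a holding both bounds (named aggregation)
bounds = df.groupby("id_a", as_index=False).agg(
    min_date=("date", "min"),
    max_date=("date", "max"),
)

# Attach the bounds to every original row, then subtract
df2 = df.merge(bounds, on="id_a", how="left")
df2["diff"] = df2["max_date"] - df2["min_date"]
print(df2)
```

This avoids the manual column renaming, since the aggregated columns are named at creation.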