Pandas DataFrame groupby to calculate date difference

Time: 2020-03-09 17:05:31

Tags: python pandas

I have a DataFrame like this:

id_a | date

12   | 2020-01-01
12   | 2020-01-02
13   | 2020-01-01
13   | 2020-01-03
14   | 2020-01-01
14   | 2020-01-02
14   | 2020-01-06

For each id_a group, I want the difference between the group's maximum and minimum date, to get something like this:

id_a | date       | diff

12   | 2020-01-01 | 1
12   | 2020-01-02 | 1
13   | 2020-01-01 | 2
13   | 2020-01-03 | 2
14   | 2020-01-01 | 5
14   | 2020-01-02 | 5
14   | 2020-01-06 | 5

I am trying to do this with something like:

df['diff'] = df.groupby('id_a').apply(lambda x: max(x['date']) - min(x['date']))

But I am struggling a bit.

Am I on the right track?

3 answers:

Answer 0: (score: 5)

You want transform rather than apply. np.ptp will also do the job:

 # convert to datetime; skip this if the column is already datetime
 df['date'] = pd.to_datetime(df['date'])

 df['date_diff'] = df.groupby('id_a')['date'].transform(np.ptp)

Output:

   id_a       date date_diff
0    12 2020-01-01    1 days
1    12 2020-01-02    1 days
2    13 2020-01-01    2 days
3    13 2020-01-03    2 days
4    14 2020-01-01    5 days
5    14 2020-01-02    5 days
6    14 2020-01-06    5 days
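
To make the difference between the two calls concrete: apply returns one value per group, indexed by id_a, so assigning it to a column aligns on index labels rather than on rows, while transform returns one value per original row. The sketch below is a minimal illustration that assumes the sample data from the question; it uses the built-in 'max' and 'min' aggregations instead of np.ptp and converts the result to whole days with .dt.days so that it matches the integer diff column the question asks for.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02",
        "2020-01-01", "2020-01-03",
        "2020-01-01", "2020-01-02", "2020-01-06",
    ]),
})

dates = df.groupby("id_a")["date"]

# transform broadcasts each group's result back to every row of that group,
# so the result has the same index as df and can be assigned directly
span = dates.transform("max") - dates.transform("min")

# .dt.days turns the Timedelta values into plain integers
df["diff"] = span.dt.days
print(df)
```

If the Timedelta values are preferred over integers, the .dt.days step can simply be dropped.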

Update: if you want to take the max from date_a and the min from date_b:

groups = df.groupby('id_a')
min_dates = groups['date_b'].transform('min')
max_dates = groups['date_a'].transform('max')

df['date_diff'] = max_dates - min_dates

Answer 1: (score: 3)

We can use map together with groupby, along with np.timedelta64, to get the difference as a number of days.
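
The code for this answer is not available here, so what follows is only a guess at the kind of approach it describes, not the author's own code. It assumes the sample data from the question, computes one time span per id_a, divides by np.timedelta64(1, 'D') to get a float number of days, and maps the per-group result back onto the rows through id_a. The names span and span_days are introduced purely for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02",
        "2020-01-01", "2020-01-03",
        "2020-01-01", "2020-01-02", "2020-01-06",
    ]),
})

grouped = df.groupby("id_a")["date"]

# One Timedelta per group, indexed by id_a
span = grouped.max() - grouped.min()

# Dividing by a one-day timedelta64 yields a float number of days
span_days = span / np.timedelta64(1, "D")

# map looks up each row's id_a in the index of span_days
df["diff"] = df["id_a"].map(span_days)
print(df)
```

Unlike .dt.days, the division keeps fractional days, which matters if the column holds timestamps rather than plain dates.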

Answer 2: (score: 0)

You could try a join, although this requires creating additional DataFrames.

df_min = df.groupby('id_a', as_index=False).agg({'date':'min'})
df_max = df.groupby('id_a', as_index=False).agg({'date':'max'})

df2 = pd.merge(df,df_max,on=["id_a"],how="inner")
df2 = pd.merge(df2,df_min,on=["id_a"],how="inner")

df2.columns = ['id_a','date','max_date','min_date']
df2['diff'] = df2['max_date'] - df2['min_date']

df2.head()

   id_a       date   max_date   min_date   diff
0    12 2020-01-01 2020-01-02 2020-01-01 1 days
1    12 2020-01-02 2020-01-02 2020-01-01 1 days
2    13 2020-01-01 2020-01-03 2020-01-01 2 days
3    13 2020-01-03 2020-01-03 2020-01-01 2 days
4    14 2020-01-01 2020-01-06 2020-01-01 5 days
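
The same join idea can probably be done with a single helper DataFrame. The sketch below is a variation rather than the answer's own code: it assumes the sample data from the question and a pandas version that supports named aggregation (0.25 or later), and the name bounds is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id_a": [12, 12, 13, 13, 14, 14, 14],
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02",
        "2020-01-01", "2020-01-03",
        "2020-01-01", "2020-01-02", "2020-01-06",
    ]),
})

# Named aggregation computes both bounds in one pass and names the columns
bounds = df.groupby("id_a", as_index=False).agg(
    min_date=("date", "min"),
    max_date=("date", "max"),
)

# A left merge keeps every row of df and attaches its group's bounds
df2 = df.merge(bounds, on="id_a", how="left")
df2["diff"] = df2["max_date"] - df2["min_date"]
print(df2)
```

Naming the columns inside agg avoids the positional rename of df2.columns, which silently depends on the order in which the merges were applied.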