I want to merge consecutive rows in a Pandas dataframe into a single row. This is the table I currently have:
id_number | document_number | value | log_date | co | delay(days) |
---|---|---|---|---|---|
4100000739 | 106782327 | 0 | 1/13/2017 14:23 | A | |
4100000739 | 106788192 | 1 | 1/13/2017 16:39 | A | 0 |
4100000740 | 106787500 | 0 | 1/13/2017 16:14 | A | |
4100000740 | 106788227 | F | 1/13/2017 16:40 | A | 0 |
4100000743 | 109334630 | N | 2/13/2017 14:22 | B | |
4100000743 | 109358034 | 0 | 2/14/2017 9:24 | B | 0 |
4100000743 | 109358735 | 1 | 2/14/2017 9:37 | B | 0 |
4100000743 | 109334630 | N | 2/13/2017 14:22 | C | |
4100000743 | 109358034 | 0 | 2/14/2017 9:24 | C | 0 |
4100000743 | 109358735 | 1 | 2/14/2017 9:37 | C | 0 |
4100000743 | 109334630 | N | 2/13/2017 14:22 | C | |
4100000743 | 109358034 | 0 | 2/14/2017 9:24 | C | 0 |
4100000743 | 109358735 | 1 | 2/14/2017 9:37 | C | 0 |
4100000743 | 109334630 | N | 2/13/2017 14:22 | D | |
4100000743 | 109358034 | 0 | 2/14/2017 9:24 | D | 0 |
4100000743 | 109358735 | 1 | 2/14/2017 9:37 | D | 0 |
Desired output:
id_number | document_number | value1 | value2 | log_date1 | log_date2 | co | delay(days) |
---|---|---|---|---|---|---|---|
4100000739 | 106782327 | 0 | 1 | 1/13/2017 14:23 | 1/13/2017 16:39 | A | 0 |
4100000740 | 106787500 | 0 | F | 1/13/2017 16:14 | 1/13/2017 16:40 | A | 0 |
4100000743 | 109334630 | N | 0 | 2/13/2017 14:22 | 2/14/2017 9:24 | B | 0 |
4100000743 | 109358034 | 0 | 1 | 2/14/2017 9:24 | 2/14/2017 9:37 | B | 0 |
And so on. Essentially, the delay column in the first table contains the date difference between the log_date on row(X+1) and row(X), so the delay in the second table should contain the value from row(X+1). The merging of rows is done based on id_number and co. I hope this is clear.
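For context, here is a minimal sketch of how such a delay(days) column could be derived from log_date, assuming the M/D/YYYY H:MM timestamps shown above and a dataframe `df` holding the first table:

```python
import pandas as pd

# Parse the string timestamps; %m and %H accept single digits like "1/13" and "9:24".
df["log_date"] = pd.to_datetime(df["log_date"], format="%m/%d/%Y %H:%M")

# A block starts whenever id_number or co changes from the previous row; this
# positional key keeps the two consecutive co=C runs for 4100000743 apart.
block = (df[["id_number", "co"]] != df[["id_number", "co"]].shift()).any(axis=1).cumsum()

# Whole-day difference to the previous row within a block; the first row of a
# block has no predecessor and stays NaN, like the blanks in the table above.
df["delay(days)"] = df.groupby(block)["log_date"].diff().dt.days
```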
Answer 0 (score: 0)
- `groupby()` id_number for the logic
- `concat(axis=1)` the nth and (n+1)th rows
- `rename(columns=...)` to get the column names you want in the output
- `head()` to drop the last row of each group, which has no (n+1)th row

```python
import pandas as pd
import io
# "~" is used as the separator so the timestamps (which contain spaces
# and slashes) survive the round trip through read_csv.
df = pd.read_csv(io.StringIO("""id_number~document_number~value~log_date~co~delay(days)
4100000739~106782327~0~1/13/2017 14:23~A~
4100000739~106788192~1~1/13/2017 16:39~A~0
4100000740~106787500~0~1/13/2017 16:14~A~
4100000740~106788227~F~1/13/2017 16:40~A~0
4100000743~109334630~N~2/13/2017 14:22~B~
4100000743~109358034~0~2/14/2017 9:24~B~0
4100000743~109358735~1~2/14/2017 9:37~B~0
4100000743~109334630~N~2/13/2017 14:22~C~
4100000743~109358034~0~2/14/2017 9:24~C~0
4100000743~109358735~1~2/14/2017 9:37~C~0
4100000743~109334630~N~2/13/2017 14:22~C~
4100000743~109358034~0~2/14/2017 9:24~C~0
4100000743~109358735~1~2/14/2017 9:37~C~0
4100000743~109334630~N~2/13/2017 14:22~D~
4100000743~109358034~0~2/14/2017 9:24~D~0
4100000743~109358735~1~2/14/2017 9:37~D~0"""), sep="~")
def mergerows(df):
    # Pair each row with the next: the original value/log_date become
    # value1/log_date1, the shift(-1)'d ones become value2/log_date2,
    # and delay(days) is taken from the (n+1)th row.
    return pd.concat([df.rename(columns={"value": "value1", "log_date": "log_date1"}).drop(columns="delay(days)"),
                      df.shift(-1).loc[:, ["value", "log_date", "delay(days)"]].rename(columns={"value": "value2", "log_date": "log_date2"})
                      ], axis=1).head(len(df) - 1)  # the last row has no (n+1)th partner
dfc = df.groupby("id_number", as_index=False).apply(mergerows).reset_index(drop=True)
```
 | id_number | document_number | value1 | log_date1 | co | value2 | log_date2 | delay(days) |
---|---|---|---|---|---|---|---|---|
0 | 4100000739 | 106782327 | 0 | 1/13/2017 14:23 | A | 1 | 1/13/2017 16:39 | 0 |
1 | 4100000740 | 106787500 | 0 | 1/13/2017 16:14 | A | F | 1/13/2017 16:40 | 0 |
2 | 4100000743 | 109334630 | N | 2/13/2017 14:22 | B | 0 | 2/14/2017 9:24 | 0 |
3 | 4100000743 | 109358034 | 0 | 2/14/2017 9:24 | B | 1 | 2/14/2017 9:37 | 0 |
4 | 4100000743 | 109358735 | 1 | 2/14/2017 9:37 | B | N | 2/13/2017 14:22 | nan |
5 | 4100000743 | 109334630 | N | 2/13/2017 14:22 | C | 0 | 2/14/2017 9:24 | 0 |
6 | 4100000743 | 109358034 | 0 | 2/14/2017 9:24 | C | 1 | 2/14/2017 9:37 | 0 |
7 | 4100000743 | 109358735 | 1 | 2/14/2017 9:37 | C | N | 2/13/2017 14:22 | nan |
8 | 4100000743 | 109334630 | N | 2/13/2017 14:22 | C | 0 | 2/14/2017 9:24 | 0 |
9 | 4100000743 | 109358034 | 0 | 2/14/2017 9:24 | C | 1 | 2/14/2017 9:37 | 0 |
10 | 4100000743 | 109358735 | 1 | 2/14/2017 9:37 | C | N | 2/13/2017 14:22 | nan |
11 | 4100000743 | 109334630 | N | 2/13/2017 14:22 | D | 0 | 2/14/2017 9:24 | 0 |
12 | 4100000743 | 109358034 | 0 | 2/14/2017 9:24 | D | 1 | 2/14/2017 9:37 | 0 |
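Note that rows 4, 7 and 10 above are wrap-around artifacts: within id_number 4100000743 the shift crosses a co boundary, so value2/log_date2 are taken from the first row of the next co block and delay(days) comes out as nan. Since the question merges on both id_number and co, here is a sketch of two possible fixes (only checked against this sample):

```python
# Option 1: group on both keys so the shift never crosses a co boundary.
# Caveat: co=C occurs in two separate runs for id 4100000743, and groupby
# pools them, so one boundary row still survives inside that group.
dfc = (df.groupby(["id_number", "co"], as_index=False)
         .apply(mergerows)
         .reset_index(drop=True))

# Option 2: keep the single-key groupby and drop the wrap-around rows,
# which are exactly the ones whose delay(days) is missing.
dfc = dfc.dropna(subset=["delay(days)"]).reset_index(drop=True)
```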