Merge consecutive rows in pandas

Date: 2021-04-01 14:19:51

Tags: python pandas

I want to merge consecutive rows of a pandas DataFrame into a single row. This is the table I currently have:

id_number   document_number  value  log_date         co  delay(days)
4100000739  106782327        0      1/13/2017 14:23  A
4100000739  106788192        1      1/13/2017 16:39  A   0
4100000740  106787500        0      1/13/2017 16:14  A
4100000740  106788227        F      1/13/2017 16:40  A   0
4100000743  109334630        N      2/13/2017 14:22  B
4100000743  109358034        0      2/14/2017 9:24   B   0
4100000743  109358735        1      2/14/2017 9:37   B   0
4100000743  109334630        N      2/13/2017 14:22  C
4100000743  109358034        0      2/14/2017 9:24   C   0
4100000743  109358735        1      2/14/2017 9:37   C   0
4100000743  109334630        N      2/13/2017 14:22  C
4100000743  109358034        0      2/14/2017 9:24   C   0
4100000743  109358735        1      2/14/2017 9:37   C   0
4100000743  109334630        N      2/13/2017 14:22  D
4100000743  109358034        0      2/14/2017 9:24   D   0
4100000743  109358735        1      2/14/2017 9:37   D   0

Desired output:

id_number   document_number  value1  value2  log_date1        log_date2        co  delay(days)
4100000739  106782327        0       1       1/13/2017 14:23  1/13/2017 16:39  A   0
4100000740  106787500        0       F       1/13/2017 16:14  1/13/2017 16:40  A   0
4100000743  109334630        N       0       2/13/2017 14:22  2/14/2017 9:24   B   0
4100000743  109358034        0       1       2/14/2017 9:24   2/14/2017 9:37   B   0

And so on. Essentially, the delay column in the first table holds the difference in days between the log_date of row(x+1) and the log_date of row(x), so the delay column in the merged table should carry the value from row(x+1). Rows are merged based on id_number and co. I hope this is clear.
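The delay relationship described above can be computed directly with `groupby().diff()`. A minimal sketch on a two-row slice of the table; the `delay_days` column name and the month/day/year date format are assumptions:

```python
import io

import pandas as pd

# A two-row slice of the table above, just to illustrate the delay logic.
df = pd.read_csv(io.StringIO(
    "id_number,log_date\n"
    "4100000739,1/13/2017 14:23\n"
    "4100000739,1/13/2017 16:39\n"
))
df["log_date"] = pd.to_datetime(df["log_date"], format="%m/%d/%Y %H:%M")

# delay on row x+1 = log_date(x+1) - log_date(x), truncated to whole days;
# the first row of each group has no predecessor, so it stays NaN
df["delay_days"] = df.groupby("id_number")["log_date"].diff().dt.days
```

Here the 2h16m gap within 4100000739 truncates to 0 days, matching the delay column in the question.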

1 Answer:

Answer 0: (score: 0)

  • groupby() on id_number to scope the logic to each group
  • concat(axis=1) row n with row (n+1), obtained via shift(-1)
  • rename(columns=...) to the output column names you want
  • use head() to drop each group's last row, which has no row (n+1)

import pandas as pd
import io

df = pd.read_csv(io.StringIO("""id_number~document_number~value~log_date~co~delay(days)
4100000739~106782327~0~1/13/2017 14:23~A~
4100000739~106788192~1~1/13/2017 16:39~A~0
4100000740~106787500~0~1/13/2017 16:14~A~
4100000740~106788227~F~1/13/2017 16:40~A~0
4100000743~109334630~N~2/13/2017 14:22~B~
4100000743~109358034~0~2/14/2017 9:24~B~0
4100000743~109358735~1~2/14/2017 9:37~B~0
4100000743~109334630~N~2/13/2017 14:22~C~
4100000743~109358034~0~2/14/2017 9:24~C~0
4100000743~109358735~1~2/14/2017 9:37~C~0
4100000743~109334630~N~2/13/2017 14:22~C~
4100000743~109358034~0~2/14/2017 9:24~C~0
4100000743~109358735~1~2/14/2017 9:37~C~0
4100000743~109334630~N~2/13/2017 14:22~D~
4100000743~109358034~0~2/14/2017 9:24~D~0
4100000743~109358735~1~2/14/2017 9:37~D~0"""), sep="~")


def mergerows(df):
    # Pair each row with the row that follows it: the group itself supplies
    # value1/log_date1, and shift(-1) supplies value2/log_date2/delay(days)
    # from the next row. head(len(df)-1) drops the group's last row, which
    # has no successor.
    return pd.concat(
        [df.rename(columns={"value": "value1", "log_date": "log_date1"}).drop(columns="delay(days)"),
         df.shift(-1).loc[:, ["value", "log_date", "delay(days)"]].rename(columns={"value": "value2", "log_date": "log_date2"})],
        axis=1).head(len(df) - 1)


dfc = df.groupby("id_number", as_index=False).apply(mergerows).reset_index(drop=True)

    id_number   document_number  value1  log_date1        co  value2  log_date2        delay(days)
0   4100000739  106782327        0       1/13/2017 14:23  A   1       1/13/2017 16:39  0
1   4100000740  106787500        0       1/13/2017 16:14  A   F       1/13/2017 16:40  0
2   4100000743  109334630        N       2/13/2017 14:22  B   0       2/14/2017 9:24   0
3   4100000743  109358034        0       2/14/2017 9:24   B   1       2/14/2017 9:37   0
4   4100000743  109358735        1       2/14/2017 9:37   B   N       2/13/2017 14:22  nan
5   4100000743  109334630        N       2/13/2017 14:22  C   0       2/14/2017 9:24   0
6   4100000743  109358034        0       2/14/2017 9:24   C   1       2/14/2017 9:37   0
7   4100000743  109358735        1       2/14/2017 9:37   C   N       2/13/2017 14:22  nan
8   4100000743  109334630        N       2/13/2017 14:22  C   0       2/14/2017 9:24   0
9   4100000743  109358034        0       2/14/2017 9:24   C   1       2/14/2017 9:37   0
10  4100000743  109358735        1       2/14/2017 9:37   C   N       2/13/2017 14:22  nan
11  4100000743  109334630        N       2/13/2017 14:22  D   0       2/14/2017 9:24   0
12  4100000743  109358034        0       2/14/2017 9:24   D   1       2/14/2017 9:37   0
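
The nan rows in that output come from pairing the last row of one co block with the first row of the next. Since the question states the merge keys are id_number and co, grouping by both keys removes them. A sketch of the same approach on a subset of the data, under that assumption:

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""id_number~document_number~value~log_date~co~delay(days)
4100000743~109334630~N~2/13/2017 14:22~B~
4100000743~109358034~0~2/14/2017 9:24~B~0
4100000743~109358735~1~2/14/2017 9:37~B~0
4100000743~109334630~N~2/13/2017 14:22~C~
4100000743~109358034~0~2/14/2017 9:24~C~0
4100000743~109358735~1~2/14/2017 9:37~C~0"""), sep="~")


def mergerows(g):
    # Same pairing logic: row n alongside row (n+1), then drop the last row.
    return pd.concat(
        [g.rename(columns={"value": "value1", "log_date": "log_date1"}).drop(columns="delay(days)"),
         g.shift(-1).loc[:, ["value", "log_date", "delay(days)"]].rename(columns={"value": "value2", "log_date": "log_date2"})],
        axis=1).head(len(g) - 1)


# Grouping by both keys means rows are never paired across co blocks,
# so no nan-delay rows appear.
dfc = (df.groupby(["id_number", "co"], as_index=False)
         .apply(mergerows)
         .reset_index(drop=True))
```

Each three-row co block yields two merged rows, so the six input rows above produce four output rows, all with a non-null delay.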