Pandas: add rows to each group until a condition is met

Time: 2020-07-19 21:28:07

Tags: python pandas numpy

I have a time series DataFrame with the following structure:

| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |

*Note that speaker1 and speaker2 can be either 0 or 1; I set them all to 1 here for clarity.

I want to add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).

For each new row, I want to fill the speaker1 and speaker2 columns with 0 while keeping the values in the other columns the same as for that ID.

So the output should be:

| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  A |    4   |     0    |     0    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |
|  C |    3   |     0    |     0    |  name3  |     |
|  C |    4   |     0    |     0    |  name3  |     |

So far I have tried groupby and apply, but found it very slow because this DataFrame has many rows and columns.

Is there a way to do this with numpy?

Thank you very much for your help!

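1 answer:

Answer 0 (score: 0)

Another approach using pandas

  1. Construct a DataFrame that is the Cartesian product of ID and second
  2. Outer-join it back to the original DataFrame
  3. Fill in the missing values according to your specification

No groupby(), no loops.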

import pandas as pd

df = pd.DataFrame({
    "ID": ["A","A","A","B","B","B","B","C","C"],
    "second": ["1","2","3","1","2","3","4","1","2"],
    "speaker1": ["1","1","1","1","1","1","1","1","1"],
    "speaker2": ["1","1","1","1","1","1","1","1","1"],
    "company": ["name1","name1","name1","name2","name2","name2","name2","name3","name3"],
})

# Cartesian product of the unique IDs and seconds (via a dummy key), outer-joined back onto df
df2 = (pd.DataFrame({"ID": df["ID"].unique()}).assign(foo=1)
       .merge(pd.DataFrame({"second": df["second"].unique()}).assign(foo=1))
       .drop(columns="foo")
       .merge(df, on=["ID", "second"], how="outer"))

# Rows come out ordered by ID, and each ID's first row is an original one, so ffill copies its company down
df2["company"] = df2["company"].ffill()
df2 = df2.fillna(0)
print(df2)

Output

    ID second speaker1 speaker2 company
0    A      1        1        1   name1
1    A      2        1        1   name1
2    A      3        1        1   name1
3    A      4        0        0   name1
4    B      1        1        1   name2
5    B      2        1        1   name2
6    B      3        1        1   name2
7    B      4        1        1   name2
8    C      1        1        1   name3
9    C      2        1        1   name3
10   C      3        0        0   name3
11   C      4        0        0   name3
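
Since the question asks about numpy specifically, here is a rough sketch of how the padding could be done with index arithmetic instead of a join. It is not part of the answer above. The function name `pad_groups` and its parameters are made up for illustration, and it assumes that `second` is an integer counter running 1, 2, 3, ... within each ID and that the other columns are constant per ID.

```python
import numpy as np
import pandas as pd

def pad_groups(df, key="ID", counter="second", zero_cols=("speaker1", "speaker2")):
    """Hypothetical helper: pad every group to the size of the largest group."""
    # Sort so each ID's rows are contiguous and in counter order.
    df = df.sort_values([key, counter]).reset_index(drop=True)
    # Position of the first row and the row count of every ID.
    _, first_idx, counts = np.unique(
        df[key].to_numpy(), return_index=True, return_counts=True
    )
    n_max = counts.max()
    # Slot number 0..n_max-1 inside each padded group.
    pos = np.tile(np.arange(n_max), len(counts))
    # True where the slot corresponds to an original row.
    is_real = pos < np.repeat(counts, n_max)
    # Source row per output row: padding slots copy the group's first row,
    # real slots map back to the original rows in order.
    src = np.repeat(first_idx, n_max)
    src[is_real] = np.arange(len(df))
    out = df.iloc[src].reset_index(drop=True)
    # Assumption: the counter simply continues 1..n_max in the padded rows.
    out.loc[~is_real, counter] = pos[~is_real] + 1
    out.loc[~is_real, list(zero_cols)] = 0
    return out

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": [1] * 9,
    "speaker2": [1] * 9,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})
padded = pad_groups(df)
print(padded)
```

Unlike the Cartesian-product join, this pads by row count rather than by the distinct values of `second`, so the two approaches only agree when the counter has no gaps.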