pandas groupby does not recognize a numeric column

Time: 2015-11-07 21:45:58

Tags: python pandas dataframe

I have Excel data that I read in with pd.read_excel:

Block   Concentration       Name            Replicate
  1                      Array Marker   
  1                      Array Marker   
  1       100.0        Man5GlcNAc2  
  1       33.0         Man5GlcNAc2  
  1       10.0         Man5GlcNAc2  
  1       100.0        Man6GlcNAc2  
  1       33.0         Man6GlcNAc2  
  1        10.0        Man6GlcNAc2  
  1        100.0      Man7GlcNAc2 D1    
  1        33.0       Man7GlcNAc2 D1    
  1        10.0       Man7GlcNAc2 D1    
  1        100.0     Man7GlcNAc2 D3 
  1         33.0    Man7GlcNAc2 D3  
  1         10.0    Man7GlcNAc2 D3  
...
...
  2        100.0    Man8GlcNAc2 D1D3    
  2         33.0    Man8GlcNAc2 D1D3    
  2         10.0    Man8GlcNAc2 D1D3    
  2         100.0   Man9GlcNAc2 
  2        33.0     Man9GlcNAc2 
  2        10.0     Man9GlcNAc2 
...

The desired output is:

Block   Concentration       Name            Replicate
  1                      Array Marker         1
  1                      Array Marker         2
  1       100.0        Man5GlcNAc2            1
  1       33.0         Man5GlcNAc2            2
  1       10.0         Man5GlcNAc2            3
  1       100.0        Man6GlcNAc2            1
  1       33.0         Man6GlcNAc2            2
  1        10.0        Man6GlcNAc2            3
  1        100.0      Man7GlcNAc2 D1          1
  1        33.0       Man7GlcNAc2 D1          2
  1        10.0       Man7GlcNAc2 D1          3
  1        100.0     Man7GlcNAc2 D3           1
  1         33.0    Man7GlcNAc2 D3            2
  1         10.0    Man7GlcNAc2 D3            3
...
...
  2        100.0    Man8GlcNAc2 D1D3          1
  2         33.0    Man8GlcNAc2 D1D3          2
  2         10.0    Man8GlcNAc2 D1D3          3
  2         100.0   Man9GlcNAc2               1
  2        33.0     Man9GlcNAc2               2
  2        10.0     Man9GlcNAc2               3
...

My code is:

data["Replicate"] = data.groupby(["Block", "Name", "Concentration"]).cumcount()+1 

I think this makes sense, but the output I get is not the desired output. Instead I get the following:

Block   Concentration       Name            Replicate
  1                      Array Marker         1
  1                      Array Marker         2
  1       100.0        Man5GlcNAc2            1
  1       33.0         Man5GlcNAc2            1
  1       10.0         Man5GlcNAc2            1
  1       100.0        Man6GlcNAc2            1
  1       33.0         Man6GlcNAc2            1
  1        10.0        Man6GlcNAc2            1
  1        100.0      Man7GlcNAc2 D1          1
  1        33.0       Man7GlcNAc2 D1          1
  1        10.0       Man7GlcNAc2 D1          1
  1        100.0     Man7GlcNAc2 D3           1
  1         33.0    Man7GlcNAc2 D3            1
  1         10.0    Man7GlcNAc2 D3            1
...
...
  1        100.0    Man8GlcNAc2 D1D3          1
  1         33.0    Man8GlcNAc2 D1D3          1
  1         10.0    Man8GlcNAc2 D1D3          1
  1         100.0   Man9GlcNAc2               1
  1        33.0     Man9GlcNAc2               1
  1        10.0     Man9GlcNAc2               1
...
  1         100.0   Man5GlcNAc2               2
  1        33.0     Man5GlcNAc2               2
  1        10.0     Man5GlcNAc2               2
 ....

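The Replicate column is '1' until later rows, and I don't know how it chooses which rows to assign the numbers to. There are 3 rows for each identical Block-Name combination, so I need to assign '1, 2, 3' to separate them for later, when I use a pivot table. I have cast the 'Concentration' column to string type, so the numbers should not be the problem.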
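
One way to see how groupby is assigning rows to groups is to look at the group sizes for each choice of keys. The sketch below uses a small made-up frame that only imitates the layout above (it is not the actual Excel data, and the variable names `sizes_three_keys` and `sizes_two_keys` are illustrative). With all three columns as keys, every key combination in this sample occurs once, so cumcount()+1 can only return 1.

```python
import pandas as pd

# Hypothetical miniature of the data in the question (values are illustrative).
data = pd.DataFrame({
    "Block": [1, 1, 1, 1, 1, 1],
    "Concentration": ["100.0", "33.0", "10.0", "100.0", "33.0", "10.0"],
    "Name": ["Man5GlcNAc2"] * 3 + ["Man6GlcNAc2"] * 3,
})

# Number of rows that share each (Block, Name, Concentration) key.
sizes_three_keys = data.groupby(["Block", "Name", "Concentration"]).size()

# Number of rows that share each (Block, Name) key.
sizes_two_keys = data.groupby(["Block", "Name"]).size()

print(sizes_three_keys)
print(sizes_two_keys)
```

If the three-key sizes are all 1 for a stretch of rows, the numbering within that stretch cannot go past 1, whatever the dtype of Concentration is.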

2 Answers:

Answer 0: (score: 0)

If you remove "Concentration" from the groupby keys, you get the expected output.

data["Replicate"] = data.groupby(["Block", "Name"]).cumcount()+1
>>> data

    Block Concentration             Name  Replicate
0       1            ''     Array.Marker          1
1       1            ''     Array.Marker          2
2       1         100.0      Man5GlcNAc2          1
3       1          33.0      Man5GlcNAc2          2
4       1          10.0      Man5GlcNAc2          3
5       1         100.0      Man6GlcNAc2          1
6       1          33.0      Man6GlcNAc2          2
7       1          10.0      Man6GlcNAc2          3
8       1         100.0    Man7GlcNAc2D1          1
9       1          33.0    Man7GlcNAc2D1          2
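
Since the question mentions a pivot table as the next step, here is a minimal sketch of how the Replicate numbers from the two-key groupby could be used as pivot columns. The `Signal` column and its values are hypothetical (the question does not show which measurement is pivoted), and the frame is a made-up miniature of the data.

```python
import pandas as pd

# Hypothetical miniature of the data; "Signal" is an invented measurement column.
data = pd.DataFrame({
    "Block": [1] * 6,
    "Concentration": [100.0, 33.0, 10.0] * 2,
    "Name": ["Man5GlcNAc2"] * 3 + ["Man6GlcNAc2"] * 3,
    "Signal": [5.0, 3.0, 1.0, 6.0, 4.0, 2.0],
})

# Number the rows within each (Block, Name) group, starting at 1.
data["Replicate"] = data.groupby(["Block", "Name"]).cumcount() + 1

# One row per (Block, Name), one column per replicate number.
wide = data.pivot_table(index=["Block", "Name"], columns="Replicate", values="Signal")
print(wide)
```

Because each (Block, Name, Replicate) combination occurs once here, the default aggregation of pivot_table leaves the values unchanged.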

Answer 1: (score: 0)

Instead of cumcount()+1, you can use the rolling count function with a moving window=3.

The format of your data is strange. If there is no problem with the replicate data, you can fix it by converting the column Concentration to float and stripping the whitespace from the beginning and end of the text in the column Name.

#convert column Concentration to float
data['Concentration'] = data['Concentration'].astype(float)
#strip first and last whitespaces
data['Name'] = data['Name'].str.strip()

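#groupby and set rolling count from column Block
#(the original answer called pd.rolling_count, which current pandas releases no longer provide;
# the .rolling(...).count() form below is the closest current equivalent)
data["Replicate"] = data.groupby(["Block", "Name"])["Block"].transform(lambda s: s.rolling(window=3, min_periods=1).count())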
    Block Concentration            Name  Replicate
0       1                  Array Marker          1
1       1                  Array Marker          2
2       1           100     Man5GlcNAc2          1
3       1            33     Man5GlcNAc2          2
4       1            10     Man5GlcNAc2          3
5       1           100     Man6GlcNAc2          1
6       1            33     Man6GlcNAc2          2
7       1            10     Man6GlcNAc2          3
8       1           100  Man7GlcNAc2 D1          1
9       1            33  Man7GlcNAc2 D1          2
10      1            10  Man7GlcNAc2 D1          3
11      1           100  Man7GlcNAc2 D3          1
12      1            33  Man7GlcNAc2 D3          2
13      1            10  Man7GlcNAc2 D3          3
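
For comparison, here is a self-contained sketch of the cleaning steps followed by both numbering approaches, written against the current `.rolling(...)` API. The sample frame is made up, and it assumes the blank Concentration cells are empty strings, which is why `pd.to_numeric(..., errors="coerce")` is used here in place of `astype(float)` (the latter raises on an empty string). The column names `ByCumcount` and `ByRolling` are illustrative only.

```python
import pandas as pd

# Hypothetical sample: blank concentrations, stray whitespace in Name,
# and one (Block, Name) group with four rows instead of three.
data = pd.DataFrame({
    "Block": [1, 1, 1, 1, 1, 1],
    "Concentration": ["", "", "100.0", "33.0", "10.0", "100.0"],
    "Name": ["Array Marker ", " Array Marker", "Man5GlcNAc2 ",
             "Man5GlcNAc2", " Man5GlcNAc2", "Man5GlcNAc2"],
})

# Empty strings become NaN instead of raising an error.
data["Concentration"] = pd.to_numeric(data["Concentration"], errors="coerce")
# Remove leading and trailing whitespace so equal names compare equal.
data["Name"] = data["Name"].str.strip()

keys = ["Block", "Name"]

# Running position within each group: keeps counting past 3.
data["ByCumcount"] = data.groupby(keys).cumcount() + 1

# Rolling count with window=3: never exceeds the window size.
data["ByRolling"] = (
    data.groupby(keys)["Block"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).count())
    .astype(int)
)
print(data)
```

The two columns agree for groups of up to three rows. For the four-row group the rolling count stays at 3 on the last row while cumcount()+1 gives 4, so the rolling variant only matches the desired numbering when every group has at most three rows.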