Pandas groupby: add rows with cross-references

Date: 2014-05-02 20:23:04

Tags: python pandas

Given a simple example of a DataFrame like this:

sample  chrom   start   stop    count   psi5
sampleA chr1    100     200     75      0.75
sampleA chr1    100     250     25      0.25
sampleB chr1    100     200     50      1.0
sampleC chr1    100     250     50      1.0
sampleD chr1    100     300     1       NaN
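
To experiment with the example, the frame above can be rebuilt in code. This is a minimal sketch: the variable name `df` is an assumption, and the values are copied from the table.

```python
import numpy as np
import pandas as pd

# Values copied from the table above; `df` is just an illustrative name.
df = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom':  ['chr1'] * 5,
    'start':  [100] * 5,
    'stop':   [200, 250, 200, 250, 300],
    'count':  [75, 25, 50, 50, 1],
    'psi5':   [0.75, 0.25, 1.0, 1.0, np.nan],
})
print(df)
```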

How can I add rows for every sample that lacks an observation for each unique value of column 3 (0-based, i.e. stop)?

sampleA chr1    100 200 75  0.75
sampleA chr1    100 250 25  0.25
sampleB chr1    100 200 50  1.0
sampleC chr1    100 250 50  1.0
sampleD chr1    100 300 1   NaN
sampleA chr1    100 300 0   0
sampleB chr1    100 250 0   0
sampleB chr1    100 300 0   0
sampleC chr1    100 200 0   0
sampleC chr1    100 300 0   0
sampleD chr1    100 200 NaN NaN
sampleD chr1    100 250 NaN NaN

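So sampleA has no observation for column 3 = 300, and we therefore add a row with zeros in columns 4 and 5. sampleD, however, has only count = 1, so it did not pass the criterion and its psi5 value is NaN. For sampleD I could either skip the extra rows (I will probably build a pivot table from here and fill the empty cells with NaN) or add rows with NaN.

This code does what I want on a small example: https://gist.github.com/olgabot/1b4234c28b245e52bfc0

But it is not well vectorized.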

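2 Answers:

Answer 0: (score: 2)

I would probably use stack/unstack to do this in a vectorized way. The NaN for sampleD is tricky, because I need the NaNs produced by unstacking to fill out the stop column. One way around it is to disregard sampleD's NaN at the start and write the NaNs for sampleD back in at the end (which is what I do):

All at once: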

import numpy as np

df = df.set_index(['sample','chrom','start','stop'])
df = df.unstack(['sample','chrom','start']).fillna(0)
df = df.stack(['sample','chrom','start']).reset_index()
# df.sample is a DataFrame method in current pandas, so select the column with brackets
df.loc[df['sample'] == 'sampleD',['count','psi5']] = np.nan
print(df)

   stop   sample chrom  start  count  psi5
0    200  sampleA  chr1    100     75  0.75
1    200  sampleB  chr1    100     50  1.00
2    200  sampleC  chr1    100      0  0.00
3    200  sampleD  chr1    100    NaN   NaN
4    250  sampleA  chr1    100     25  0.25
5    250  sampleB  chr1    100      0  0.00
6    250  sampleC  chr1    100     50  1.00
7    250  sampleD  chr1    100    NaN   NaN
8    300  sampleA  chr1    100      0  0.00
9    300  sampleB  chr1    100      0  0.00
10   300  sampleC  chr1    100      0  0.00
11   300  sampleD  chr1    100    NaN   NaN

Step by step

1) Set ['sample','chrom','start','stop'] as the index:

df = df.set_index(['sample','chrom','start','stop'])
print(df)

                          count  psi5
sample  chrom start stop             
sampleA chr1  100   200      75  0.75
                    250      25  0.25
sampleB chr1  100   200      50  1.00
sampleC chr1  100   250      50  1.00
sampleD chr1  100   300       1   NaN

2) Unstack every index level except stop, then fill the missing values created by unstack with zeros:

df = df.unstack(['sample','chrom','start'])
print(df)

          count                                psi5                           
sample  sampleA  sampleB  sampleC  sampleD  sampleA  sampleB  sampleC  sampleD
chrom      chr1     chr1     chr1     chr1     chr1     chr1     chr1     chr1
start       100      100      100      100      100      100      100      100
stop                                                                          
200          75       50      NaN      NaN     0.75        1      NaN      NaN
250          25      NaN       50      NaN     0.25      NaN        1      NaN
300         NaN      NaN      NaN        1      NaN      NaN      NaN      NaN

df = df.fillna(0)
print(df)

          count                                psi5                           
sample  sampleA  sampleB  sampleC  sampleD  sampleA  sampleB  sampleC  sampleD
chrom      chr1     chr1     chr1     chr1     chr1     chr1     chr1     chr1
start       100      100      100      100      100      100      100      100
stop                                                                          
200          75       50        0        0     0.75        1        0        0
250          25        0       50        0     0.25        0        1        0
300           0        0        0        1     0.00        0        0        0
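
Note that fillna(0) cannot tell the NaNs created by unstack apart from sampleD's original psi5 NaN at stop = 300, which is why step 4 is needed. A small sketch of that effect, rebuilding the example frame; the names `wide` and `cell` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom':  ['chr1'] * 5,
    'start':  [100] * 5,
    'stop':   [200, 250, 200, 250, 300],
    'count':  [75, 25, 50, 50, 1],
    'psi5':   [0.75, 0.25, 1.0, 1.0, np.nan],
})

wide = df.set_index(['sample', 'chrom', 'start', 'stop']).unstack(['sample', 'chrom', 'start'])

# Column key for sampleD's psi5; the row label is the stop value.
cell = ('psi5', 'sampleD', 'chr1', 100)
before = wide.loc[300, cell]            # genuine NaN from the input data
after = wide.fillna(0).loc[300, cell]   # overwritten together with the unstack NaNs
print(before, after)
```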

3) Now stack back into the original panel (long) format, but now every group has the stop values 200, 250 and 300:

df = df.stack(['sample','chrom','start']).reset_index()
print(df)

    stop   sample chrom  start  count  psi5
0    200  sampleA  chr1    100     75  0.75
1    200  sampleB  chr1    100     50  1.00
2    200  sampleC  chr1    100      0  0.00
3    200  sampleD  chr1    100      0  0.00
4    250  sampleA  chr1    100     25  0.25
5    250  sampleB  chr1    100      0  0.00
6    250  sampleC  chr1    100     50  1.00
7    250  sampleD  chr1    100      0  0.00
8    300  sampleA  chr1    100      0  0.00
9    300  sampleB  chr1    100      0  0.00
10   300  sampleC  chr1    100      0  0.00
11   300  sampleD  chr1    100      1  0.00

4) Set the sampleD values to NaN:

df.loc[df['sample'] == 'sampleD',['count','psi5']] = np.nan
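
Step 4 hard-codes 'sampleD'. If the samples that failed the criterion are not known in advance, one option is to infer them from the input. The sketch below wraps the four steps in a function and assumes (this is not stated in the thread) that a failed sample is one with a NaN psi5 in the original frame; `complete_stops` and `failed` are hypothetical names.

```python
import numpy as np
import pandas as pd

def complete_stops(df):
    """Give every (sample, chrom, start) group one row per observed stop value."""
    keys = ['sample', 'chrom', 'start']
    # Assumption: a sample "failed" if any of its psi5 values is NaN in the input.
    failed = df.loc[df['psi5'].isna(), 'sample'].unique()
    # Steps 1-3: index, unstack, fill the gaps with zeros, stack back.
    wide = df.set_index(keys + ['stop']).unstack(keys).fillna(0)
    out = wide.stack(keys).reset_index()
    # Step 4: blank out both value columns for the failed samples.
    out.loc[out['sample'].isin(failed), ['count', 'psi5']] = np.nan
    return out

df = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom':  ['chr1'] * 5,
    'start':  [100] * 5,
    'stop':   [200, 250, 200, 250, 300],
    'count':  [75, 25, 50, 50, 1],
    'psi5':   [0.75, 0.25, 1.0, 1.0, np.nan],
})
result = complete_stops(df)
print(result)
```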

Answer 1: (score: 0)

Assuming you are using pandas and your data is in a .csv file:

#!/usr/bin/env python

import pandas as pd

cols = ['sample', 'chrom', 'start', 'stop', 'count', 'psi5']    #  Column names
data = pd.read_csv('./stuff.txt', header=None, names=cols, sep=',', index_col=False)    #  Import data (file has no header row)

new_rows = []
for letters in data['sample'].unique():    #  data.sample is a DataFrame method, so use brackets
    missing = {200, 250, 300} - set(data.loc[data['sample'] == letters, 'stop'])
    for m in sorted(missing):
        new_rows.append(dict(zip(cols, [letters, 'chr1', 100, m, 0, 0])))

# DataFrame.append was removed in pandas 2.0; concatenate the new rows instead
data = pd.concat([data, pd.DataFrame(new_rows, columns=cols)], ignore_index=True)

data.head(12)
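
The loop above hard-codes the stop values {200, 250, 300} as well as chr1 and 100. As an alternative sketch (a different technique from this answer: a cross join followed by a left merge), those values can be taken from the data itself. Like the loop, it fills the added rows with zeros for every sample and leaves existing values, including sampleD's NaN, untouched. `fill_missing_stops` is a hypothetical name, and `how='cross'` requires pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

def fill_missing_stops(data):
    """Add zero-filled rows so every (sample, chrom, start) group has every stop."""
    keys = ['sample', 'chrom', 'start']
    groups = data[keys].drop_duplicates()
    stops = pd.DataFrame({'stop': np.sort(data['stop'].unique())})
    # Cross join: every group paired with every observed stop value.
    full = groups.merge(stops, how='cross')
    # Left-join the observations; _merge marks the combinations that had none.
    out = full.merge(data, on=keys + ['stop'], how='left', indicator=True)
    added = out['_merge'] == 'left_only'
    out.loc[added, ['count', 'psi5']] = 0
    return out.drop(columns='_merge')

data = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom':  ['chr1'] * 5,
    'start':  [100] * 5,
    'stop':   [200, 250, 200, 250, 300],
    'count':  [75, 25, 50, 50, 1],
    'psi5':   [0.75, 0.25, 1.0, 1.0, np.nan],
})
result = fill_missing_stops(data)
print(result)
```

Because the added rows are identified by the merge indicator rather than by a blanket fillna, values that were already NaN in the input stay NaN.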
