Given a simple example of a dataframe like this:
sample chrom start stop count psi5
sampleA chr1 100 200 75 0.75
sampleA chr1 100 250 25 0.25
sampleB chr1 100 200 50 1.0
sampleC chr1 100 250 50 1.0
sampleD chr1 100 300 1 NaN
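How can I add rows for each sample that lacks an observation at some unique value of column 3 (counting from 0, i.e. `stop`), like this?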
sampleA chr1 100 200 75 0.75
sampleA chr1 100 250 25 0.25
sampleB chr1 100 200 50 1.0
sampleC chr1 100 250 50 1.0
sampleD chr1 100 300 1 NaN
sampleA chr1 100 300 0 0
sampleB chr1 100 250 0 0
sampleB chr1 100 300 0 0
sampleC chr1 100 200 0 0
sampleC chr1 100 300 0 0
sampleD chr1 100 200 NaN NaN
sampleD chr1 100 250 NaN NaN
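So `sampleA` had no observation for column 3 = `300`, and we added a row with zeros in columns 4 and 5. `sampleD`, however, only has a `count` of `1`, so it didn't pass the criteria and its `psi5` value is NaN. For that sample I can go either way: skip it, since I may build a pivot table from here and fill the empty cells with NaN, or add rows with NaN.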
This code does what I want on a small example: https://gist.github.com/olgabot/1b4234c28b245e52bfc0
But it isn't well vectorized.
Answer 0 (score: 2)
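I would probably use `stack` and `unstack` to do this in a vectorized way. The NaNs for sampleD are tricky, because the NaNs that unstacking introduces are exactly what I fill in to complete the stop column, so sampleD's own NaN gets filled along with them. But you can ignore sampleD's NaN at the start and set sampleD back to NaN at the end (which is what I do):
All at once: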
import numpy as np

df = df.set_index(['sample','chrom','start','stop'])
df = df.unstack(['sample','chrom','start']).fillna(0)
df = df.stack(['sample','chrom','start']).reset_index()
# df.sample is the DataFrame.sample method, so select the column with df['sample']
df.loc[df['sample'] == 'sampleD',['count','psi5']] = np.nan
print(df)
stop sample chrom start count psi5
0 200 sampleA chr1 100 75 0.75
1 200 sampleB chr1 100 50 1.00
2 200 sampleC chr1 100 0 0.00
3 200 sampleD chr1 100 NaN NaN
4 250 sampleA chr1 100 25 0.25
5 250 sampleB chr1 100 0 0.00
6 250 sampleC chr1 100 50 1.00
7 250 sampleD chr1 100 NaN NaN
8 300 sampleA chr1 100 0 0.00
9 300 sampleB chr1 100 0 0.00
10 300 sampleC chr1 100 0 0.00
11 300 sampleD chr1 100 NaN NaN
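To try this end to end, here is a minimal self-contained sketch of the same pipeline. It assumes Python 3 and a reasonably recent pandas; the DataFrame is rebuilt by hand from the question's table, and the names `keys`, `wide` and `filled` are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

# Example data copied by hand from the question's table
df = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom': ['chr1'] * 5,
    'start': [100] * 5,
    'stop': [200, 250, 200, 250, 300],
    'count': [75, 25, 50, 50, 1],
    'psi5': [0.75, 0.25, 1.0, 1.0, np.nan],
})

keys = ['sample', 'chrom', 'start']  # illustrative name for the grouping columns

# Pivot the grouping levels into the columns, leaving one row per stop value;
# combinations that were never observed show up as NaN and are filled with 0
wide = df.set_index(keys + ['stop']).unstack(keys).fillna(0)

# Move the grouping levels back into the rows to return to the long format
filled = wide.stack(keys).reset_index()

# sampleD was hard-coded in the answer as the sample that should stay NaN
filled.loc[filled['sample'] == 'sampleD', ['count', 'psi5']] = np.nan
print(filled)
```

Note that `count` comes back as a float column, because unstacking introduces NaN before the fill.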
Step by step:
1) Set ['sample','chrom','start','stop'] as the index:
df = df.set_index(['sample','chrom','start','stop'])
print(df)
count psi5
sample chrom start stop
sampleA chr1 100 200 75 0.75
250 25 0.25
sampleB chr1 100 200 50 1.00
sampleC chr1 100 250 50 1.00
sampleD chr1 100 300 1 NaN
2) Unstack every index level except stop, then fill the missing values that unstack creates with zeros:
df = df.unstack(['sample','chrom','start'])
print(df)
count psi5
sample sampleA sampleB sampleC sampleD sampleA sampleB sampleC sampleD
chrom chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1
start 100 100 100 100 100 100 100 100
stop
200 75 50 NaN NaN 0.75 1 NaN NaN
250 25 NaN 50 NaN 0.25 NaN 1 NaN
300 NaN NaN NaN 1 NaN NaN NaN NaN
df = df.fillna(0)
print(df)
count psi5
sample sampleA sampleB sampleC sampleD sampleA sampleB sampleC sampleD
chrom chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1
start 100 100 100 100 100 100 100 100
stop
200 75 50 0 0 0.75 1 0 0
250 25 0 50 0 0.25 0 1 0
300 0 0 0 1 0.00 0 0 0
3) Now stack back into the original long (panel-style) format; each group now has rows for stop values 200, 250 and 300:
df = df.stack(['sample','chrom','start']).reset_index()
print(df)
stop sample chrom start count psi5
0 200 sampleA chr1 100 75 0.75
1 200 sampleB chr1 100 50 1.00
2 200 sampleC chr1 100 0 0.00
3 200 sampleD chr1 100 0 0.00
4 250 sampleA chr1 100 25 0.25
5 250 sampleB chr1 100 0 0.00
6 250 sampleC chr1 100 50 1.00
7 250 sampleD chr1 100 0 0.00
8 300 sampleA chr1 100 0 0.00
9 300 sampleB chr1 100 0 0.00
10 300 sampleC chr1 100 0 0.00
11 300 sampleD chr1 100 1 0.00
4) Set sampleD back to NaN:
df.loc[df['sample'] == 'sampleD',['count','psi5']] = np.nan
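If hard-coding 'sampleD' is undesirable, one possible variation is sketched below. It swaps in a different technique, `reindex` against a full `MultiIndex`, rather than stack/unstack, and it assumes that a sample "fails the criteria" whenever any of its original `psi5` values is NaN. That rule is a guess, since the question does not spell out the criteria. The names `full_index`, `result` and `failed` are illustrative.

```python
import numpy as np
import pandas as pd

# Example data copied by hand from the question's table
df = pd.DataFrame({
    'sample': ['sampleA', 'sampleA', 'sampleB', 'sampleC', 'sampleD'],
    'chrom': ['chr1'] * 5,
    'start': [100] * 5,
    'stop': [200, 250, 200, 250, 300],
    'count': [75, 25, 50, 50, 1],
    'psi5': [0.75, 0.25, 1.0, 1.0, np.nan],
})

keys = ['sample', 'chrom', 'start']

# Cross every observed (sample, chrom, start) group with every observed stop
groups = df[keys].drop_duplicates().itertuples(index=False, name=None)
stops = sorted(df['stop'].unique())
full_index = pd.MultiIndex.from_tuples(
    [group + (stop,) for group in groups for stop in stops],
    names=keys + ['stop'],
)

# Cast to float up front so NaN can be assigned later without changing dtype
indexed = df.set_index(keys + ['stop']).astype(float)

# Rows missing from the original get 0; existing rows keep their values
result = indexed.reindex(full_index, fill_value=0).reset_index()

# ASSUMPTION: a sample fails when any of its original psi5 values is NaN
failed = df.loc[df['psi5'].isna(), 'sample'].unique()
result.loc[result['sample'].isin(failed), ['count', 'psi5']] = np.nan
print(result)
```

Because `reindex` only fills rows that were absent, sampleD's original NaN survives the fill, and the final mask is needed only to blank out its `count` and the added rows.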
Answer 1 (score: 0)
Assuming you are using pandas and your data is in a .csv file:
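#!/usr/bin/env python
import pandas as pd

cols = ['sample', 'chrom', 'start', 'stop', 'count', 'psi5']  # column names
# The file has no header row: read_csv expects header=None, not header=False
data = pd.read_csv('./stuff.txt', header=None, names=cols, sep=',', index_col=False)  # import data
new_rows = []
# data.sample is the DataFrame.sample method, so select the column with data['sample']
for letters in data['sample'].unique():
    # stop values for which this sample has no row
    missing = {200, 250, 300} - set(data.loc[data['sample'] == letters, 'stop'])
    for m in sorted(missing):
        new_rows.append(dict(zip(cols, [letters, 'chr1', 100, m, 0, 0])))
# DataFrame.append was removed in pandas 2.0; concatenate once and renumber the index
data = pd.concat([data, pd.DataFrame(new_rows, columns=cols)], ignore_index=True)
data.head(12)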
#out[1]: