I posted a question before and solved it with R (subset recursively a data.frame), but the file is so large that reading it takes a lot of time and RAM. I am wondering whether I can do the same thing with pandas in Python, since I am new to Python and pandas seems closer to R, at least in its syntax. Here is a summary of my previous post:

Previous post: I have a tab-delimited file with close to 15 million rows and a size of 27 GB. I need an efficient way to subset the data based on two criteria. I could do this with a for loop, but I am wondering whether there is a more elegant, and obviously more efficient, way to do it. The data.frame looks like this:
SNP CHR BP P
rs1000000 chr1 126890980 0.000007
rs10000010 chr4 21618674 0.262098
rs10000012 chr4 1357325 0.344192
rs10000013 chr4 37225069 0.726325
rs10000017 chr4 84778125 0.204275
rs10000023 chr4 95733906 0.701778
rs10000029 chr4 138685624 0.260899
rs1000002 chr3 183635768 0.779574
rs10000030 chr4 103374154 0.964166
rs10000033 chr2 139599898 0.111846
rs10000036 chr4 139219262 0.564791
rs10000037 chr4 38924330 0.392908
rs10000038 chr4 189176035 0.971481
rs1000003 chr3 98342907 0.000004
rs10000041 chr3 165621955 0.573376
rs10000042 chr3 5237152 0.834206
rs10000056 chr4 189321617 0.268479
rs1000005 chr1 34433051 0.764046
rs10000062 chr4 5254744 0.238011
rs10000064 chr4 127809621 0.000044
rs10000068 chr2 36924287 0.000003
rs10000075 chr4 179488911 0.100225
rs10000076 chr4 183288360 0.962476
rs1000007 chr2 237752054 0.594928
rs10000081 chr1 17348363 0.517486
rs10000082 chr1 167310192 0.261577
rs10000088 chr1 182605350 0.649975
rs10000092 chr4 21895517 0.000005
rs10000100 chr4 19510493 0.296693
The first thing I need to do is select those SNPs whose P value is below a threshold, and then sort that subset by CHR and BP. Once I have the subset, I need to take, for every significant SNP, all the SNPs that fall within a 500,000 window upstream and downstream of it; this step defines a region. I need to do this for all the significant SNPs and store each region in a list or something similar for further analysis. For example, in the data frame shown, the most significant SNP (i.e. below a threshold of 0.001) for CHR == chr1 is rs1000000, and for CHR == chr4 it is rs10000092. Those two SNPs would therefore define two regions, and in each of them I need to take the SNPs that fall within 500,000 up and down of the position (BP) of the most significant SNP.
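To make the logic concrete, here is a minimal pandas sketch of those two steps. It is only a sketch: it assumes the file fits in memory (which is exactly the problem with the 27 GB file), the file name and threshold are placeholders, and it takes one region per chromosome, as in the example above.

import pandas as pd

df = pd.read_csv('data.tsv', sep='\t')             # columns: SNP, CHR, BP, P
threshold = 0.001
window = 500_000

# step 1: keep SNPs below the threshold, sorted by CHR and BP
hits = df[df.P < threshold].sort_values(['CHR', 'BP'])

# step 2: for each chromosome, every SNP within +/- 500 kb of its most significant hit
regions = []
for chrom, grp in hits.groupby('CHR'):
    top = grp.loc[grp.P.idxmin()]                  # most significant SNP on this chromosome
    region = df[(df.CHR == chrom) &
                df.BP.between(top.BP - window, top.BP + window)]
    regions.append(region)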
The R (data.table) solution provided by @eddi and @rafaelpere is:
library(data.table) # v1.9.7 (devel version)
df <- fread("C:/folderpath/data.csv") # load your data
setDT(df) # convert your dataset into data.table
#1st step
# Filter rows with P below the threshold (0.05) and sort by CHR, BP
df <- df[P < 0.05, ][order(CHR, BP)]
#2nd step
# for each CHR, keep the SNPs within 500,000 rows of its most significant SNP
df[, {idx = (1:.N)[which.min(P)]
      SNP[seq(max(1, idx - 5e5), min(.N, idx + 5e5))]}, by = CHR]
Answer (score: 2)
First of all, I would strongly recommend switching from CSV files to PyTables (HDF storage) if possible, and sorting your DF by ['CHR','BP'], as it is orders of magnitude faster, allows conditional selection (see the where parameter) and usually takes less space - see this comparison.
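In a nutshell, the idea is to write the sorted data once with indexed data columns and then read it back conditionally. A minimal sketch of that conditional read (the file and key names 'data_sorted.h5' / 'snp' are placeholders; the full script below actually builds such a store):

import pandas as pd

store = pd.HDFStore('data_sorted.h5')
hits = store.select('snp', where='P < 0.001 & CHR == "chr4"')   # only matching rows are read
store.close()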
Here is a working example script, which does the following:

1. generates a sample DF (20M rows, 8 columns: 'SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2'). I intentionally doubled the number of columns because I think your CSV has more of them; a generated CSV file with 20M rows and 8 columns is only about 1.7 GB.
2. stores the generated DF as a CSV file
3. reads the CSV file (only the needed columns) back in chunks
4. sorts the DF by ['CHR','BP'] and saves the result as a PyTable (.h5)
5. reads from the HDF store all rows where P < threshold
6. reads from the HDF store all rows between min(SNP) - 500K and max(SNP) + 500K - you may want to improve this part of the code:
import numpy as np
import pandas as pd
##############################
# generate sample DF
##############################
rows = 2*10**7
chr_lst = ['chr{}'.format(i) for i in range(1,10)]
df = pd.DataFrame({'SNP': np.random.randint(10**6, 10**7, rows).astype(str)})
df['CHR'] = np.random.choice(chr_lst, rows)
df['BP'] = np.random.randint(10**6, 10**9, rows)
df['P'] = np.random.rand(rows)
df.SNP = 'rs' + df.SNP
"""
# NOTE: sometimes this one-shot constructor gives me a MemoryError,
# which is why I built the DF column-by-column above
df = pd.DataFrame({
'SNP': np.random.randint(10**6, 10**7, rows).astype(str),
'CHR': np.random.choice(chr_lst, rows),
'BP': np.random.randint(10**6, 10**9, rows),
'P': np.random.rand(rows)
}, columns=['SNP','CHR','BP','P'])
df.SNP = 'rs' + df.SNP
"""
# make 8 columns out of 4 ...
df = pd.concat([df]*2, axis=1)
df.columns = ['SNP', 'CHR', 'BP', 'P', 'SNP2', 'CHR2', 'BP2', 'P2']
##############################
# store DF as CSV file
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
df.to_csv(csv_path, index=False)
##############################
# read CSV file (only needed cols) in chunks
##############################
csv_path = r'c:/tmp/file_8_cols.csv'
cols = ['SNP', 'CHR', 'BP', 'P']
chunksize = 10**6
df = pd.concat([x for x in pd.read_csv(csv_path, usecols=cols,
chunksize=chunksize)],
ignore_index=True )
##############################
# sort DF and save it as .h5 file
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
(df.sort_values(['CHR','BP'])
.to_hdf(store_path, store_key, format='t', mode='w', data_columns=True)
)
##############################
# read HDF5 file in chunks
##############################
store_path = r'c:/tmp/file_sorted.h5'
store_key = 'test'
chunksize = 10**6
store = pd.HDFStore(store_path)
threshold = 0.001
store_condition = 'P < %s' % threshold
i = store.select(key=store_key, where=store_condition)
# select all rows between `min(SNP) - 500K` and `max(SNP) + 500K`
window_size = 5*10**5
start = max(0, i.index.min() - window_size)
stop = min(store.get_storer(store_key).nrows, i.index.max() + window_size)
df = pd.concat([
x for x in store.select(store_key, chunksize=chunksize,
start=start, stop=stop, )
])
# close the store before exiting...
store.close()
Sample data:
In [39]: df.head(10)
Out[39]:
SNP CHR BP P
18552732 rs8899557 chr1 1000690 0.764227
3837818 rs1883864 chr1 1000916 0.145544
13055060 rs2403233 chr1 1001591 0.116835
9303493 rs5567473 chr1 1002297 0.409937
14370003 rs1661796 chr1 1002523 0.322398
9453465 rs8222028 chr1 1004318 0.269862
2611036 rs9514787 chr1 1004666 0.936439
10378043 rs3345160 chr1 1004930 0.271848
16149860 rs4245017 chr1 1005219 0.157732
667361 rs3270325 chr1 1005252 0.395261