Checking whether values fall within ranges in a large data sample with Python

Date: 2016-02-17 09:25:25

Tags: python-2.7 pandas range

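I am trying to determine which values in a large list of integers fall within any of the ranges held in a second data frame of integers. I have it working with nested loops over pandas data frames, but it is very slow and I can't find a more efficient way to do it. I tried adding a break statement, but it did not behave as I had hoped. Below is an example of what I have - in reality, my files have ~10k values: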

import pandas as pd

mat = pd.DataFrame([[1849,"C", "G", "T", "T"],
[1977,"A", "G", "T", "T"],
[4013,"T", "G", "T", "T"],
[7362,"G", "G", "T", "T"],
[7570,"C", "G", "T", "T"],
[7585,"G", "G", "T", "T"],
[9304,"G", "G", "T", "T"],
[11820,"C", "G", "T", "T"],
[11879,"A", "G", "T", "T"],
[14785,"T", "G", "T", "T"],
[14861,"G", "G", "T", "T"],
[15117,"C", "G", "T", "T"],
[15890,"G", "G", "T", "T"],
[16119,"C", "G", "T", "T"],
[17654,"T", "G", "T", "T"],
[17657,"T", "G", "T", "T"],
[20039,"C", "G", "T", "T"]], columns = ["Pos","Ref", "FileA", "FileB", "FileC"])

cov = pd.DataFrame([["chrom_1", 10,100],
["chrom_1", 10, 1900],
["chrom_1", 2000, 5000],
["chrom_1", 10000, 11111],
["chrom_1", 12110, 13110],
["chrom_1", 13410, 15510],
["chrom_1", 15512, 17510],
["chrom_1", 19512, 20032]], columns = ["Chrom", "Start", "End"])

for file_name in mat.columns[2:]:
    # for each column in the data frame called mat, except Pos and Ref
    row_count = 0
    try:
        # find the file with the ranges matching the column name
        # (this is just example code, so the lookup is commented out)
        # cov = pd.read_csv(str(find_file(file_name + "*", cov_dir)[0]), sep="\t")
        # the column names of this file are Chrom Start End
        # iterate over the positions in mat and, if a position is
        # covered by a range in the file, convert the entry to '-'
        for value in mat["Pos"]:
            for row in range(len(cov)):
                # does the value fall within this range?
                if cov["Start"][row] <= value <= cov["End"][row]:
                    mat.loc[row_count, file_name] = '-'
                # attempted early exit, which did not behave as hoped:
                # if value > cov["End"][row]:
                #     break
            row_count += 1
    except Exception:
        # logging.debug('No file found for: ' + file_name)
        pass

print(mat)

1 Answer:

Answer 0 (score: 0)

I ended up using

if np.logical_and(int(value) <= cov["End"], int(value) >= cov["Start"]).sum() > 0:

This is much faster. I had assumed that the plain logical operators would be doing a similar job.
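The check above compares one value against the whole Start and End columns in a single NumPy call, so the inner Python loop over ranges disappears. As a rough sketch of how that might look in context, and of one possible next step, the code below wraps the answer's condition in a helper and then applies the same comparison to every position at once with NumPy broadcasting. The function names `in_any_range` and `mask_covered` and the small sample frames are made up for illustration and are not part of the original post. The broadcast version builds a boolean array of shape (positions x ranges), which should be fine for ~10k values but is an assumption about available memory.

```python
import numpy as np
import pandas as pd


def in_any_range(value, cov):
    """The answer's check: is `value` inside at least one [Start, End] range?"""
    return np.logical_and(int(value) <= cov["End"], int(value) >= cov["Start"]).sum() > 0


def mask_covered(mat, cov, columns):
    """Return a copy of `mat` with '-' in `columns` on rows whose Pos is covered."""
    pos = mat["Pos"].values
    start = cov["Start"].values
    end = cov["End"].values
    # pos[:, None] has shape (n_pos, 1), so comparing it with the
    # (n_ranges,) arrays broadcasts to an (n_pos, n_ranges) boolean array.
    inside = np.logical_and(pos[:, None] >= start, pos[:, None] <= end)
    covered = inside.any(axis=1)  # True where a position hits any range
    out = mat.copy()
    out.loc[covered, columns] = "-"
    return out


# Hypothetical sample data, cut down from the question's example
mat = pd.DataFrame({"Pos": [1849, 1977, 4013, 20039],
                    "Ref": ["C", "A", "T", "C"],
                    "FileA": ["G", "G", "G", "G"]})
cov = pd.DataFrame({"Chrom": ["chrom_1"] * 3,
                    "Start": [10, 2000, 19512],
                    "End": [1900, 5000, 20032]})

per_value = [bool(in_any_range(v, cov)) for v in mat["Pos"]]
result = mask_covered(mat, cov, ["FileA"])
print(result)
```

Here 1849 and 4013 fall inside a range while 1977 and 20039 do not, so only the first and third rows of FileA become '-'. Returning a copy instead of modifying `mat` in place is a design choice of this sketch; the original loop overwrites the data frame directly.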