Compare two columns of one file against the same columns in another file and fetch the matches (large dataset, 14 GB)

Time: 2019-05-03 08:01:30

Tags: pandas bioinformatics genome

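I have file1 with 650,000 rows and two columns, "Chr" and "Pos". I want to compare this file against a dbSNP data dump (file2) and match on the Chr and Pos columns present in the dump. For every match, I want to fetch the corresponding rsid. I tried this with Python pandas, but my process gets killed. When I tried it with 50,000 rows, it worked.

How can I fetch the rsids from dbSNP (file2) for the whole dataset (file1 = 650k rows)?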

#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)

1 answer:

Answer 0 (score: 1)

Based on Mohit's comment and after reading the pandas 0.24.2 merge documentation, this is how I would approach it:

# Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs

# import pandas
import pandas as pd

# read in data files
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')

# merge on matched columns 
df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')

# export merged df to file
df3.to_csv('rsids_infiniumv2_hg38.txt', index=False, header=True)

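The on parameter of df.merge() takes either a single label or a list of labels. Since you want to match on multiple columns, you can pass a list of column names. In your original code, on='Chr''Pos' does not do that: Python concatenates adjacent string literals, so it asks for a single column named 'ChrPos'.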
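To make the difference concrete, here is a minimal sketch on toy data. The frames and the RSID column name are made up for illustration; the real dbSNP dump may label its columns differently.

```python
import pandas as pd

# Hypothetical toy data standing in for the sample file and the dbSNP dump.
# The column name "RSID" is an assumption; check the header of the real dump.
sample = pd.DataFrame({"Chr": ["1", "1", "2"], "Pos": ["100", "200", "300"]})
dbsnp = pd.DataFrame({
    "Chr": ["1", "2", "2"],
    "Pos": ["100", "300", "999"],
    "RSID": ["rs1", "rs2", "rs3"],
})

# Adjacent string literals are concatenated, so this is a single label.
single_label = 'Chr''Pos'
print(single_label == "ChrPos")

# A list of labels keeps only rows where both Chr and Pos agree.
matched = sample.merge(dbsnp, on=["Chr", "Pos"], how="inner")
print(matched)
```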

Also, how exactly is your process being killed? It would help if you posted the error message.
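If the process is killed because the 14 GB dump does not fit in memory (an assumption, since no error message was posted), one option is to stream the dbSNP file in chunks and merge each chunk against the small sample file. The sketch below is untested against the real data; the function name merge_in_chunks and the chunk size are hypothetical, and it assumes both files are tab-separated with Chr and Pos headers, as in the code above.

```python
import pandas as pd

def merge_in_chunks(sample_path, dbsnp_path, out_path, chunksize=1_000_000):
    """Merge a small sample file against a large file read chunk by chunk.

    Returns the total number of matched rows written to out_path.
    """
    # The sample file (650k rows in the question) is small enough to hold in memory.
    sample = pd.read_csv(sample_path, sep="\t", dtype=str)

    first = True
    total = 0
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(dbsnp_path, sep="\t", dtype=str, chunksize=chunksize):
        matched = sample.merge(chunk, on=["Chr", "Pos"], how="inner")
        # Write the header once, then append later chunks without it.
        matched.to_csv(
            out_path,
            sep="\t",
            index=False,
            mode="w" if first else "a",
            header=first,
        )
        first = False
        total += len(matched)
    return total

# Example call with the file names from the question:
# merge_in_chunks("v2_infi_chr_pos.csv", "dbsnp150_header.txt", "rsids_infiniumv2_hg38.txt")
```

Only one chunk of the dump is in memory at a time, so peak usage depends on chunksize rather than on the full 14 GB. Note that this sketch writes tab-separated output, whereas the code above writes commas, and the rows come out in dbSNP order rather than sample order.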