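I have file1 with 650,000 rows and two columns, "Chr" and "Pos". I want to compare this file against a dbSNP data dump (file2) and match on the Chr and Pos columns present in the dbSNP dump. When a row matches, I want to fetch the corresponding rsid. I tried using Python pandas, but my process was killed. When I tried it with 50,000 rows, it worked.
How can I get the rsids from dbSNP (file2) for the whole dataset (file1 = 650k rows)?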
#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)
Answer 0 (score: 1)
Based on Mohit's comment and a read through the Pandas 0.24.2 merge documentation, this is how I would approach it:
# Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
# import pandas
import pandas as pd
# read in data files
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
# merge on matched columns
df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')
# export merged df to file
df3.to_csv(r'rsids_infiniumv2_hg38.txt', index=False, header=True)
The on parameter of df.merge() takes a single label or a list of labels. Since you want to match on multiple columns, you can pass a list of column names.
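As a small illustration of matching on a list of columns, here is a sketch on toy data. The values and the rsid column name are made up for the example and are assumptions, since the real dbSNP dump may name its columns differently.

```python
import pandas as pd

# Toy stand-ins for the sample file and the dbSNP dump (all values are hypothetical).
sample = pd.DataFrame({"Chr": ["1", "1", "2"], "Pos": ["100", "200", "100"]})
dbsnp = pd.DataFrame({
    "Chr": ["1", "2", "2"],
    "Pos": ["100", "100", "300"],
    "rsid": ["rs_a", "rs_b", "rs_c"],  # hypothetical column name and IDs
})

# A row is kept only when both Chr and Pos agree in the two frames.
matched = sample.merge(dbsnp, on=["Chr", "Pos"], how="inner")
print(matched)
```

Only the rows whose Chr and Pos both appear in the two frames survive the inner merge; the sample row at Chr 1, Pos 200 has no partner and is dropped.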
Also, how exactly was your process killed? Posting the error message would help.
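If the process was killed because the machine ran out of memory (an assumption, since no error message was posted), one thing to try is reading the dbSNP dump in chunks and merging each chunk against the smaller sample file. The sketch below is not a tested solution for your data: the function name merge_in_chunks and the default chunk size are hypothetical, and it assumes both files are tab-separated with Chr and Pos header columns, as in the code above.

```python
import pandas as pd

def merge_in_chunks(sample_path, dbsnp_path, out_path, chunksize=1_000_000):
    """Merge a small sample file against a large dbSNP dump chunk by chunk.

    Hypothetical helper: assumes both files are tab-separated and
    both contain 'Chr' and 'Pos' columns.
    """
    # The sample file is small enough to hold in memory.
    sample = pd.read_csv(sample_path, sep="\t", dtype=str)
    total = 0
    first = True
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(dbsnp_path, sep="\t", dtype=str, chunksize=chunksize):
        hits = sample.merge(chunk, on=["Chr", "Pos"], how="inner")
        # Write the header once, then append the matches from later chunks.
        hits.to_csv(out_path, index=False, header=first, mode="w" if first else "a")
        first = False
        total += len(hits)
    return total
```

A call such as merge_in_chunks("v2_infi_chr_pos.csv", "dbsnp150_header.txt", "rsids_infiniumv2_hg38.txt") would then keep only one chunk of the dbSNP dump in memory at a time. The chunk size is a trade-off between memory use and speed, so it may need tuning.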