根据不同长度的另一个数据帧中是否存在NA过滤一个数据帧

时间:2019-07-24 06:30:22

标签: r dataframe filter na

我有一个数据框A,要根据数据框B中的相应样本名称在第二列(ID)中是否具有NA进行筛选。数据框A中的样品名称重复,而数据框B中的样品名称仅出现一次,从而使数据框的长度不同。

本质上,我希望有一个最终表,其中将从数据帧A中完全删除数据帧B中具有值的样本名称“ ID”列。

我尝试了以下过滤器功能,但它给了我一个与不同数据帧长度有关的错误:

filtered_table <- filter(A_table_to_filter, is.na(B_filter_table$ID))

以下是一些示例数据:

dataframe_A_table_to_filter <- data.frame(sample = c("OP2645ii_d","OP5048___g","OP5046___e","OP5048___g","OP2413iiia","OP5048___g","OP5043___b","OP5048___g","OP3088i__a","OP5048___g","OP5046___a","OP5048___g","OP5048___b","OP5048___g","OP5043___a","OP5048___g","OP2645ii_d","OP5048___f","OP2645ii_d","OP5044___c","OP2413iiib","OP5048___g","OP5046___c","OP5048___g","OP5046___d","OP5046___e","OP5048___e","OP5048___g","OP5046___e","OP5048___c","OP2413iiia","OP5046___e","OP2645ii_b","OP2645ii_d","OP2645ii_a","OP5046___e","OP5046___e","OP5048___d","OP5046___e","OP5048___e","OP2413iiia","OP5048___f","OP5044___c","OP5046___e","OP2413iiia","OP2645ii_c","OP5046___e","OP5047___b","OP2645ii_a","OP2645ii_d","OP5046___c","OP5046___e","OP5046___d","OP5048___g","OP2645ii_e","OP5048___g","OP2645ii_c","OP5046___d","OP5048___c","OP5048___g","OP2645ii_c","OP5048___c","OP2645ii_c","OP5048___e","OP2645ii_c","OP5048___g","OP5046___e","OP5048___f","OP2645ii_d","OP5046___d","OP2645ii_c","OP5046___c","OP2645ii_d","OP5048___d","OP5043___b","OP5048___f","OP5046___c","OP5048___f","OP2645ii_d","OP5048___c","OP2413iiib","OP5046___e","OP2413iiib","OP5048___f","OP5044___a","OP5048___g","OP5043___a","OP5048___f","OP3088i__a","OP5048___f","OP5048___e","OP5048___f","OP5044___c","OP5048___b","OP2645ii_d","OP5047___b","OP2413iiia","OP2645ii_b","OP5046___a","OP5048___f","OP5043___b","OP5044___c","OP2645ii_c","OP5048___d","OP5047___b","OP5048___g","OP5048___b","OP5048___f","OP2413iiia","OP5044___c","OP2645ii_b","OP5046___e","OP2645ii_c","OP5047___b","OP5044___c","OP5046___a","OP2413iiib","OP2645ii_c","OP2645ii_e","OP5046___e","OP5048___d","OP5048___g","OP5046___d","OP5048___b","OP2645ii_a","OP2645ii_c","OP3088i__a","OP5044___c"),
                    gr = c("gr3","gr2","gr5","gr2","gr1","gr2","gr1","gr2","gr1","gr2","gr5","gr2","gr5","gr2","gr3","gr2","gr3","gr2","gr3","gr2","gr4","gr2","gr1","gr2","gr4","gr5","gr1","gr2","gr5","gr4","gr1","gr5","gr2","gr3","gr1","gr5","gr5","gr4","gr5","gr1","gr1","gr2","gr2","gr5","gr1","gr5","gr5","gr4","gr1","gr3","gr1","gr5","gr4","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr2","gr5","gr2","gr3","gr4","gr5","gr1","gr3","gr4","gr1","gr2","gr1","gr2","gr3","gr4","gr4","gr5","gr4","gr2","gr3","gr2","gr3","gr2","gr1","gr2","gr1","gr2","gr2","gr5","gr3","gr4","gr1","gr2","gr5","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr2","gr1","gr2","gr2","gr5","gr5","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr4","gr2","gr4","gr5","gr1","gr5","gr1","gr2"), dist = c(7.59036265840066,7.59036265840066,6.44967614976991,6.44967614976991,6.41995474653303,6.41995474653303,6.34991780754275,6.34991780754275,6.18262339507581,6.18262339507581,6.16265512136205,6.16265512136205,6.15423247141993,6.15423247141993,6.14014702309176,6.14014702309176,6.05863330633262,6.05863330633262,5.96292319399187,5.96292319399187,5.94395576047878,5.94395576047878,5.86375256401321,5.86375256401321,5.78102441659872,5.78102441659872,5.7345012847377,5.7345012847377,5.67874617854728,5.67874617854728,5.53957425202641,5.53957425202641,5.44753353881181,5.44753353881181,5.43742118904064,5.43742118904064,5.42270717863966,5.42270717863966,5.40852682965639,5.40852682965639,5.37916907844967,5.37916907844967,5.28542559212653,5.28542559212653,5.28127574537985,5.28127574537985,5.27883657001377,5.27883657001377,5.26111686809869,5.26111686809869,5.25446925024172,5.25446925024172,5.18612527748647,5.18612527748647,5.16152942865884,5.16152942865884,5.13493683199873,5.13493683199873,5.11477487647704,5.11477487647704,5.02518908529805,5.02518908529805,4.96387986494177,4.96387986494177,4.93803544508224,4.93803544508224,4.90484535173276,4.90484535173276,4.88609183324537,4.88609183324537,4.87064721174553,4.87064721174553,4.87044988024298,4.87044988024298,4.87018300982248,4.87018300982248,4.81850235997663,4.81850235997663,4.81315159594962,4.81315159594962,4.79708386349633,4.79708386349633,4.79137478521543,4.79137478521543,4.79076662890575,4.79076662890575,4.7629557294752,4.7629557294752,4.75107063347786,4.75107063347786,4.73927394720927,4.73927394720927,4.65856308508064,4.65856308508064,4.65459244413676,4.65459244413676,4.65168460273128,4.65168460273128,4.64631379714574,4.64631379714574,4.63427356346989,4.63427356346989,4.61758860663907,4.61758860663907,4.61520572342783,4.61520572342783,4.59738310693479,4.59738310693479,4.56270527374553,4.56270527374553,4.53521289030436,4.53521289030436,4.52843905005562,4.52843905005562,4.51867277408847,4.51867277408847,4.50634336104738,4.50634336104738,4.46047201471265,4.46047201471265,4.45241678415362,4.45241678415362,4.43613430884318,4.43613430884318,4.43212669019848,4.43212669019848,4.41051867890157,4.41051867890157))


dataframe_B_filter_table <- data.frame(sample = c("OP2413iiia","OP2413iiib","OP2645ii_a","OP2645ii_b","OP2645ii_c","OP2645ii_d","OP2645ii_e","OP3088i__a","OP5043___a","OP5043___b","OP5044___a","OP5044___b","OP5044___c","OP5046___a","OP5046___b","OP5046___c","OP5046___d","OP5046___e","OP5047___a","OP5047___b","OP5048___b","OP5048___c","OP5048___d","OP5048___e","OP5048___f","OP5048___g","OP5048___h","OP5049___a","OP5049___b","OP5051DNAa","OP5051DNAc","OP5052DNAa","OP5053DNAa","OP5053DNAb","OP5053DNAc","OP5054DNAa","OP5054DNAb","OP5054DNAc","OP5051DNAb"),
                   ID = c(NA,NA,"gr1",NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,"gr2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))

我希望有一个表,其中数据框B中带有样本名称的行(不是NA(即具有值))已从数据框A中删除。但是,我收到一个与表长度不同有关的错误。

2 个答案:

答案 0 :(得分:1)

因此,您想保留与其对应的sample具有ID值的NA。 我们可以match sampleID获取相应的dfB,并且仅当它返回NA

时保留
dfA[is.na(dfB$ID[match(dfA$sample, dfB$sample)]), ]

#       sample  gr   dist
#3   OP5046___e gr5 6.4497
#5   OP2413iiia gr1 6.4200
#7   OP5043___b gr1 6.3499
#9   OP3088i__a gr1 6.1826
#11  OP5046___a gr5 6.1627
#13  OP5048___b gr5 6.1542
#....

为便于阅读,将您的数据框重命名为dfAdfB


如果您确定dfA中包含dfB中的每个值,我们也可以使用

dfA[dfA$sample %in% dfB$sample[is.na(dfB$ID)], ]

答案 1 :(得分:1)

仅意识到已发布答案,将给出基于{data.table}的解决方案。 为了附加到您的代码中,我运行了以下几行。

DT_A <- data.table(dataframe_A_table_to_filter)
DT_B <- data.table(dataframe_B_filter_table)

KeepSamples <- DT_B[is.na(ID), sample]
DT_A <- DT_A[sample %in% KeepSamples, ]