Matching data frames by id fails for some values

Time: 2017-04-28 11:09:14

Tags: r

I have a very large dataset with a few thousand missing values, like this:

df1:
                            doi journal year
1  10.1037/0002-9432.76.1.13    <NA>   NA
2  10.1037/0002-9432.76.1.13    <NA>   NA
3  10.1037/0002-9432.76.1.13    <NA>   NA
4 10.1037/0003-066X.60.8.750    <NA>   NA
5 10.1037/0003-066X.60.8.750    <NA>   NA
6 10.1037/0003-066X.60.8.750    <NA>   NA

I have another data frame that contains all of the missing journal names and years:

df2:
                          doi year                             journal
17  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
18  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
19  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
31 10.1037/0003-066x.60.8.750 2005               American Psychologist
32 10.1037/0003-066x.60.8.750 2005               American Psychologist
33 10.1037/0003-066x.60.8.750 2005               American Psychologist

However, when I try to match the two by their doi values with

df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]

it only works for some of the rows:

Result:
                             doi                             journal year
1  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
2  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
3  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
4 10.1037/0003-066X.60.8.750                                <NA>   NA
5 10.1037/0003-066X.60.8.750                                <NA>   NA
6 10.1037/0003-066X.60.8.750                                <NA>   NA

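I have tried different ways of matching the data frames (such as this and this), as well as trimming whitespace when loading the data frames, but without success. "doi" and "journal" are character vectors, and "year" is an integer. I would be very grateful if anyone has some insight.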

2 answers:

Answer 0: (score: 1)

The OP mentioned that he has a very large dataset with a few thousand missing values.

This is why I felt it was worth suggesting a data.table solution that uses an update join instead of base R match(), although the actual problem has already been solved by using tolower().

library(data.table)
#prepare doi
setDT(df1)[, doi := tolower(doi)]
setDT(df2)[, doi := tolower(doi)]
#join
df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]
df1
#                          doi                             journal year
#1:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#2:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#3:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#4: 10.1037/0003-066X.60.8.750               American Psychologist 2005
#5: 10.1037/0003-066X.60.8.750               American Psychologist 2005
#6: 10.1037/0003-066X.60.8.750               American Psychologist 2005

Note that this replaces all values of year and journal in df1 whose doi has a match in df2 with the values given in df2, regardless of whether they are NA or not.

To replace only the values that are NA, use
df1[unique(df2), on = "doi", 
    `:=`(year = replace(year, is.na(year), i.year), 
         journal = replace(journal, is.na(journal), i.journal))]
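
One caveat about the replace() variant above: replace() with a logical index consumes the replacement values in order from the start of the vector, so when only some of the matched rows are NA, the filled-in values may not line up with their rows (R also warns about the length mismatch). The following is a minimal sketch of an elementwise alternative, assuming a data.table version that provides fcoalesce() (1.12.4 or later). The tables toy1 and toy2 and their contents are hypothetical and not the OP's data.

```r
library(data.table)

# Hypothetical toy data: row 2 already has a year and a journal
toy1 <- data.table(
  doi     = c("a", "a", "b", "b"),
  year    = c(NA, 1999L, NA, NA),
  journal = c(NA, "Kept", NA, NA)
)
toy2 <- data.table(
  doi     = c("a", "b"),
  year    = c(2006L, 2005L),
  journal = c("J A", "J B")
)

# Update join: fcoalesce() keeps an existing value and falls back
# to the value from toy2 only where the existing one is NA
toy1[toy2, on = "doi",
     `:=`(year    = fcoalesce(year, i.year),
          journal = fcoalesce(journal, i.journal))]
toy1
```

Row 2 keeps its original year and journal, while the NA rows are filled from the lookup table. For data where every matched row is NA, as in the question, both variants should give the same result.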

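Benchmark

For the speed comparison between the three approaches, df1 has been appended to itself repeatedly so that it has about 100,000 rows. Because df1 itself is updated, each benchmark run has to start with a fresh copy. The copy operation is timed separately as well, so its cost can be read off the results.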

microbenchmark::microbenchmark(
  copy = df1 <- copy(df1_orig),
  OP_match = {
    df1 <- copy(df1_orig)
    df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
    df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]
  },
  update_on_join = {
    df1 <- copy(df1_orig)
    df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]
  },
  replace_on_join = {
    df1 <- copy(df1_orig)
    df1[unique(df2), on = "doi", 
        `:=`(year = replace(year, is.na(year), i.year), 
             journal = replace(journal, is.na(journal), i.journal))]
  },
  times = 100L
)

The results show that, for this case, update_on_join is nearly three times faster than base R with match():

Unit: microseconds
            expr       min        lq      mean    median        uq        max neval
            copy   760.449   978.691  1129.290  1071.388  1202.974   2085.383   100
        OP_match 12376.362 14532.352 16215.333 15295.821 17497.497  35352.941   100
  update_on_join  5101.879  5585.939  6136.479  5914.435  6416.240   9272.643   100
 replace_on_join  7998.306  8729.303 11822.586  9367.416  9802.767 227385.521   100

Data

library(data.table)
df1 <- fread(
  "rn                    doi journal year
1  10.1037/0002-9432.76.1.13    <NA>   NA
2  10.1037/0002-9432.76.1.13    <NA>   NA
3  10.1037/0002-9432.76.1.13    <NA>   NA
4  10.1037/0003-066X.60.8.750   <NA>   NA
5  10.1037/0003-066X.60.8.750   <NA>   NA
6  10.1037/0003-066X.60.8.750   <NA>   NA",
  drop = 1, na.strings = c("NA", "<NA>"))
df1[, journal := as.character(journal)]
df1[, year := as.integer(year)]

df2 <- fread(
  "rn,                       doi, year,                             journal
  17,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  18,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  19,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  31, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist
  32, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist
  33, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist",
  drop = 1)

#prepare doi
df1[, doi := tolower(doi)]
df2[, doi := tolower(doi)]

#create benchmark data

df1_orig <- copy(df1)
df2_orig <- copy(df2)

for (i in seq_len(14L)) df1_orig <- rbind(df1_orig, df1_orig)
nrow(df1_orig)

Answer 1: (score: 0)

Many thanks to @Sarina and @Jaap, your comments were right: tolower solved the problem for me.

  

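R is case sensitive, so your problem lies in the way your doi values are written, which produces the NAs: you have an uppercase X in the doi in one df but not in the other. - Sarina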

     

To add to @Sarina's comment: using tolower in the match, as in match(tolower(df1$doi[is.na(df1$year)]), tolower(df2$doi)), can solve the problem. - Jaap
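
The two comments can be put together in a short base R sketch. The data frames toy1 and toy2 are hypothetical stand-ins with one row per doi, mirroring the uppercase/lowercase mismatch from the question. Lower-casing happens only inside match(), so the stored doi values stay as they were.

```r
# Hypothetical toy data mirroring the case mismatch in the question
toy1 <- data.frame(
  doi  = c("10.1037/0002-9432.76.1.13", "10.1037/0003-066X.60.8.750"),
  year = c(NA_integer_, NA_integer_),
  stringsAsFactors = FALSE
)
toy2 <- data.frame(
  doi  = c("10.1037/0002-9432.76.1.13", "10.1037/0003-066x.60.8.750"),
  year = c(2006L, 2005L),
  stringsAsFactors = FALSE
)

# Exact matching is case sensitive, so the second doi is not found
exact <- match(toy1$doi, toy2$doi)
exact

# Lower-case both sides for the comparison only
missing_year <- is.na(toy1$year)
idx <- match(tolower(toy1$doi[missing_year]), tolower(toy2$doi))
toy1$year[missing_year] <- toy2$year[idx]
toy1
```

If the doi column should be normalised permanently, converting it once with tolower() when the data is loaded would avoid repeating the conversion in every lookup.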