I have a very large data set with a few thousand missing values, like this:
df1:
doi journal year
1 10.1037/0002-9432.76.1.13 <NA> NA
2 10.1037/0002-9432.76.1.13 <NA> NA
3 10.1037/0002-9432.76.1.13 <NA> NA
4 10.1037/0003-066X.60.8.750 <NA> NA
5 10.1037/0003-066X.60.8.750 <NA> NA
6 10.1037/0003-066X.60.8.750 <NA> NA
I have another data frame that contains all the missing journal names and years:
df2:
doi year journal
17 10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
18 10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
19 10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
31 10.1037/0003-066x.60.8.750 2005 American Psychologist
32 10.1037/0003-066x.60.8.750 2005 American Psychologist
33 10.1037/0003-066x.60.8.750 2005 American Psychologist
However, when I try to match the two by their doi values:
df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]
it only works for some of the rows:
Result:
doi journal year
1 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
2 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
3 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
4 10.1037/0003-066X.60.8.750 <NA> NA
5 10.1037/0003-066X.60.8.750 <NA> NA
6 10.1037/0003-066X.60.8.750 <NA> NA
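I have tried several different ways of matching the data frames, as well as trimming whitespace when loading the data frames, but without success. "doi" and "journal" are character vectors and "year" is an integer. Many thanks if anyone has some insight.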
Answer 0 (score: 1)
The OP mentions that he has a very large data set with a few thousand missing values.
That is why I feel obliged to suggest a data.table solution using an update join instead of base R match(), even though the actual problem has already been solved by using tolower().
library(data.table)
# prepare doi
setDT(df1)[, doi := tolower(doi)]
setDT(df2)[, doi := tolower(doi)]
# join
df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]
df1
# doi journal year
#1: 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#2: 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#3: 10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#4: 10.1037/0003-066x.60.8.750 American Psychologist 2005
#5: 10.1037/0003-066x.60.8.750 American Psychologist 2005
#6: 10.1037/0003-066x.60.8.750 American Psychologist 2005
Note that this replaces all values of year and journal in df1 for which df2 has a matching doi, regardless of whether they are NA or not.
If only the NA values are to be replaced:
df1[unique(df2), on = "doi",
`:=`(year = replace(year, is.na(year), i.year),
journal = replace(journal, is.na(journal), i.journal))]
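A caveat on the NA-only variant, offered as a sketch rather than a correction of the benchmarked code: replace() fills the NA positions with the first values of i.year in order, so it lines up with the rows only when every matched row is NA (as in the example data). If some rows already hold a value, an element-wise choice such as data.table's fcoalesce() is safer. The names df1_demo and df2_demo and their contents are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: row 2 already has a year that must be kept
df1_demo <- data.table(
  doi  = c("a", "a", "b"),
  year = c(NA, 1999L, NA)
)
df2_demo <- data.table(doi = c("a", "b"), year = c(2006L, 2005L))

# fcoalesce() picks, row by row, the first non-NA value,
# so existing values in df1_demo are kept and only NAs are filled
df1_demo[df2_demo, on = "doi", year := fcoalesce(year, i.year)]

df1_demo$year
# [1] 2006 1999 2005
```

fcoalesce() requires both arguments to have the same type, which holds here because both year columns are integer.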
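For a speed comparison of the three approaches, df1 has been appended to itself repeatedly so that it has about 100,000 rows. Because df1 itself is updated in place, each benchmark run has to start with a fresh copy. The copy operation is included in the benchmark as well.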
microbenchmark::microbenchmark(
copy = df1 <- copy(df1_orig),
OP_match = {
df1 <- copy(df1_orig)
df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]
},
update_on_join = {
df1 <- copy(df1_orig)
df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]
},
replace_on_join = {
df1 <- copy(df1_orig)
df1[unique(df2), on = "doi",
`:=`(year = replace(year, is.na(year), i.year),
journal = replace(journal, is.na(journal), i.journal))]
},
times = 100L
)
The results show that, for this case, update_on_join is almost three times faster than base R using match():
Unit: microseconds
expr min lq mean median uq max neval
copy 760.449 978.691 1129.290 1071.388 1202.974 2085.383 100
OP_match 12376.362 14532.352 16215.333 15295.821 17497.497 35352.941 100
update_on_join 5101.879 5585.939 6136.479 5914.435 6416.240 9272.643 100
replace_on_join 7998.306 8729.303 11822.586 9367.416 9802.767 227385.521 100
The sample data and the benchmark data were created as follows:
library(data.table)
df1 <- fread(
"rn doi journal year
1 10.1037/0002-9432.76.1.13 <NA> NA
2 10.1037/0002-9432.76.1.13 <NA> NA
3 10.1037/0002-9432.76.1.13 <NA> NA
4 10.1037/0003-066X.60.8.750 <NA> NA
5 10.1037/0003-066X.60.8.750 <NA> NA
6 10.1037/0003-066X.60.8.750 <NA> NA",
drop = 1, na.strings = c("NA", "<NA>"))
df1[, journal := as.character(journal)]
df1[, year := as.integer(year)]
df2 <- fread(
"rn, doi, year, journal
17, 10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
18, 10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
19, 10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
31, 10.1037/0003-066x.60.8.750, 2005, American Psychologist
32, 10.1037/0003-066x.60.8.750, 2005, American Psychologist
33, 10.1037/0003-066x.60.8.750, 2005, American Psychologist",
drop = 1)
#prepare doi
df1[, doi := tolower(doi)]
df2[, doi := tolower(doi)]
#create benchmark data
df1_orig <- copy(df1)
df2_orig <- copy(df2)
for (i in seq_len(14L)) df1_orig <- rbind(df1_orig, df1_orig)
nrow(df1_orig)
Answer 1 (score: 0)
Many thanks to @Sarina and @Jaap. Your comments were right: tolower() solved the problem for me.
R is case-sensitive, so the NAs come from the way the doi is written: the doi has an uppercase X in one df but a lowercase x in the other df. - Sarina
To add to @Sarina's comment: using tolower() in the match, like match(tolower(df1$doi[is.na(df1$year)]), tolower(df2$doi)), should fix the problem. - Jaap
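To make the two comments concrete, here is a minimal base R sketch using the two doi values from the question. The vector names doi_df1 and doi_df2 are made up for illustration; the setdiff() call is just one possible way to spot keys that have no exact counterpart.

```r
# Two DOIs taken from the question; the second differs only in X vs. x
doi_df1 <- c("10.1037/0002-9432.76.1.13", "10.1037/0003-066X.60.8.750")
doi_df2 <- c("10.1037/0002-9432.76.1.13", "10.1037/0003-066x.60.8.750")

# match() compares strings exactly, so the second DOI is not found
idx_exact <- match(doi_df1, doi_df2)
idx_exact
# [1]  1 NA

# Lower-casing both sides makes the comparison case-insensitive
idx_lower <- match(tolower(doi_df1), tolower(doi_df2))
idx_lower
# [1] 1 2

# DOIs in the first vector without an exact counterpart in the second
unmatched <- setdiff(doi_df1, doi_df2)
unmatched
# [1] "10.1037/0003-066X.60.8.750"
```

Running a check like setdiff() before the update would have shown that the failing rows all share the uppercase X.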