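Replace missing values (NA) in one data set with values from another where columns match

Time: 2015-09-17 19:51:56

Tags: r plyr

I have a data frame (datadf) with 3 columns, 'x', 'y' and 'z'. Several 'x' values are missing (NA). 'y' and 'z' are non-measured variables.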

x    y z
153  a 1
163  b 1
NA   d 1
123  a 2 
145  e 2
NA   c 2 
NA   b 1
199  a 2

I have another data frame (imputeddf) with the same three columns:

 x  y z
123 a 1
145 a 2
124 b 1
168 b 2
123 c 1
176 c 2
184 d 1
101 d 2
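
To reproduce the problem, the two tables above can be built as follows. This is a minimal sketch: the question does not state the column types, so treating 'x' and 'z' as numeric and 'y' as character is an assumption.

```r
# Data copied from the question; the column types are an assumption.
datadf <- data.frame(
  x = c(153, 163, NA, 123, 145, NA, NA, 199),
  y = c("a", "b", "d", "a", "e", "c", "b", "a"),
  z = c(1, 1, 1, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

imputeddf <- data.frame(
  x = c(123, 145, 124, 168, 123, 176, 184, 101),
  y = c("a", "a", "b", "b", "c", "c", "d", "d"),
  z = c(1, 2, 1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)
```

Each (y, z) pair appears exactly once in imputeddf, which is what makes a key-based lookup unambiguous.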

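I would like to replace the NAs in 'x' in 'datadf' with the values from 'imputeddf' where 'y' and 'z' match between the two data sets (each combination of 'y' and 'z' has its own value of 'x' to fill in).

Desired result: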

x    y z
153  a 1
163  b 1
184  d 1
123  a 2 
145  e 2
176  c 2 
124  b 1
199  a 2

I am trying something like this:

finaldf <- datadf
finaldf$x <- if(datadf[!is.na(datadf$x)]){ddply(datadf, x=imputeddf$x[datadf$y == imputeddf$y & datadf$z == imputeddf$z])}else{datadf$x}

But it doesn't work.

What is the best way to fill in the NAs using my imputed-value data frame?

3 Answers:

Answer 0: (score: 6)

I would do this:

library(data.table)
setDT(DF1); setDT(DF2)

DF1[DF2, x := ifelse(is.na(x), i.x, x), on=c("y","z")]

which gives

     x y z
1: 153 a 1
2: 163 b 1
3: 184 d 1
4: 123 a 2
5: 145 e 2
6: 176 c 2
7: 124 b 1
8: 199 a 2

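Comment. This approach is not ideal, because it joins on all of DF1 while we only need to join on the subset where is.na(x). An improved version looks like this (thanks, @Arun):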

DF1[is.na(x), x := DF2[.SD, x, on=c("y", "z")]]

This way is analogous to @RHertel's answer.
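
Applied to the question's data, the subset-only update can be sketched as below. The snippet assumes DF1 corresponds to datadf and DF2 to imputeddf, that the data.table package is installed, and that 'x' and 'z' are numeric; the data are repeated so the snippet runs on its own.

```r
library(data.table)

# DF1 plays the role of datadf, DF2 the role of imputeddf (assumed mapping).
DF1 <- data.table(
  x = c(153, 163, NA, 123, 145, NA, NA, 199),
  y = c("a", "b", "d", "a", "e", "c", "b", "a"),
  z = c(1, 1, 1, 2, 2, 2, 1, 2)
)
DF2 <- data.table(
  x = c(123, 145, 124, 168, 123, 176, 184, 101),
  y = c("a", "a", "b", "b", "c", "c", "d", "d"),
  z = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# Only rows of DF1 with a missing x are touched. Inside j, .SD holds those
# rows, and DF2[.SD, x, on = ...] looks up DF2's x for each of their (y, z) pairs.
DF1[is.na(x), x := DF2[.SD, x, on = c("y", "z")]]

print(DF1)
```

Because the lookup returns one value per row of .SD in the same order, the filled values line up with the NA rows without any sorting step.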

Answer 1: (score: 3)

Here is an alternative with base R:

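# merge() sorts its result by the key columns, so its rows need not line up with
# the NA rows of df1; carry the row index through the merge and assign by it.
na_keys <- df1[is.na(df1$x), c("y","z")]
na_keys$row <- which(is.na(df1$x))
filled <- merge(df2, na_keys)
df1$x[filled$row] <- filled$x
> df1
#    x y z
#1 153 a 1
#2 163 b 1
#3 184 d 1
#4 123 a 2
#5 145 e 2
#6 176 c 2
#7 124 b 1
#8 199 a 2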
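
Since merge() sorts its result by the join keys, a lookup that keeps the original row order may be easier to reason about. The following is a minimal base R sketch using match() on a combined key; it is an alternative technique, not the answerer's code, and it assumes the separator "|" never occurs in 'y' or 'z'. The data are repeated so the snippet runs on its own.

```r
df1 <- data.frame(
  x = c(153, 163, NA, 123, 145, NA, NA, 199),
  y = c("a", "b", "d", "a", "e", "c", "b", "a"),
  z = c(1, 1, 1, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  x = c(123, 145, 124, 168, 123, 176, 184, 101),
  y = c("a", "a", "b", "b", "c", "c", "d", "d"),
  z = c(1, 2, 1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Build one lookup key per row from y and z (assumes "|" is not in the data).
key1 <- paste(df1$y, df1$z, sep = "|")
key2 <- paste(df2$y, df2$z, sep = "|")

# For each NA row, match() gives the position of its key in df2
# (NA if the key is absent, in which case x simply stays NA).
na_rows <- which(is.na(df1$x))
df1$x[na_rows] <- df2$x[match(key1[na_rows], key2)]

print(df1)
```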

Answer 2: (score: 0)

A dplyr solution, conceptually identical to the answers above. To extract just the rows of imputeddf that correspond to NAs in datadf, use semi_join. Then, use another join to match back to datadf. (Unfortunately, this step is not very clean.)

This does what you need:

library(dplyr)
replacement_rows <- imputeddf %>%
  semi_join(datadf %>% filter(is.na(x)), by = c("y", "z"))
datadf <- datadf %>%
  left_join(replacement_rows, by = c("y", "z")) %>%
  mutate(x = if_else(is.na(x.x), x.y, x.x)) %>%
  select(x, y, z)
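
A somewhat shorter variant is sketched below, using dplyr's coalesce(), which returns the first non-missing value at each position. It skips the semi_join step, on the assumption that each (y, z) pair appears at most once in imputeddf; otherwise left_join would duplicate rows. The data are repeated so the snippet runs on its own, and the column types are an assumption.

```r
library(dplyr)

datadf <- data.frame(
  x = c(153, 163, NA, 123, 145, NA, NA, 199),
  y = c("a", "b", "d", "a", "e", "c", "b", "a"),
  z = c(1, 1, 1, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)
imputeddf <- data.frame(
  x = c(123, 145, 124, 168, 123, 176, 184, 101),
  y = c("a", "a", "b", "b", "c", "c", "d", "d"),
  z = c(1, 2, 1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

filled <- datadf %>%
  # After the join, x.x is the original x and x.y is the imputed x.
  left_join(imputeddf, by = c("y", "z")) %>%
  # Keep the original value where present, otherwise fall back to the imputed one.
  mutate(x = coalesce(x.x, x.y)) %>%
  select(x, y, z)

print(filled)
```

Rows with no counterpart in imputeddf (such as y = "e", z = 2) keep their original 'x', because coalesce() only falls back to the imputed column where the original is NA.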