Question

I'm having trouble assigning a data frame to a subset of another data frame. In the example below, the line

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

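only modifies one column instead of two. I'd expect it to modify either no columns or both, not just one. I wrote a function that fills in the PrefName and CountyID columns of data frame ds wherever they are NA, by joining ds to another data frame, cs.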

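As you can see if you run it, the test fails because PrefName is not filled in. After some debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the join's result somehow sets PrefName back to NA.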

# fully copy-paste-run-able (but broken) code
suppressMessages({
    library("plyr")
    library("methods")
    library("testthat")
})

# Fill in the missing PrefName/CountyIDs in delstat
# - Find the missing values in Delstat
# - Grab the CityState Primary Record values
# - Match on zipcode to fill in the holes in the delstat data
# - Remove any codes that could not be fixed
# - @param ds: delstat dataframe with 6 columns (see test case)
# - @param cs: citystate dataframe with 6 columns (see test case)
getMissingCounties <- function(ds, cs) {
    if (length(is.na(ds$CountyID))) {
        cavities <- which(is.na(ds$CountyID))
        fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]
        ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
        ds <- ds[!is.na(ds$CountyID),]
    }
    return(ds)
}

test_getMissingCounties <- function() {
    ds <- data.frame(
        CityStateKey = c(1, 2, 3, 4),
        ZipCode      = c(11, 22, 33, 44),
        Business     = c(1, 1, 1, 1),
        Residential  = c(1, 1, 1, 1),
        PrefName     = c("One", NA, NA, NA),
        CountyID     = c(111, NA, NA, NA))
    cs <- data.frame(
        ZipCode      = c(11, 22, 22, 33, 55),
        Name         = c("eh", "eh?", "eh?", "eh!?", "ah."),
        PrefName     = c("One", "To", "Two", "Three", "Five"),
        CountyID     = c(111, 222, 222, 333, 555),
        PrimRec      = c(TRUE, FALSE, TRUE, TRUE, TRUE),
        CityStateKey = c(1, 2, 2, 3, 5))
    expected <- data.frame(
        CityStateKey = c(1, 2, 3),
        ZipCode      = c(11, 22, 33),
        Business     = c(1, 1, 1),
        Residential  = c(1, 1, 1),
        PrefName     = c("One", "Two", "Three"),
        CountyID     = c(111, 222, 333))
    expect_equal(getMissingCounties(ds, cs), expected)
}

# run the test
test_getMissingCounties()

The result is:

CityStateKey ZipCode Business Residential PrefName CountyID
           1      11        1           1      One      111
           2      22        1           1     <NA>      222
           3      33        1           1     <NA>      333

Why is PrefName being set to NA by the assignment, and how can I do the assignment so that I don't lose data?

Answer 1

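The short answer is that you can avoid this problem by making sure there are no factors in your data frames. You do this by passing stringsAsFactors=FALSE in the call to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...), also convert character columns to factors by default (in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE). You can defeat this behavior in the same way.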
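As a rough illustration of the factor-free approach, here is a minimal sketch on a cut-down version of the question's data. It uses base R's match() as a stand-in for the plyr join, and the three-row ds and two-row fillings data frames are simplified assumptions, not the asker's full test case.

```r
# Build both data frames with character columns instead of factors.
ds <- data.frame(
    ZipCode  = c(11, 22, 33),
    PrefName = c("One", NA, NA),
    CountyID = c(111, NA, NA),
    stringsAsFactors = FALSE
)
fillings <- data.frame(
    ZipCode  = c(22, 33),
    PrefName = c("Two", "Three"),
    CountyID = c(222, 333),
    stringsAsFactors = FALSE
)

# Rows of ds that still need filling.
cavities <- which(is.na(ds$CountyID))

# Look up each missing row's ZipCode in fillings (stand-in for the join).
idx <- match(ds$ZipCode[cavities], fillings$ZipCode)

# With character columns there are no level sets to clash,
# so the subset assignment keeps the names.
ds[cavities, c("PrefName", "CountyID")] <- fillings[idx, c("PrefName", "CountyID")]
print(ds)
```

Because PrefName is a plain character vector on both sides, the assignment copies the strings directly and nothing is silently turned into NA.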

The problem is actually quite subtle, and it is a good example of how R's "silent coercion" between data types creates all sorts of problems.

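By default (in R versions before 4.0.0), the data.frame(...) function converts any character vector to a factor. So in your code, ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement: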

ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")

the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.

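Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, the values whose levels are missing from the target are set to NA. Consider this: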

x <- c("A","B",NA,NA,NA)    # character vector
y <- LETTERS[1:5]           # character vector
class(x); class(y)
# [1] "character"
# [1] "character"

# stringsAsFactors=TRUE was the default before R 4.0.0; set explicitly here
df <- data.frame(x, y, stringsAsFactors=TRUE)   # x and y coerced to factor
sapply(df,class)            # df$x and df$y are factors
#        x        y
# "factor" "factor"

# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2]      # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
#   invalid factor level, NA generated
df                          # missing levels set to NA
#      x y
# 1    A A
# 2    B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E

The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.

# assign all of col 2 to col 1
df <- data.frame(x, y, stringsAsFactors=TRUE)
df[,1] <- df[,2]            # succeeds!!
df
#   x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E

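This works, because the whole column (levels included) is replaced rather than modified element by element.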

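Finally, a note on debugging: if you are debugging a function, it is sometimes useful to run through the statements line by line at the command line (e.g., in the global environment). If you had done that, you would have seen the warning above, whereas inside the function call the warning was suppressed.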
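If stepping through by hand is inconvenient, warnings can also be captured programmatically. The following is a minimal sketch using base R's withCallingHandlers(); collect_warnings and fill_from are hypothetical helper names introduced only for this illustration.

```r
# Hypothetical helper: evaluate an expression and collect any warnings
# it raises instead of letting them scroll past (or be swallowed).
collect_warnings <- function(expr) {
    msgs <- character(0)
    value <- withCallingHandlers(
        expr,
        warning = function(w) {
            msgs <<- c(msgs, conditionMessage(w))
            invokeRestart("muffleWarning")
        }
    )
    list(value = value, warnings = msgs)
}

# Hypothetical function reproducing the factor problem in miniature.
fill_from <- function() {
    df <- data.frame(x = factor(c("A", "B", NA)),
                     y = factor(c("A", "B", "C")))
    df[3, 1] <- df[3, 2]   # "C" is not a level of x
    df
}

res <- collect_warnings(fill_from())
print(res$warnings)
print(res$value)
```

Another option while debugging is options(warn = 2), which turns every warning into an error so that execution stops at the offending statement.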

Answer 2

The constraints of the test can be satisfied by reimplementing getMissingCounties with:

merge(ds[1:4], subset(subset(cs, PrimRec)[c(1, 3, 4)]), by="ZipCode")

Caveat: the ZipCode column is always emitted first, which differs from the expected result.
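One way to deal with the column order is to re-index the merged result by name. This is a small sketch on made-up two-row data; the variable names merged and wanted are assumptions for illustration only.

```r
# Cut-down stand-ins for ds[1:4] and the primary-record fillings.
ds <- data.frame(CityStateKey = c(1, 2),
                 ZipCode      = c(11, 22),
                 Business     = c(1, 1),
                 Residential  = c(1, 1))
fillings <- data.frame(ZipCode  = c(11, 22),
                       PrefName = c("One", "Two"),
                       CountyID = c(111, 222))

# merge() puts the key column first, then the rest of x, then the rest of y.
merged <- merge(ds, fillings, by = "ZipCode")
print(names(merged))

# Restore the column order the test expects by selecting columns by name.
wanted <- c("CityStateKey", "ZipCode", "Business", "Residential",
            "PrefName", "CountyID")
merged <- merged[, wanted]
print(merged)
```

Note that merge() performs an inner join by default, so rows of ds with no match in fillings are dropped, which is what the test's final filtering step requires anyway.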

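But to answer the sub-assignment question: it breaks because the level sets of PrefName in ds and in cs are not compatible. Either avoid using factors or relevel them. You probably missed R's warning about this because the test is somehow suppressing warnings.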
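If the columns have to stay factors, the level sets can be made compatible before assigning. The sketch below is one possible approach, not necessarily what this answer had in mind: relevel() itself only changes which level comes first, so the example extends the target's levels with levels<- instead. The vectors x and y are made-up stand-ins for ds$PrefName and the fillings.

```r
x <- factor(c("One", NA, NA))            # one level, like ds$PrefName
y <- factor(c("One", "Two", "Three"))    # more levels, like the fillings

# Extend the target's level set to the union of both before assigning.
levels(x) <- union(levels(x), levels(y))

# Every incoming value now has a matching level, so nothing becomes NA.
x[2:3] <- y[2:3]
print(x)
```

The same idea applies to a data frame column: set levels(ds$PrefName) to the union first, then do the subset assignment.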

Assigning values to a subset of a data frame in R
