read_csv from the readr package handles data differently than generated data

Time: 2015-07-09 23:50:07

Tags: r dplyr

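I am having trouble using ifelse() inside a function, a problem that was originally resolved in this Stack Overflow thread. After implementing the suggestions there, the code performed exactly as desired. The code is below:
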
country_panel <- function(x, y) {
  ifelse(cnames$time < y, 
    cnames[match(x, cnames$country),]$panel,
    cnames[match(x, cnames$country),]$standardize
  )
}

The fake data was generated with this:

countryname <- c("Viet Nam", "Viet Nam", "Viet Nam", "Viet Nam", "Viet Nam")
year <- c(1974, 1975, 1976, 1977, 1978)

df <- data.frame(countryname, year, stringsAsFactors=FALSE)

country <- c("Vietnam, North", "Vietnam, N.", "Vietnam North", "Viet Nam", "Democratic Republic Of Vietnam")
standardize <- c("Vietnam, Democratic Republic of", "Vietnam, Democratic Republic of", "Vietnam, Democratic Republic of", "Vietnam, Democratic Republic of", "Vietnam, Democratic Republic of")
panel <- c("Vietnam", "Vietnam","Vietnam","Vietnam","Vietnam")
time <- c(1976,1976,1976,1976,1976)

cnames <- data.frame(country, standardize, panel, time, stringsAsFactors = FALSE)

The function was evaluated with:

d1 <- df %>%
  mutate(new_name = country_panel(countryname, year))

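However, when I apply the suggestion to the real data, the problem returns: the function does not seem to evaluate the condition in the ifelse statement and only returns the $panel value.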

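Because the fake data was built with data.frame and stringsAsFactors = FALSE, I thought that reading the real data with read.csv(PATH, stringsAsFactors = FALSE) instead of read_csv would fix this, but both behave the same.

I should also note that I checked the attributes of each vector in the data frame using str() and coerced them to match what I found in the fake data.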
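The type check described above can be sketched as follows. This is only an illustration: the inline CSV text is a hypothetical two-row stand-in for the real file, and from_csv, fake and differs are names made up for the example.

```r
# Hypothetical two-row extract standing in for the real CSV file
csv_text <- "country,standardize,time,panel
AFGHANISTAN,Afghanistan,2015,Afghanistan
albania,Albania,2015,Albania"

# read.csv accepts the file contents directly through its text argument
from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One-row fake table built the same way as in the question
fake <- data.frame(country = "Viet Nam",
                   standardize = "Vietnam, Democratic Republic of",
                   time = 1976,
                   panel = "Vietnam",
                   stringsAsFactors = FALSE)

# Compare column classes; both tables list their columns in the same order
csv_classes <- sapply(from_csv, class)
fake_classes <- sapply(fake, class)
differs <- names(csv_classes)[csv_classes != fake_classes]
print(differs)

# Coerce the differing column so both tables agree
from_csv$time <- as.numeric(from_csv$time)
```

Here only the time column differs (integer when read from CSV, double in the fake data), which would not by itself explain a difference in how ifelse behaves.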

The real data and a script that reproduces everything can be found on GitHub here.

Here is dput(head(cnames)):

structure(list(country = c("AFGHANISTAN", "Afghanistan", "albania", "ALBANIA", "Albania", "ALGERIA"), standardize = c("Afghanistan", "Afghanistan", "Albania", "Albania", "Albania", "Algeria"), time = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L), panel = c("Afghanistan", "Afghanistan", "Albania", "Albania", "Albania", "Algeria")), .Names = c("country", "standardize", "time", "panel"), class = c("tbl_df", "data.frame" ), row.names = c(NA, -6L))

dput(head(d1))

structure(list(countryname = c("Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"), year = 1970:1975), .Names = c("countryname", "year"), class = c("tbl_df", "data.frame"), row.names = c(NA, -6L))


1 answer:

Answer 0 (score: 0)

d1 <- df %>% 
  mutate(new_name = country_panel(countryname, year))
df2 <- structure(list(country = c("AFGHANISTAN", "Afghanistan", "albania",
  "ALBANIA", "Albania", "ALGERIA"), standardize = c("Afghanistan", 
  "Afghanistan", "Albania", "Albania", "Albania", "Algeria"), time = c(2015L, 
  2015L, 2015L, 2015L, 2015L, 2015L), panel = c("Afghanistan", 
  "Afghanistan", "Albania", "Albania", "Albania", "Algeria")), .Names = c("country", 
  "standardize", "time", "panel"), class = c("tbl_df", "data.frame"
  ), row.names = c(NA, -6L))

d2 <- df2 %>% 
  mutate(new_name = country_panel(countryname, year))

This gives:

Error: wrong result size (5), expected 6 or 1

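The immediate problem is that mutate expects country_panel to return 6 values, because df2 has 6 rows (see dim(df2)), or a single value, which it would recycle as needed. In fact, the first example with the supplied data only works because the row counts happen to match.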
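The length mismatch can be reproduced in base R without dplyr. The sketch below uses made-up vectors (lookup_time and query_year are hypothetical names) to show that ifelse returns one element per element of its test, so the result follows the size of the lookup table rather than the size of the data being mutated.

```r
# Hypothetical lookup column with 6 rows, like head(cnames)$time
lookup_time <- rep(2015L, 6)

# Two query years, as if the data frame passed to mutate had only 2 rows
query_year <- c(1970L, 1971L)

# The comparison recycles query_year up to the longer vector,
# so the test (and therefore the result) has 6 elements, not 2
res <- ifelse(lookup_time < query_year, "panel", "standardize")
print(length(res))
```

A result of that length cannot be assigned to a column of a 2-row data frame, which is the kind of size error mutate reports.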

Try running your example again after running:

debug(country_panel)
...
# after done:
undebug(country_panel)

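This gives you a line-by-line view of the function as it is called, and lets you inspect every object that exists or is created while the function runs (quit at any time with Q).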

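Rather than using ifelse on the whole lookup table, it might be better to match sequentially: first on country, then on time. Alternatively, you could build a data frame from the x and y vectors passed to the function, merge it with cnames, and then select the desired name based on conditions within that data frame.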
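One possible reading of the sequential matching suggestion is sketched below. It is not the answerer's code: country_panel_matched and its lookup argument are hypothetical names, and the sketch assumes every country appears at most once in the lookup table, or that the first match is the one wanted.

```r
# Sketch only: country_panel_matched and its lookup argument are hypothetical names
country_panel_matched <- function(x, y, lookup) {
  # Step 1: find the lookup row for each country name (NA when there is no match)
  idx <- match(x, lookup$country)
  # Step 2: compare each matched row's own cutoff year with the matching element of y
  ifelse(lookup$time[idx] < y, lookup$panel[idx], lookup$standardize[idx])
}

# Fake lookup table from the question
cnames <- data.frame(
  country = c("Vietnam, North", "Vietnam, N.", "Vietnam North", "Viet Nam",
              "Democratic Republic Of Vietnam"),
  standardize = rep("Vietnam, Democratic Republic of", 5),
  panel = rep("Vietnam", 5),
  time = rep(1976, 5),
  stringsAsFactors = FALSE
)

result <- country_panel_matched(rep("Viet Nam", 5), 1974:1978, cnames)
print(result)
```

Because the test inside ifelse is indexed by idx, the result has one element per element of x, independent of how many rows the lookup table has. Names without a match come back as NA.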