Reshaping a data.frame to add two columns

Time: 2017-04-27 01:13:13

Tags: r reshape

I have a data.frame that looks like this:

    timestamp value.x station value.y   parameter.x   value parameter.y
1   1/1/2010  0.6     abc     188,000   AREA PLANTED    22  PROGRESS
2   1/1/2010  0.6     abc     156.3     YIELD           NA  NA
3   1/1/2010  -10     def     188,000   AREA PLANTED    22  PROGRESS
4   1/1/2010  -10     def     156.3     YIELD           NA  NA

I would like to use reshape to make it look like this:

    timestamp   value.x station AREA PLANTED    YIELD   PROGRESS
1   1/1/2010    0.6     abc     188,000         156.3   22       
3   1/1/2010    -10     def     188,000         156.3   22

I tried

reshape(data = b, varying = list(c('value.y', 'parameter.x', 'value', 'parameter.y')), 
        v.names = c('AREA PLANTED', 'YIELD', 'PROGRESS'), 
        timevar = row.names(b), 
        times = b$timestamp, direction = 'wide', idvar = b$station)

but it says

Error in `[.data.frame`(data, , idvar) : undefined columns selected

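I have tried changing the call around, but no matter what I do it keeps throwing this error.

3 Answers:

Answer 0: (score: 2)

This uses reshape2. I don't think the data frame can be cast in a single step. Note that the input looks like the result of some earlier join (some of the names carry .x and .y suffixes). I imagine that join could be improved to avoid this complication.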

df <- read.table(header=TRUE, stringsAsFactors = FALSE, text = 
"timestamp value.x station value.y   parameter.x   value parameter.y
1/1/2010  0.6     abc     188,000   AREAPLANTED    22  PROGRESS
1/1/2010  0.6     abc     156.3     YIELD           NA  NA
1/1/2010  -10     def     188,000   AREAPLANTED    22  PROGRESS
1/1/2010  -10     def     156.3     YIELD           NA  NA
")

library(reshape2)

# extract the last two columns into a variable/value and make unique
df1 <- unique(df[!is.na(df$value),c("timestamp", "value.x", "station", "parameter.y", "value")])
names(df1) <- c("timestamp", "value.x", "station", "variable", "value")

# extract columns 4,5 into a variable value
df2 <- df[,c("timestamp", "value.x", "station", "parameter.x", "value.y")]
names(df2) <- c("timestamp", "value.x", "station", "variable", "value")

# cast
dcast(rbind(df1, df2), timestamp + value.x + station ~ variable, value.var = "value")

#   timestamp value.x station AREAPLANTED PROGRESS YIELD
# 1  1/1/2010   -10.0     def     188,000       22 156.3
# 2  1/1/2010     0.6     abc     188,000       22 156.3
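
For readers who would rather not load reshape2, the same stack-then-cast idea can be sketched in base R. This is only an illustration under assumptions: the data frame b below is retyped by hand from the question (with "AREA PLANTED" kept as a single value), and the helper names ids, long and wide are mine, not the asker's.

```r
# Sample data rebuilt by hand from the question (an assumption about its types).
b <- data.frame(
  timestamp   = "1/1/2010",
  value.x     = c(0.6, 0.6, -10, -10),
  station     = c("abc", "abc", "def", "def"),
  value.y     = c("188,000", "156.3", "188,000", "156.3"),
  parameter.x = c("AREA PLANTED", "YIELD", "AREA PLANTED", "YIELD"),
  value       = c(22, NA, 22, NA),
  parameter.y = c("PROGRESS", NA, "PROGRESS", NA),
  stringsAsFactors = FALSE
)

ids <- c("timestamp", "value.x", "station")

# Stack both (parameter, value) pairs into one long table.
# value is converted to character so both pieces have the same column type.
long <- rbind(
  data.frame(b[ids], variable = b$parameter.x, value = b$value.y,
             stringsAsFactors = FALSE),
  data.frame(b[ids], variable = b$parameter.y, value = as.character(b$value),
             stringsAsFactors = FALSE)
)
# Drop the rows that carry no second parameter, then de-duplicate.
long <- unique(long[!is.na(long$variable), ])

# Long to wide: idvar and timevar are column names given as strings.
wide <- reshape(long, idvar = ids, timevar = "variable",
                v.names = "value", direction = "wide")

# Remove the "value." prefix reshape() adds, leaving the id column value.x alone.
names(wide) <- sub("^value\\.(?!x$)", "", names(wide), perl = TRUE)
rownames(wide) <- NULL
wide
```

Because everything is stacked before the reshape, a single reshape() call is enough and no merge is needed afterwards.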

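Answer 1: (score: 2)

Staying in base R, consider a merge of two reshaped data frames. Your current call uses arguments meant for wide-to-long reshaping (varying, times), not the long-to-wide reshaping you need.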

mdf <- merge(
  reshape(b, timevar="parameter.x",
        v.names = c("value.y"),
        idvar = c("timestamp", "value.x", "station"),
        direction = "wide",
        drop = c("value", "parameter.y")),

  reshape(b[!is.na(b$value),], timevar="parameter.y",
        v.names = c("value"),
        idvar = c("timestamp", "value.x", "station"),
        direction = "wide",
        drop = c("value.y", "parameter.x")),
  by=c("timestamp", "value.x", "station")
)

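# strip the "value.y." and "value." prefixes added by reshape, but leave the id column value.x alone
names(mdf) <- sub("^value\\.(y\\.)?(?!x$)", "", names(mdf), perl = TRUE)

mdf
#   timestamp value.x station AREA PLANTED YIELD PROGRESS
# 1  1/1/2010   -10.0     def      188,000 156.3       22
# 2  1/1/2010     0.6     abc      188,000 156.3       22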
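
As for the error in the question itself, my reading is that it comes from passing the column (b$station) where reshape expects a column name: the station values are then looked up as if they were column names. A minimal sketch on made-up data (the data frame and its column names below are hypothetical, chosen only to illustrate the point):

```r
# Toy long-format data; names and numbers are invented for illustration.
long <- data.frame(
  station   = c("abc", "abc", "def", "def"),
  parameter = c("YIELD", "AREA", "YIELD", "AREA"),
  reading   = c(156.3, 188, 150.1, 175),
  stringsAsFactors = FALSE
)

# Passing the column itself, as in the question, fails: reshape() then
# tries to select columns named "abc" and "def".
bad <- try(
  reshape(long, idvar = long$station, timevar = "parameter",
          v.names = "reading", direction = "wide"),
  silent = TRUE
)
inherits(bad, "try-error")

# Passing the column name as a string works.
good <- reshape(long, idvar = "station", timevar = "parameter",
                v.names = "reading", direction = "wide")
names(good)
```

The same applies to timevar, which should also be a single column name rather than row.names(b).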

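Answer 2: (score: 0)

I agree with @epi99: the task needs to be broken down into steps and then recombined. Here is a tidyverse way, assuming your data frame is called b, as in your sample code: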

library(tidyverse)
b = read.csv("C:\\Temp\\stack_overflow_sample_data_which_I_hacked_together_in_Excel.csv")
df1 = b %>% select(timestamp, value.x, station, value.y, parameter.x) %>% spread(key = parameter.x, value = value.y)
df2 = b %>% select(timestamp, value.x, station, value, parameter.y) %>% filter(!is.na(value)) %>% spread(key = parameter.y, value = value)
df.answer = merge(df1, df2, by = c("timestamp", "value.x", "station"))
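
Since the CSV above only exists on my machine, here is a self-contained sketch of the same two-step idea. It assumes tidyr 1.0.0 or later and swaps spread for pivot_wider, its successor; the data frame b is retyped by hand from the question, so treat its types as an assumption.

```r
library(tidyr)

# Sample data rebuilt by hand from the question (an assumption about its types).
b <- data.frame(
  timestamp   = "1/1/2010",
  value.x     = c(0.6, 0.6, -10, -10),
  station     = c("abc", "abc", "def", "def"),
  value.y     = c("188,000", "156.3", "188,000", "156.3"),
  parameter.x = c("AREA PLANTED", "YIELD", "AREA PLANTED", "YIELD"),
  value       = c(22, NA, 22, NA),
  parameter.y = c("PROGRESS", NA, "PROGRESS", NA),
  stringsAsFactors = FALSE
)

ids <- c("timestamp", "value.x", "station")

# First pair: parameter.x supplies the new column names, value.y the cells.
df1 <- pivot_wider(b[c(ids, "parameter.x", "value.y")],
                   names_from = parameter.x, values_from = value.y)

# Second pair: drop the rows without a value before widening.
df2 <- pivot_wider(b[!is.na(b$value), c(ids, "parameter.y", "value")],
                   names_from = parameter.y, values_from = value)

# Recombine on the shared id columns.
df.answer <- merge(df1, df2, by = ids)
df.answer
```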