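I am trying to run the spread function from tidyr on a dataset that contains the origin and destination names of airplane journeys along with their passenger counts. I want to build a table that can eventually be used for a heat map, so I would like the Origin variable in the rows and the Destination variable as the columns.
I have tried running the code with different combinations of arguments, and also with spread_, but I always get an error.
If I use spread_ with key_col and val_col, I get: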
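Error in match(x, table, nomatch = 0L) : object 'DestinationRegion' not found
On my large dataset, it produces a different kind of error:
Error in `colnames<-`(`*tmp*`, value = c("ASIA SUB-CONTINENT", "AUSTRALIA", : length of 'dimnames' [2] not equal to array extent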
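This is my first time using tidyr and I am only getting to know these packages, but this does not look too complicated. Still, I have been working on this problem for hours and could not find an answer on any forum.
Thanks for your help.
Here is an example of the type of data: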
data2<-matrix(NA, nrow = 7, ncol=3)
colnames(data2)<-c("Origin.Destination", "Total.Passengers", "Destination.Region")
data2[,1] <- c("EAST AFRICA","SOUTHERN AFRICA","WEST AFRICA", "EAST AFRICA", "SOUTHERN AFRICA", "EAST AFRICA","EAST AFRICA")
data2[,2] <- c(100, 5000, 200, 10000, 200, 20, 4000)
data2[,3] <- c("WESTERN EUROPE", "SOUTH AMERICA", "ASIA", "SOUTH AMERICA", "ASIA", "WESTERN EUROPE", "WESTERN EUROPE")
data2 <- data.frame(data2)
Here is my code:
DF<-
data2 %>%
spread_(key_ = "Destination.Region",
value_ = "Total.Passengers",
convert = TRUE,
drop = FALSE)
Answer 0 (score: 0)
Here are a few things to try:
1) I would convert data2 to a data.frame. That makes it easier to work with.
data2<-matrix(NA, nrow = 7, ncol=3)
colnames(data2)<-c("Origin.Destination", "Total.Passengers", "Destination.Region")
data2[,1] <- c("EAST AFRICA","SOUTHERN AFRICA","WEST AFRICA", "EAST AFRICA", "SOUTHERN AFRICA", "EAST AFRICA","EAST AFRICA")
data2[,2] <- c(100, 5000, 200, 10000, 200, 20, 4000)
data2[,3] <- c("WESTERN EUROPE", "SOUTH AMERICA", "ASIA", "SOUTH AMERICA", "ASIA", "WESTERN EUROPE", "WESTERN EUROPE")
data3<-data.frame(data2)
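One thing to watch with this conversion: a matrix holds a single type, so mixing names and counts turns every cell into a string, and `Total.Passengers` is no longer numeric after `data.frame()`. A minimal sketch of checking and fixing that on a shortened copy of the example data (the names `m` and `df` are made up for illustration):

```r
# A matrix stores one type only, so numbers mixed with text become strings.
m <- matrix(NA, nrow = 3, ncol = 2)
colnames(m) <- c("Origin.Destination", "Total.Passengers")
m[, 1] <- c("EAST AFRICA", "SOUTHERN AFRICA", "WEST AFRICA")
m[, 2] <- c(100, 5000, 200)

df <- data.frame(m)
is.numeric(df$Total.Passengers)  # FALSE: character or factor, depending on the R version

# Going through as.character() first is safe for both character and factor columns.
df$Total.Passengers <- as.numeric(as.character(df$Total.Passengers))
sum(df$Total.Passengers)
```

Building the data.frame directly from vectors, rather than from a matrix, avoids the problem altogether.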
2) The new data.frame needs an explicit column (usually an index column) that uniquely identifies each row for the spread_ function to work. Otherwise:
DF<-
data3 %>%
spread_(key_ = "Destination.Region",
value_ = "Total.Passengers",
convert = TRUE,
drop = FALSE)
Error: Duplicate identifiers for rows (1, 6, 7)
But if you add one:
data3$index<-1:nrow(data3)
DF<-
data3 %>%
spread_(key_ = "Destination.Region",
value_ = "Total.Passengers",
convert = TRUE,
drop = FALSE)
DF
Origin.Destination index ASIA SOUTH AMERICA WESTERN EUROPE
1 EAST AFRICA 1 NA NA 100
2 EAST AFRICA 2 NA NA NA
3 EAST AFRICA 3 NA NA NA
4 EAST AFRICA 4 NA 10000 NA
5 EAST AFRICA 5 NA NA NA
6 EAST AFRICA 6 NA NA 20
7 EAST AFRICA 7 NA NA 4000
8 SOUTHERN AFRICA 1 NA NA NA
9 SOUTHERN AFRICA 2 NA 5000 NA
10 SOUTHERN AFRICA 3 NA NA NA
11 SOUTHERN AFRICA 4 NA NA NA
12 SOUTHERN AFRICA 5 200 NA NA
13 SOUTHERN AFRICA 6 NA NA NA
14 SOUTHERN AFRICA 7 NA NA NA
15 WEST AFRICA 1 NA NA NA
16 WEST AFRICA 2 NA NA NA
17 WEST AFRICA 3 200 NA NA
18 WEST AFRICA 4 NA NA NA
19 WEST AFRICA 5 NA NA NA
20 WEST AFRICA 6 NA NA NA
21 WEST AFRICA 7 NA NA NA
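The `Duplicate identifiers` error above is raised because several rows share the same origin/destination pair, so `spread_` cannot tell which value belongs in a given cell. A small base R sketch for listing the offending rows (the names `keys` and `dup` are made up for illustration):

```r
Origin.Destination <- c("EAST AFRICA", "SOUTHERN AFRICA", "WEST AFRICA", "EAST AFRICA",
                        "SOUTHERN AFRICA", "EAST AFRICA", "EAST AFRICA")
Destination.Region <- c("WESTERN EUROPE", "SOUTH AMERICA", "ASIA", "SOUTH AMERICA",
                        "ASIA", "WESTERN EUROPE", "WESTERN EUROPE")
keys <- data.frame(Origin.Destination, Destination.Region)

# Flag every row whose origin/destination pair occurs more than once.
dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
which(dup)  # 1 6 7, the rows named in the error message
```

Running a check like this on the full dataset shows whether an index column or an aggregation step is needed before spreading.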
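What probably makes more sense here is to sum the total passengers by origin and destination. This avoids the need for an index and prevents so many NAs: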
library(dplyr)  # group_by(), summarise(), %>%
library(tidyr)  # spread()
Origin <- c("EAST AFRICA","SOUTHERN AFRICA","WEST AFRICA", "EAST AFRICA", "SOUTHERN AFRICA", "EAST AFRICA","EAST AFRICA")
Passengers <- c(100, 5000, 200, 10000, 200, 20, 4000)
Destination <- c("WESTERN EUROPE", "SOUTH AMERICA", "ASIA", "SOUTH AMERICA", "ASIA", "WESTERN EUROPE", "WESTERN EUROPE")
data3<-data.frame(Origin, Passengers, Destination)
DF<-data3 %>% group_by(Origin, Destination) %>%
summarise(Total.Passengers = sum(Passengers)) %>%
spread(Destination, Total.Passengers)
DF
Origin ASIA SOUTH AMERICA WESTERN EUROPE
(fctr) (dbl) (dbl) (dbl)
1 EAST AFRICA NA 10000 4120
2 SOUTHERN AFRICA 200 5000 NA
3 WEST AFRICA 200 NA NA
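For the heat map the question is ultimately after, a numeric matrix is often the handier shape. As a sketch of an alternative that is not part of the answer above, base R's `xtabs` sums the passengers for each origin/destination pair in one step and fills pairs with no journeys with 0 rather than NA (the names `tab` and `mat` are made up for illustration):

```r
Origin <- c("EAST AFRICA", "SOUTHERN AFRICA", "WEST AFRICA", "EAST AFRICA",
            "SOUTHERN AFRICA", "EAST AFRICA", "EAST AFRICA")
Passengers <- c(100, 5000, 200, 10000, 200, 20, 4000)
Destination <- c("WESTERN EUROPE", "SOUTH AMERICA", "ASIA", "SOUTH AMERICA",
                 "ASIA", "WESTERN EUROPE", "WESTERN EUROPE")
data3 <- data.frame(Origin, Passengers, Destination)

# Sum Passengers for every Origin x Destination combination.
tab <- xtabs(Passengers ~ Origin + Destination, data = data3)

# Copy the counts into a plain numeric matrix, keeping row and column names.
mat <- matrix(as.vector(tab), nrow = nrow(tab), dimnames = dimnames(tab))
mat["EAST AFRICA", "WESTERN EUROPE"]  # 100 + 20 + 4000 = 4120

# heatmap(mat, Rowv = NA, Colv = NA, scale = "none")  # uncomment to draw it
```

Whether 0 or NA is the better filler depends on the plot: 0 says "no passengers", while NA leaves the cell blank.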