This is related to "Are there more elegant ways to transform ragged data into a tidy dataframe".
Why does the following code not work?
events = structure(list(date = structure(c(-714974, -714579, -717835), class = "Date"),
days = c(1, 6, 0.5), name = c("Intro to stats", "Stats Winter school",
"TidyR tools"), topics = c("probability|R", "R|regression|ggplot",
"tidyR|dplyr")), .Names = c("date", "days", "name", "topics"
), row.names = c(NA, -3L), class = "data.frame")
> newdf <- data.frame(topic=character(), days=character())
> for(i in 1:length(events$topics)){
+ xx = unlist(strsplit(events$topics[i],'\\|'))
+ for(j in 1:length(xx)){
+ yy = c(xx[j], events$days[i]/length(xx))
+ print(yy)
+ newdf=rbind(newdf, yy)
+ }
+ }
[1] "probability" "0.5"
[1] "R" "0.5"
[1] "R" "2"
[1] "regression" "2"
[1] "ggplot" "2"
[1] "tidyR" "0.25"
[1] "dplyr" "0.25"
There were 11 warnings (use warnings() to see them)
> newdf
X.probability. X.0.5.
1 probability 0.5
2 <NA> 0.5
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
>
> warnings()
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA ... :
invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
3: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
4: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
5: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
6: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
7: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
8: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
9: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
10: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, NA, ... :
invalid factor level, NAs generated
11: In `[<-.factor`(`*tmp*`, ri, value = structure(c(1L, 1L, ... :
invalid factor level, NAs generated
>
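As you can see, the values of yy print fine, but rbind does not work. Where is the error, and how can it be corrected? Thanks for your help.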
Answer 0 (score: 5)
You could try:
newdf <- data.frame(topic=character(), daysPerTopic=character(), stringsAsFactors=F)
for(i in 1:length(events$topics)){
xx = unlist(strsplit(events$topics[i],'\\|'))
for(j in 1:length(xx)){
yy = data.frame(topic=xx[j], daysPerTopic=events$days[i]/length(xx), stringsAsFactors=F)
newdf <- rbind(newdf, yy)
}
}
newdf
# topic daysPerTopic
# 1 probability 0.50
# 2 R 0.50
# 3 R 2.00
# 4 regression 2.00
# 5 ggplot 2.00
# 6 tidyR 0.25
# 7 dplyr 0.25
Or:
op <- options(stringsAsFactors=F) #set to F
#Your code
newdf <- data.frame(topic=character(), days=character())
for(i in 1:length(events$topics)){
xx = unlist(strsplit(events$topics[i],'\\|'))
for(j in 1:length(xx)){
yy = c(xx[j], events$days[i]/length(xx))
print(yy)
newdf=rbind(newdf, yy)
}
}
newdf
# X.probability. X.0.5.
# 1 probability 0.5
# 2 R 0.5
# 3 R 2
# 4 regression 2
# 5 ggplot 2
# 6 tidyR 0.25
# 7 dplyr 0.25
options(op) # set back to default
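The same table can also be built without growing a data frame inside a loop. The following is only a sketch in base R, not something taken from the answers here: it assumes R 3.2.0 or later for lengths(), rebuilds just the days and topics columns of events so that it runs on its own, and uses the names parts and n purely for illustration.

```r
# Minimal copy of the question's data (only the columns needed here)
events <- data.frame(
  days   = c(1, 6, 0.5),
  topics = c("probability|R", "R|regression|ggplot", "tidyR|dplyr"),
  stringsAsFactors = FALSE
)

# Split every topics string on the literal "|" character
parts <- strsplit(events$topics, "|", fixed = TRUE)
# Number of topics listed for each event
n <- lengths(parts)

# One row per topic; each event's days are shared equally among its topics
newdf <- data.frame(
  topic        = unlist(parts),
  daysPerTopic = rep(events$days / n, times = n),
  stringsAsFactors = FALSE
)
newdf
```

Because every column is created in one step and stringsAsFactors = FALSE is passed explicitly, no factor conversion takes place along the way.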
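Answer 1 (score: 5)
Have you tried debugging your for loop? For example, by adding print(class(yy)) and print(str(newdf)) inside it, you will see that after the first iteration both columns of newdf become factors.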
# [1] "probability" "0.5"
# [1] "character"
# 'data.frame': 0 obs. of 2 variables:
# $ topic: Factor w/ 0 levels:
# $ days : Factor w/ 0 levels:
# NULL
# [1] "R" "0.5"
# [1] "character"
# 'data.frame': 1 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1
# $ X.0.5. : Factor w/ 1 level "0.5": 1
# NULL
# [1] "R" "2"
# [1] "character"
# 'data.frame': 2 obs. of 2 variables:
# $ X.probability.: Factor w/ 1 level "probability": 1 NA
# $ X.0.5. : Factor w/ 1 level "0.5": 1 1
...
You will say, "but I defined them as character". That is true, but if you read the rbind documentation, you will see:
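For cbind (rbind), vectors of zero length (including NULL) are ignored unless the result would have zero rows (columns), for S compatibility. (Zero-extent matrices do not occur in S3 and are not ignored in R.)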
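The quoted rule is easy to check on plain vectors. This is just a small sketch; the name m is illustrative and not part of the question's code.

```r
# NULL and zero-length vectors are dropped when binding rows
m <- rbind(c(1, 2), NULL, numeric(0))
dim(m)  # a single row with two columns remains
```

The empty newdf in the question seems to be treated in a similar way by the data frame method, which would explain why its column names topic and days do not survive the first rbind call.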
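Another property of rbind is that it inherits its behaviour from data.frame, and part of that behaviour is stringsAsFactors == TRUE (the default in R versions before 4.0.0).
What happens here can be easily illustrated with a dummy example. Consider: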
temp <- data.frame(A = letters[1:3])
str(temp)
## 'data.frame': 3 obs. of 1 variable:
## $ A: Factor w/ 3 levels "a","b","c": 1 2 3
temp$A[3] <- "d"
## Warning message:
## In `[<-.factor`(`*tmp*`, 3, value = c(1L, 2L, NA)) :
## invalid factor level, NA generated
temp$A
## [1] a b <NA>
## Levels: a b c
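You can see two things here.
First, data.frame automatically converts character vectors to factors.
Second, when you assign a value that is not an existing level to a factor vector, it is converted to NA and R throws the exact warning you received.
As @akrun mentioned, setting options(stringsAsFactors=F) will solve your problem.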
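To make the two observations concrete, here is a small sketch of two ways to assign a new value without producing NA: converting the column to character, or registering the new level first. The names temp and f are illustrative, and stringsAsFactors = TRUE is passed explicitly so that the example behaves the same whatever the default is in your R version.

```r
# Force the factor conversion explicitly, independent of the R version
temp <- data.frame(A = letters[1:3], stringsAsFactors = TRUE)
is.factor(temp$A)

# Option 1: work with a character column instead of a factor
temp$A <- as.character(temp$A)
temp$A[3] <- "d"
temp$A

# Option 2: keep the factor, but add the new level before assigning
f <- factor(letters[1:3])
levels(f) <- c(levels(f), "d")
f[3] <- "d"
f
```

Option 1 is what setting stringsAsFactors to FALSE achieves for the whole data frame; Option 2 is only worth the effort if you really need the column to stay a factor.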
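Answer 2 (score: 3)
Set options(stringsAsFactors = FALSE) and your code should work as expected. The warnings and NAs in the result are caused by the implicit conversion to factor and the resulting type mismatch between the columns of newdf and yy; see https://stackoverflow.com/a/1640729/1541036.
For a more concise way to get the same result, here is a solution using data.table: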
library(data.table)
events <- as.data.table(events)
events2 <- events[, list(topic=unlist(strsplit(topics, '|', fixed=TRUE))), by=c("date", "days", "name")]
events2[, probability := days / .N, by=name]