Suppose I have the following dataset:
LA NY MA
1 2 3
4 5 6
3 5
4
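(In other words, the rows do not all have the same structure: LA has 3 values, NY has 4 values, and so on.)
I am trying to use lm to run an ANOVA test (to determine whether the mean is the same in each state), and it keeps reporting an error because the rows do not match. One idea I was given is to convert the data to a 2-column format. Which command/package should I use for that task?
Edit: The data comes from a txt file.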
Answer 0 (score: 3)
Another option for converting to a 2-column format, after reading the file, is
df <- read.table("Betty.txt", header=TRUE, fill=TRUE, sep="\t")
## (as @Richard Scriven mentioned in the comment)
na.omit(stack(df))
# values ind
#1 1 LA
#2 4 LA
#3 3 LA
#5 2 NY
#6 5 NY
#7 5 NY
#8 4 NY
#9 3 MA
#10 6 MA
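I got the result above after converting the data to use \t as the delimiter. However, if the file is copied/pasted directly from the OP's post without any changes (making sure there are spaces after the second column in rows 3 and 4)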
lines <- readLines('Betty1.txt')
lines2 <- gsub("(?<=[^ ]) +|^[ ]+(?<=[ ])(?=[^ ])", ",", lines, perl=TRUE)
lines2
#[1] "LA,NY,MA" "1,2,3" "4,5,6" "3,5," ",4,"
df1 <- read.table(text=lines2, sep=',', header=TRUE)
df1
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
Then do
na.omit(stack(df1))
If you have fixed-width columns, another option is to use read.fwf
df <- read.fwf('Betty1.txt', widths=c(3,3,3), skip=1)
colnames(df) <- scan('Betty1.txt', nlines=1, what="", quiet=TRUE)
df
# LA NY MA
#1 1 2 3
#2 4 5 6
#3 3 5 NA
#4 NA 4 NA
library(tidyr)
gather(df, Var, Val, LA:MA, na.rm=TRUE)
# Var Val
#1 LA 1
#2 LA 4
#3 LA 3
#4 NY 2
#5 NY 5
#6 NY 5
#7 NY 4
#8 MA 3
#9 MA 6
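Once the data is in long format, the lm call the question asks about no longer has mismatched rows to complain about. The following is a minimal sketch rather than part of the original answer: it rebuilds the wide data frame inline (so no file is needed) and uses the default column names values and ind that stack() produces; the names long, fit and anova_table are chosen here for illustration.

```r
# Wide data as read in above; NA marks the cells missing from the file.
df <- data.frame(LA = c(1, 4, 3, NA),
                 NY = c(2, 5, 5, 4),
                 MA = c(3, 6, NA, NA))

# Long (2-column) format: `values` holds the numbers, `ind` the state.
long <- na.omit(stack(df))

# One-way ANOVA through lm(): the state is the only factor.
fit <- lm(values ~ ind, data = long)
anova_table <- anova(fit)
print(anova_table)
```

The same formula works on the gather() output if values and ind are swapped for Val and Var, provided Var is treated as a factor.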
Answer 1 (score: 0)
Just add an 'NA' to the 4th data row of the text and try:
> ddf = read.table(text="
+ LA NY MA
+ 1 2 3
+ 4 5 6
+ 3 5
+ NA 4
+ ", header=T, fill=T)
>
> ddf
LA NY MA
1 1 2 3
2 4 5 6
3 3 5 NA
4 NA 4 NA
>
> dput(ddf)
structure(list(LA = c(1L, 4L, 3L, NA), NY = c(2L, 5L, 5L, 4L),
MA = c(3L, 6L, NA, NA)), .Names = c("LA", "NY", "MA"), class = "data.frame", row.names = c(NA,
-4L))
>
> library(reshape2)  # melt() comes from the reshape2 package
> mm = melt(ddf)
No id variables; using all as measure variables
>
> mm
variable value
1 LA 1
2 LA 4
3 LA 3
4 LA NA
5 NY 2
6 NY 5
7 NY 5
8 NY 4
9 MA 3
10 MA 6
11 MA NA
12 MA NA
>
> with(mm, aov(value~variable))
Call:
aov(formula = value ~ variable)
Terms:
variable Residuals
Sum of Squares 4.833333 15.166667
Deg. of Freedom 2 6
Residual standard error: 1.589899
Estimated effects may be unbalanced
3 observations deleted due to missingness
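Printing the aov object shows the sums of squares but not the test itself. As a hedged sketch (not from the original answer), summary() on the fit adds the F statistic and p-value. To stay self-contained it rebuilds ddf from the dput output above and uses base R's stack() as a stand-in for melt(), so the columns are named values and ind rather than value and variable.

```r
# Same data as ddf in the transcript (copied from its dput output).
ddf <- data.frame(LA = c(1L, 4L, 3L, NA),
                  NY = c(2L, 5L, 5L, 4L),
                  MA = c(3L, 6L, NA, NA))

# stack() is a base-R stand-in for melt(); the NA rows are kept on purpose.
mm <- stack(ddf)

# aov() drops the 3 NA rows itself under R's default na.action (na.omit).
fit <- aov(values ~ ind, data = mm)

# summary() returns a list whose first element is the ANOVA table.
tab <- summary(fit)[[1]]
print(tab)

# Pull out the test statistic and p-value for the state factor.
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
```

With the sums of squares and degrees of freedom printed above, the F statistic works out to (4.833333 / 2) / (15.166667 / 6), roughly 0.96.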