I have a file in which every line has this format:
f1,f2,f3,a1,a2,a3,...,an
Here f1, f2 and f3 are fixed fields separated by commas, but f4 is the whole of a1,a2,...,an, where n can vary.
How can I read this into R and conveniently store the variable-length fields a1 to an?
Thanks.
My file looks like this:
3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...
Answer 0 (score: 2)
It is not clear what you mean by "conveniently store". If you think a data frame would suit you, try the following:
df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
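Edit following @Ananda Mahto's comment.
From ?read.table:
"The number of data columns is determined by looking at the first five lines of input".
Thus, if the row with the largest number of fields occurs somewhere after the first five lines, the solution above will fail.
Example of failure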
# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")
# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df
Solution
# This can be solved by first counting the maximum number of columns in the text file
n_col <- max(count.fields("df", sep = ","))
# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
sep = ",", na.strings = "", header = FALSE, fill = TRUE,
col.names = paste0("f", seq_len(n_col)))
df
# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3)))
df
Answer 1 (score: 0)
A place to start:
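dat <- readLines(file) ## file being the path to your file
dat_split <- strsplit(dat, split = ",", fixed = TRUE) ## split each line into its fields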
df <- data.frame(
f1=sapply(dat_split, "[[", 1),
f2=sapply(dat_split, "[[", 2),
f3=sapply(dat_split, "[[", 3),
a=unlist( sapply(dat_split, function(x) {
if (length(x) <= 3) {
return(NA)
} else {
return(paste(x[4:length(x)], collapse=","))
}
}) )
)
When you need to pull the individual items out of a, you can split it as needed.
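A minimal sketch of that later split, assuming a holds the comma-joined tail fields as in the data frame above. The small df here is rebuilt by hand so the snippet runs on its own, and a_list is a hypothetical name used only for illustration:

```r
# Hand-built stand-in for the data frame produced above (illustrative only)
df <- data.frame(
  f1 = c("3", "2", "1", "2"),
  f2 = c("a", "b", "a", "c"),
  f3 = c("-4", "1", "0", "2"),
  a  = c("news,finance", "politics", NA, "book,movie"),
  stringsAsFactors = FALSE
)

# Split the joined field back into one character vector per row;
# rows without extra fields (NA) become empty character vectors
a_list <- lapply(df$a, function(x) {
  if (is.na(x)) character(0) else strsplit(x, ",", fixed = TRUE)[[1]]
})

# Number of variable fields in each row
lengths(a_list)
```

Keeping a as a single string makes the data frame easy to write back out, at the cost of splitting again whenever the individual values are needed.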
Answer 2 (score: 0)
#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind)
#
# Output
#
V1 V2 V3 V4
1 3 a -4 news, finance
2 2 b 1 politics
3 1 a 0
4 2 c 2 book, movie
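A short sketch of how the list column V4 might be used afterwards. It rebuilds the same df as above so that it runs on its own; n_extra and has_movie are hypothetical names introduced only for illustration:

```r
# Rebuild the data frame with a list column, as in the solution above
lines <- c("3,a,-4,news,finance", "2,b,1,politics", "1,a,0", "2,c,2,book,movie")
lines_split <- strsplit(lines, split = ",", fixed = TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)),
                    stringsAsFactors = FALSE)
df$V4 <- lapply(lines_split, "[", -ind)

# Each element of V4 is a character vector holding that row's variable fields
df$V4[[1]]

# Count the variable fields per row (hypothetical helper column)
df$n_extra <- lengths(df$V4)

# Select the rows whose variable fields contain a given value
has_movie <- vapply(df$V4, function(x) "movie" %in% x, logical(1))
df[has_movie, c("V1", "V2", "V3")]
```

A list column keeps the variable-length values as separate elements, so no re-splitting is needed, but some functions that expect purely atomic columns (write.csv, for example) may not accept it without flattening it first.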