How to split fields after reading a file in R

Time: 2013-08-24 21:24:05

Tags: r

Every line of my file has this format:

f1,f2,f3,a1,a2,a3,...,an

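Here f1, f2 and f3 are fixed fields separated by commas, but the fourth field is the whole of a1,a2,...,an, where n may vary from line to line.

How can I read this into R and conveniently store the variable-length a1 to an?

Thanks.

My file looks like this: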

3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie
...

3 answers:

Answer 0: (score: 2)

It is not clear what you mean by "conveniently store". If a data frame suits you, try the following:

df <- read.table(text = "3,a,-4,news,finance
2,b,1,politics
1,a,0
2,c,2,book,movie",
sep = ",", na.strings = "", header = FALSE, fill = TRUE) 

names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3))) 

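Edit following @Ananda Mahto's comment.
From ?read.table: "The number of data columns is determined by looking at the first five lines of input". So if the line with the most fields occurs somewhere after the first five lines, the solution above will fail.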
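
Before relying on the simple read.table call, it may be worth checking whether a given file is affected by this. The following is only a sketch: the temporary file and the names path, n_fields and affected are made up for illustration. It uses count.fields to compare the widest of the first five lines with the widest line overall.

```r
# Hypothetical sample file: the widest line (six fields) is line six
path <- tempfile(fileext = ".csv")
writeLines(c("3,a,-4,news,finance",
             "2,b,1,politics",
             "1,a,0",
             "2,c,2,book,movie",
             "1,a,0",
             "2,c,2,book,movie,news"),
           path)

# Number of comma-separated fields on each line
n_fields <- count.fields(path, sep = ",")

# TRUE when some later line is wider than anything in the first five lines,
# which is the case where the column count would be guessed too low
affected <- max(head(n_fields, 5)) < max(n_fields)
affected
```

If affected is TRUE, passing col.names of the full width is one way around the problem.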

Example of failure

# create a file with max five columns in the first five lines,
# and six columns in the sixth row
cat("3, a, -4, news, finance",
"2, b, 1, politics",
"1, a, 0",
"2, c, 2, book,movie",
"1, a, 0",
"2, c, 2, book, movie, news",
file = "df",
sep = "\n")


# based on the first five rows, read.table determines that number of columns is five,
# and creates an incorrect data frame
df <- read.table(file = "df",
             sep = ",", na.strings = "", header = FALSE, fill = TRUE)
df

Solution

# This can be solved by first counting the maximum number of columns in the text file
# (stored as n_col so the variable does not shadow the base function ncol())
n_col <- max(count.fields("df", sep = ","))

# then this count is used in the col.names argument
# to handle the unknown maximum number of columns after row 5.
df <- read.table(file = "df",
       sep = ",", na.strings = "", header = FALSE, fill = TRUE,
       col.names = paste0("f", seq_len(n_col)))

df

# change column names as above
names(df) <- c(paste0("f", 1:3), paste0("a", 1:(ncol(df) - 3))) 
df

Answer 1: (score: 0)

A place to start:

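dat <- readLines(file) ## file being the path to your file
## split every line on commas; each element is one record's fields
dat_split <- strsplit(dat, ",", fixed = TRUE)
df <- data.frame(
  f1=sapply(dat_split, "[[", 1),
  f2=sapply(dat_split, "[[", 2),
  f3=sapply(dat_split, "[[", 3),
  ## collapse everything after the third field into one string (NA if none)
  a=unlist( sapply(dat_split, function(x) {
    if (length(x) <= 3) { 
      return(NA)
    } else {
      return(paste(x[4:length(x)], collapse=","))
    }
  }) )
)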

When you need the individual values out of a, you can split it as needed.
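
As a rough illustration of that later splitting step, the sketch below rebuilds a small data frame by hand (the values are copied from the question; the name a_split is made up here) and splits column a back into separate values with strsplit. Treating rows without extra fields as empty vectors is an assumption, not something the answer specifies.

```r
# Hypothetical data frame shaped like the one built above
df <- data.frame(
  f1 = c("3", "2", "1", "2"),
  f2 = c("a", "b", "a", "c"),
  f3 = c("-4", "1", "0", "2"),
  a  = c("news,finance", "politics", NA, "book,movie"),
  stringsAsFactors = FALSE
)

# Split each collapsed string back into its components
a_split <- strsplit(df$a, ",", fixed = TRUE)

# strsplit() returns NA for NA input; use an empty vector for those rows instead
a_split[is.na(df$a)] <- list(character(0))

# Number of extra values on each row
lengths(a_split)
```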

Answer 2: (score: 0)

#
# Read example data
#
txt <- "3,a,-4,news,finance\n2,b,1,politics\n1,a,0\n2,c,2,book,movie"
tc = textConnection(txt)
lines <- readLines(tc)
close(tc)
#
# Solution
#
lines_split <- strsplit(lines, split=",", fixed=TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)))
df$V4 <- lapply(lines_split, "[", -ind) 
#
# Output
#
  V1 V2 V3            V4
1  3  a -4 news, finance
2  2  b  1      politics
3  1  a  0              
4  2  c  2   book, movie
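
Since V4 is a list column, each cell holds a character vector of whatever length that line had. Here is a sketch of a few ways one might work with it, rebuilding the same data frame first. The names n_extra, has_movie and long are introduced only for illustration, and the long layout is just one possible choice.

```r
lines <- c("3,a,-4,news,finance", "2,b,1,politics", "1,a,0", "2,c,2,book,movie")
lines_split <- strsplit(lines, split = ",", fixed = TRUE)
ind <- 1:3
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", ind)),
                    stringsAsFactors = FALSE)
df$V4 <- lapply(lines_split, "[", -ind)

# Number of variable-length values on each row
n_extra <- lengths(df$V4)

# Rows whose extra fields include "movie"
has_movie <- vapply(df$V4, function(x) "movie" %in% x, logical(1))

# Long format: one row per (record, value) pair;
# records without extra values contribute no rows
long <- data.frame(
  row   = rep(seq_len(nrow(df)), n_extra),
  value = unlist(df$V4),
  stringsAsFactors = FALSE
)
long
```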