I have data that looks like this:
1.1.1.1 Alcohol dehydrogenase.
1.1.1.2 Alcohol dehydrogenase (NADP(+)).
1.1.1.3 Homoserine dehydrogenase.
1.1.1.4 (R,R)-butanediol dehydrogenase.
1.1.1.5 Transferred entry: 1.1.1.303 and 1.1.1.304.
1.1.1.6 Glycerol dehydrogenase.
1.1.1.7 Propanediol-phosphate dehydrogenase.
1.1.1.8 Glycerol-3-phosphate dehydrogenase (NAD(+)).
1.1.1.9 D-xylulose reductase.
1.1.1.10 L-xylulose reductase.
I load it with read.table like this:
df <- read.table("path to data", header=F, fill=T)
and I get the following data:
df <- structure(list(V1 = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 2L), .Label = c("1.1.1.1", "1.1.1.10", "1.1.1.2", "1.1.1.3",
"1.1.1.4", "1.1.1.5", "1.1.1.6", "1.1.1.7", "1.1.1.8", "1.1.1.9"
), class = "factor"), V2 = structure(c(2L, 2L, 6L, 1L, 9L, 4L,
8L, 5L, 3L, 7L), .Label = c("(R,R)-butanediol dehydrogenase.",
"Alcohol", "D-xylulose", "Glycerol", "Glycerol-3-phosphate",
"Homoserine", "L-xylulose", "Propanediol-phosphate", "Transferred"
), class = "factor"), V3 = structure(c(3L, 2L, 3L, 1L, 4L, 3L,
3L, 2L, 5L, 5L), .Label = c("", "dehydrogenase", "dehydrogenase.",
"entry:", "reductase."), class = "factor"), V4 = structure(c(1L,
3L, 1L, 1L, 4L, 1L, 1L, 2L, 1L, 1L), .Label = c("", "(NAD(+)).",
"(NADP(+)).", "1.1.1.303"), class = "factor"), V5 = structure(c(1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "and"), class = "factor"),
V6 = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"1.1.1.304."), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
I use fill = T because without it I get an error:
df <- read.table("path/example.txt", header=F, fill=F)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 6 elements
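Is there a way to load the data so that it has two columns? Or to combine the data afterwards so that I end up with two columns in R?
Note that I can do this with read.delim, but that causes problems with other code I am using.
My desired output looks like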
Answer 0: (score: 1)
Not pretty, but here is a possible workaround:
# The separator is multi-space, which sep cannot express in R (it must be a single character)
file_data <- "1.1.1.1 Alcohol dehydrogenase.
1.1.1.2 Alcohol dehydrogenase (NADP(+)).
1.1.1.3 Homoserine dehydrogenase.
1.1.1.4 (R,R)-butanediol dehydrogenase.
1.1.1.5 Transferred entry: 1.1.1.303 and 1.1.1.304.
1.1.1.6 Glycerol dehydrogenase.
1.1.1.7 Propanediol-phosphate dehydrogenase.
1.1.1.8 Glycerol-3-phosphate dehydrogenase (NAD(+)).
1.1.1.9 D-xylulose reductase.
1.1.1.10 L-xylulose reductase."
# Replace the first run of spaces on each line (the ID/name boundary) with \t, which is identifiable.
# Replace textConnection(file_data) with your data file name.
read_text <- readLines(textConnection(file_data))
altered_text <- sub(" +", "\t", read_text)
# Parse the altered text, now tab-separated into two fields
df <- read.delim(textConnection(altered_text), header=FALSE, sep="\t", fill=TRUE)
df
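The problem is that your separator is more than one character (http://r.789695.n4.nabble.com/multiple-separators-in-sep-argument-for-read-table-td856567.html).
The alternative is to alter the data before loading so that a single common separator is used between columns. Otherwise, read the data as is, then add a data step that concatenates columns 2 onward into one column, e.g. using paste.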
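As a rough sketch of the "alter the data before loading" idea without the intermediate tab step, each line can be split at its first whitespace character directly. The three sample lines and the name `df_two` are illustrative only; with a real file you would presumably start from `readLines()` on your own path instead.

```r
# Sample lines standing in for readLines("<your file>") (the path is up to you)
lines <- c(
  "1.1.1.1 Alcohol dehydrogenase.",
  "1.1.1.5 Transferred entry: 1.1.1.303 and 1.1.1.304.",
  "1.1.1.10 L-xylulose reductase."
)

# Everything before the first whitespace is the ID; everything after it is the name
df_two <- data.frame(
  V1 = sub("[[:space:]].*$", "", lines),
  V2 = trimws(sub("^[^[:space:]]+[[:space:]]+", "", lines)),
  stringsAsFactors = FALSE
)
df_two
```

This assumes the ID itself never contains whitespace, which holds for the sample data shown in the question.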
Answer 1: (score: 0)
Using base R, you can combine Reduce() with paste(), then trim the whitespace with trimws() to produce another data.frame:
df2 <- data.frame(V1 = df[1], V2 = trimws(Reduce(paste, df[-1])))
> df2
V1 V2
1 1.1.1.1 Alcohol dehydrogenase.
2 1.1.1.2 Alcohol dehydrogenase (NADP(+)).
3 1.1.1.3 Homoserine dehydrogenase.
4 1.1.1.4 (R,R)-butanediol dehydrogenase.
5 1.1.1.5 Transferred entry: 1.1.1.303 and 1.1.1.304.
6 1.1.1.6 Glycerol dehydrogenase.
7 1.1.1.7 Propanediol-phosphate dehydrogenase.
8 1.1.1.8 Glycerol-3-phosphate dehydrogenase (NAD(+)).
9 1.1.1.9 D-xylulose reductase.
10 1.1.1.10 L-xylulose reductase.
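A possible variant of the same idea, shown only as a sketch: `do.call(paste, ...)` pastes all remaining columns row-wise in a single call instead of folding them pairwise. The two-row sample and the names `raw`, `df_small`, and `df_small2` are made up for illustration; the sample is parsed with `fill = TRUE` as in the question.

```r
# Two sample rows, read the same way as in the question (padded to 6 columns by fill)
raw <- "1.1.1.1 Alcohol dehydrogenase.
1.1.1.5 Transferred entry: 1.1.1.303 and 1.1.1.304."
df_small <- read.table(text = raw, header = FALSE, fill = TRUE,
                       stringsAsFactors = FALSE)

# Paste columns 2..n together row by row, then drop the trailing padding spaces
name <- trimws(do.call(paste, df_small[-1]))
df_small2 <- data.frame(V1 = df_small$V1, V2 = name, stringsAsFactors = FALSE)
df_small2
```

Either form relies on the empty padding fields sitting at the end of each row, so that trimws() is enough to clean up the result.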