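How do I import and arrange a comma-separated dataset?

Time: 2015-05-15 07:13:13

Tags: r

This is a very specific question. My data looks like the sample below, but it is much larger and spread across many files (not just one).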

TOURERG_ID,RawDataID,IndexNo,IndexValue
19003771,11,240,1.1858652499
19003771,11,241,1.177533477
19003771,11,242,1.1704270598
19003771,11,243,1.1620838731
19003771,11,244,1.1540253051
19003771,11,245,1.1464526996
19003771,11,246,1.1394576168
19003771,11,247,1.1328267903
19003771,11,248,1.1258228114
19003771,11,249,1.1171001937

19003771,11,249,1.1237839518
19003771,11,250,1.1113389261
19003771,11,251,1.0938118176
19003771,11,252,1.0704340703
19003771,11,253,1.0418955374
19003771,11,254,1.0104241602
19003771,11,255,0.97917606379
19003771,11,256,0.95110409662
19003771,11,257,0.9277733067
19003771,11,258,0.90865127357

19000693,11,240,1.1952986902
19000693,11,241,1.1867360653
19000693,11,242,1.1793816406
19000693,11,243,1.1707059267
19000693,11,244,1.1623008189
19000693,11,245,1.1543825533
19000693,11,246,1.1470470507
19000693,11,247,1.1400880358
19000693,11,248,1.1327804778
19000693,11,249,1.1237839518

19000693,11,252,1.0704340703
19000693,11,253,1.0418955374
19000693,11,254,1.0104241602
19000693,11,255,0.97917606379
19000693,11,256,0.95110409662
19000693,11,257,0.9277733067
19000693,11,258,0.90865127357
19000693,11,259,0.89118257832
19000693,11,260,0.87161311454
19000693,11,261,0.84625725399

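What I would like to have is shown below. That is, from each block, keep only the first value before the comma, prefix it with ID, append _1 for the first block and _2 for the second, and then keep all the values after the last comma.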

ID_19003771_1   ID_19003771_2   ID_19000693_1   ID_19000693_2
1.1858652499    1.1237839518    1.1952986902    1.0704340703
1.177533477     1.1113389261    1.1867360653    1.0418955374
1.1704270598    1.0938118176    1.1793816406    1.0104241602
1.1620838731    1.0704340703    1.1707059267    0.97917606379
1.1540253051    1.0418955374    1.1623008189    0.95110409662
1.1464526996    1.0104241602    1.1543825533    0.9277733067
1.1394576168    0.97917606379   1.1470470507    0.90865127357
1.1328267903    0.95110409662   1.1400880358    0.89118257832
1.1258228114    0.9277733067    1.1327804778    0.87161311454
1.1171001937    0.90865127357   1.1237839518    0.84625725399

Honestly, I don't even know where to start.

1 answer:

Answer 0 (score: 1)

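We can use read.table with blank.lines.skip=FALSE so that the blank lines are read in as rows of NA. Use those NA rows to create a grouping variable ('gr'), then split the last column by 'gr'. The list elements can be named after the 'TOURERG_ID'. If the same 'TOURERG_ID' occurs more than once, use make.unique to create unique IDs. Based on the comments, if separate data.frames are needed in the global environment, use list2env (not recommended, though), since most operations can be done within the list itself.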

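# blank.lines.skip=FALSE keeps the empty lines; they are read in as rows of NA
df1 <- read.table('Nemo3.txt', sep=",", stringsAsFactors=FALSE,
         header=TRUE, blank.lines.skip=FALSE)
indx <- is.na(df1[,1])   # TRUE for the blank (NA) rows
gr <- cumsum(indx)       # grouping variable: increases by one at each blank row
# split the last column by group, leaving out the NA rows
lst <- split(df1[!indx, 4, drop=FALSE], gr[!indx])
# one TOURERG_ID per group, used to name the list elements
nm1 <- tapply(df1[,1], gr,
            FUN= function(x) unique(x[!is.na(x)]))
names(lst) <- paste('ID', make.unique(as.character(nm1)), sep="_")
# optional: create one data.frame per list element in the global environment
list2env(lst, envir=.GlobalEnv)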
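
To try the grouping steps without the original file, here is a minimal sketch that wraps them in a function and runs it on a small temporary file. The helper name read_blocks and the shortened toy values are assumptions for illustration only; the logic follows the same read.table / cumsum / split idea as the code above.

```r
# Hypothetical helper (name is illustrative): read one file and return a
# named list holding the IndexValue column of each blank-line-separated block.
read_blocks <- function(path) {
  df <- read.table(path, sep = ",", header = TRUE,
                   stringsAsFactors = FALSE, blank.lines.skip = FALSE)
  blank <- is.na(df[, 1])   # TRUE for rows that were blank lines
  gr <- cumsum(blank)       # block counter, increases at each blank line
  lst <- split(df[!blank, 4, drop = FALSE], gr[!blank])
  # first TOURERG_ID of each block, used for naming
  ids <- tapply(df[!blank, 1], gr[!blank], FUN = function(x) x[1])
  names(lst) <- paste("ID", make.unique(as.character(ids)), sep = "_")
  lst
}

# Toy file in the question's layout (values shortened, not the real data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("TOURERG_ID,RawDataID,IndexNo,IndexValue",
             "19003771,11,240,1.18", "19003771,11,241,1.17", "",
             "19003771,11,249,1.12", "19003771,11,250,1.11", "",
             "19000693,11,240,1.19", "19000693,11,241,1.18"), tmp)

blocks <- read_blocks(tmp)
names(blocks)
# "ID_19003771" "ID_19003771.1" "ID_19000693"
```

Since the question mentions many files, the same helper could probably be applied to each of them with something like lapply(list.files(pattern = "\\.txt$"), read_blocks); the file pattern there is a guess about how the files are named.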

If we need a single dataset with a grouping column:

library(tidyr)
res <- unnest(lst, group)
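
Depending on the installed tidyr version, unnest may expect a data frame rather than a plain list, so a base R alternative can be handy. The sketch below is not part of the original answer: it uses a small stand-in for 'lst' with toy values, builds the long table with a group column, and also builds the wide layout the question asks for (ID_<id>_1, ID_<id>_2, ...). The wide version assumes every block has the same number of rows.

```r
# Stand-in for 'lst' and the block IDs (toy values, not the real data)
lst <- list(
  ID_19003771   = data.frame(IndexValue = c(1.18, 1.17)),
  ID_19003771.1 = data.frame(IndexValue = c(1.12, 1.11)),
  ID_19000693   = data.frame(IndexValue = c(1.19, 1.18))
)
ids <- c(19003771L, 19003771L, 19000693L)  # one TOURERG_ID per block

# Long format: one row per value, with the list name as grouping column
res <- data.frame(
  group      = rep(names(lst), vapply(lst, nrow, integer(1))),
  IndexValue = unlist(lapply(lst, `[[`, "IndexValue"), use.names = FALSE)
)

# Wide format: one column per block, numbered within each ID
k <- ave(seq_along(ids), ids, FUN = seq_along)   # 1, 2, 1
wide <- as.data.frame(lapply(lst, `[[`, "IndexValue"))
names(wide) <- paste("ID", ids, k, sep = "_")
wide
```

With the real data, 'ids' would correspond to the values in 'nm1' from the code above.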