R: Splitting concatenated data by row

Time: 2017-06-20 14:57:08

Tags: r

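I have a large text file that is the result of another program appending several tables together (headers included). I want to read the file into R and split it apart.

This is somewhat similar to this problem, except that I don't know the row numbers; I need to split at the header rows rather than at specific row numbers. I know that each table begins with a header, and that the Wvlgth values always start at 337.0 and end at 823.0.

Here is the text file. It looks identical to this, only with 550 rows.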

Wvlgth Global
337.0  .4345
337.5  .1256
338.0  .8754
<...>
821.0  .9923
822.0  .7124
823.0  .2999
Wvlgth Global
337.0  .5632
337.5  .1245
338.0  .0012
<...>
821.0  .1987
822.0  .6743
823.0  .2045

Here is code to generate something similar in R:

    df = data.frame("Wvlgth" = c('337.0','337.5','338.0','821.0','822.0','823.0'), 
                   "Global" = c(.4345, .1256, .8754, .9923, .7124, .2999))

I would like this to become multiple data frames, like so:

Dataframe 1

     Wvlgth Global
1    337.0 .4345
2    337.5 .1256
3    338.0 .8754
<...>
548  821.0 .9923
549  822.0 .7124
550  823.0 .2999

Dataframe 2

     Wvlgth Global
1    337.0 .5632
2    337.5 .1245
3    338.0 .0012
<...>
548  821.0 .1987
549  822.0 .6743
550  823.0 .2045

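I'm not sure whether there is a way to do this with read.csv, or whether I need to read the whole thing in and split it afterwards.

2 answers:

Answer 0 (score: 1)

Here is a method using split. Basically, you create a group column by applying cumsum to the comparison of column 1 against "Wvlgth". You can then split the result into a list, and access the elements of that list like this: df_list[[1]]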

df <- read.table(text="Wvlgth Global
337.0  .4345
                337.5  .1256
                338.0  .8754
                821.0  .9923
                822.0  .7124
                823.0  .2999
                Wvlgth Global
                337.0  .5632
                337.5  .1245
                338.0  .0012
                821.0  .1987
                822.0  .6743
                823.0  .2045",header=FALSE,stringsAsFactors=FALSE)
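# running count of header rows in column 1: increases by one at each "Wvlgth"
df$group <- cumsum(df[,1]=="Wvlgth")
# split into a list with one data.frame per group
df_list <- split(df, list(df$group))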

$`1`
      V1     V2 group
1 Wvlgth Global     1
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

$`2`
       V1     V2 group
8  Wvlgth Global     2
9   337.0  .5632     2
10  337.5  .1245     2
11  338.0  .0012     2
12  821.0  .1987     2
13  822.0  .6743     2
14  823.0  .2045     2
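
To see why the cumsum step produces a group id, here is a minimal sketch on a toy vector (first_col and its values are made up for illustration, not taken from the real file). The comparison is TRUE only on header rows, and the running sum of those TRUE values goes up by one at each header.

```r
# Toy first column: two tables of two rows each, headers included (made-up values)
first_col <- c("Wvlgth", "337.0", "337.5", "Wvlgth", "337.0", "337.5")

is_header <- first_col == "Wvlgth"   # TRUE only on header rows
group <- cumsum(is_header)           # running count of headers seen so far

print(group)
# [1] 1 1 1 2 2 2
```

Every row between two headers shares the same number, which is what split uses to cut the data apart.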

To access a single data.frame:

df_list[[1]]
      V1     V2 group
1 Wvlgth Global     1
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

Also, if you want to set the column names of the data.frames, you can use lapply:

new_col_name <- c("Wvlgth", "Global","group")
df_list <- lapply(df_list, setNames, nm = new_col_name) #set names
df_list <- lapply(df_list, function(x) x[-1,]) #remove first row

> df_list
$`1`
  Wvlgth Global group
2  337.0  .4345     1
3  337.5  .1256     1
4  338.0  .8754     1
5  821.0  .9923     1
6  822.0  .7124     1
7  823.0  .2999     1

$`2`
   Wvlgth Global group
9   337.0  .5632     2
10  337.5  .1245     2
11  338.0  .0012     2
12  821.0  .1987     2
13  822.0  .6743     2
14  823.0  .2045     2
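
Because the header rows were read in together with the data, the columns in each piece are still character at this point. As a possible follow-up (a sketch only; the helper name clean_piece and the small stand-in list are assumptions made for illustration), the helper column can be dropped and the remaining columns converted to numeric:

```r
# Stand-in for one element of df_list after the header row was dropped
# (values copied from the output above; columns are character because the
# header rows were mixed into the data when it was read)
piece <- data.frame(Wvlgth = c("337.0", "337.5", "338.0"),
                    Global = c(".4345", ".1256", ".8754"),
                    group  = c(1L, 1L, 1L),
                    stringsAsFactors = FALSE)
df_list <- list(`1` = piece)

# Hypothetical helper: tidy one piece of the list
clean_piece <- function(x) {
  x$group <- NULL                # drop the helper column
  x[] <- lapply(x, as.numeric)   # convert every remaining column to numeric
  rownames(x) <- NULL            # renumber rows starting from 1
  x
}

df_list <- lapply(df_list, clean_piece)
str(df_list[[1]])
```

With the real df_list from the steps above, only the last two statements would be needed.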

Answer 1 (score: 0)

I generated some Monte Carlo data myself:

#loading libraries
library(stringi)
library(data.table)
library(plyr)

input <- readLines("path/to/your/csv", warn = FALSE)  # read the input file line by line
input <- trimws(input)  # strip leading and trailing whitespace

which looks like:

 >input
 [1] "Wvlgth Global" "337.0 0.4345"  "337.5 0.1256"  "338.0 0.8754"  "821.0 0.9923"  "822.0 0.7124"  "823.0 0.2999"  "Wvlgth Global" "327.0 0.5345"  "317.5 0.5256" 
[11] "358.0 0.4754"  "871.0 0.93235" "882.0 0.2124"  "893.0 0.1999"  "811.0 0.93235" "972.0 0.33235" "Wvlgth Global" "893.0 0.2399"  "193.0 0.5120"  "892.0 0.3199" 

To convert this into a more useful format (a data.table):

# split each line on a single space and bind the pieces into a data.table
dt <- data.table(ldply(stri_split(str = input, fixed = " "), "["))
# convert both columns to numeric; as.numeric turns the header text into NA (with a coercion warning)
dt[,Wvlgth:=as.numeric(V1)][,Global:=as.numeric(V2)][,V1:=NULL][,V2:=NULL]
# flag the rows where Wvlgth is NA, i.e. the header rows
dt[,containsNA:=is.na(Wvlgth)]

The resulting data.table looks like this:

>dt
     Wvlgth  Global containsNA
1:     NA      NA       TRUE
2:  337.0 0.43450      FALSE
3:  337.5 0.12560      FALSE
4:  338.0 0.87540      FALSE
5:  821.0 0.99230      FALSE
6:  822.0 0.71240      FALSE
7:  823.0 0.29990      FALSE
8:     NA      NA       TRUE
9:  327.0 0.53450      FALSE
10:  317.5 0.52560      FALSE
11:  358.0 0.47540      FALSE
12:  871.0 0.93235      FALSE
13:  882.0 0.21240      FALSE
14:  893.0 0.19990      FALSE
15:  811.0 0.93235      FALSE
16:  972.0 0.33235      FALSE
17:     NA      NA       TRUE
18:  893.0 0.23990      FALSE
19:  193.0 0.51200      FALSE
20:  892.0 0.31990      FALSE
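
The NA rows above come from the numeric conversion of the header text. A minimal sketch of that behaviour on a made-up vector (suppressWarnings is used here only to hide the coercion warning):

```r
# Header text cannot be parsed as a number, so as.numeric returns NA for it;
# suppressWarnings() hides the "NAs introduced by coercion" warning
x <- suppressWarnings(as.numeric(c("Wvlgth", "337.0", "337.5")))

is.na(x)
# [1]  TRUE FALSE FALSE
```

This is why is.na(Wvlgth) can serve as a marker for header rows, as long as the data itself contains no missing or non-numeric values.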

Then we apply

 l1<-split(dt,cumsum(dt$containsNA))

which yields:

>l1
$`1`
  Wvlgth Global containsNA
1     NA     NA       TRUE
2  337.0 0.4345      FALSE
3  337.5 0.1256      FALSE
4  338.0 0.8754      FALSE
5  821.0 0.9923      FALSE
6  822.0 0.7124      FALSE
7  823.0 0.2999      FALSE

$`2`
   Wvlgth  Global containsNA
8      NA      NA       TRUE
9   327.0 0.53450      FALSE
10  317.5 0.52560      FALSE
11  358.0 0.47540      FALSE
12  871.0 0.93235      FALSE
13  882.0 0.21240      FALSE 
14  893.0 0.19990      FALSE
15  811.0 0.93235      FALSE
16  972.0 0.33235      FALSE

$`3`
  Wvlgth Global containsNA
17     NA     NA       TRUE
18    893 0.2399      FALSE
19    193 0.5120      FALSE
20    892 0.3199      FALSE

Finally, to get the format we want (dropping the NA rows and the containsNA column), we do the following to each element of the list:

lapply(l1,function(x) x[,.SD[(!is.na(Wvlgth))]][,containsNA:=NULL])

resulting in:

$`1`
   Wvlgth Global
1:  337.0 0.4345 
2:  337.5 0.1256
3:  338.0 0.8754
4:  821.0 0.9923
5:  822.0 0.7124
6:  823.0 0.2999

$`2`
    Wvlgth  Global
1:  327.0 0.53450
2:  317.5 0.52560
3:  358.0 0.47540
4:  871.0 0.93235
5:  882.0 0.21240
6:  893.0 0.19990
7:  811.0 0.93235
8:  972.0 0.33235

$`3`
   Wvlgth Global
1:    893 0.2399
2:    193 0.5120
3:    892 0.3199
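
The same idea can also be sketched in base R without extra packages, assuming every header line starts with "Wvlgth". The sample lines below are a shortened, made-up stand-in for the file; with a real file they would come from readLines. Each chunk is parsed on its own, so its header line supplies the column names and the values come out numeric.

```r
# Sample input in the same shape as the file (shortened stand-in)
lines <- c("Wvlgth Global",
           "337.0  .4345",
           "337.5  .1256",
           "823.0  .2999",
           "Wvlgth Global",
           "337.0  .5632",
           "337.5  .1245",
           "823.0  .2045")
# With a real file this would be: lines <- readLines("path/to/your/csv")

lines <- trimws(lines)
lines <- lines[nzchar(lines)]              # drop blank lines, if any
group <- cumsum(grepl("^Wvlgth", lines))   # new group at each header line

# Parse each chunk separately so its own header line gives the column names
tables <- lapply(split(lines, group), function(chunk) {
  read.table(text = chunk, header = TRUE)
})

tables[[2]]
```

The result is a list with one data.frame per table, accessed as tables[[1]], tables[[2]], and so on.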

Appendix:

In case the link to the MC data stops working, here is the data used for this particular problem:

 Wvlgth Global
 337.0 0.4345
 337.5 0.1256
 338.0 0.8754
 821.0 0.9923
 822.0 0.7124
 823.0 0.2999
 Wvlgth Global
 327.0 0.5345
 317.5 0.5256
 358.0 0.4754
 871.0 0.93235
 882.0 0.2124
 893.0 0.1999
 811.0 0.93235
 972.0 0.33235
 Wvlgth Global
 893.0 0.2399
 193.0 0.5120
 892.0 0.3199