R: Reading a dataset where each observation is split across 2 lines?

Date: 2018-05-17 07:36:43

Tags: r dataset newline

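I am trying to read a space-delimited file in which each observation is broken across lines by a newline. Is there a way to make read.table or fread keep scanning values until a full row is complete?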

The header and the first two rows of the dataset look like this:

   tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
       750000   4411.765         41          1          1          1     1.5357
           76   16.75596   17166.67   27177.04        170         41
      1926395   4280.878         39          2          2          3     1.5357
          192   22.49376   17166.67   27177.04        450         39

3 answers:

Answer 0 (score: 1)

Since every row of the final data is split into exactly 2 complete lines in the input, you can try this:

# read the file as raw lines
txt <- readLines("test.txt")

# extract the header and remove it from the data;
# trimws() and "\\s+" tolerate leading or repeated spaces in the header line
df_header <- strsplit(trimws(txt[1]), split = "\\s+")[[1]]
txt <- txt[-1]

# merge every 2 subsequent lines into one to form a row of the final data frame
idx <- seq(1, length(txt), by = 2)
txt <- paste(txt[idx], txt[idx + 1])

# final data
df <- read.table(text = txt, col.names = df_header)

The output is:

   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
2 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39

Sample data: test.txt contains

tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000   4411.765         41          1          1          1     1.5357
76   16.75596   17166.67   27177.04        170         41
1926395   4280.878         39          2          2          3     1.5357
192   22.49376   17166.67   27177.04        450         39
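
The merging step above can also be wrapped in a small function. The following is only a sketch under stated assumptions: `merge_lines()` and its `per_record` argument are hypothetical names introduced here, the sample lines are held in memory instead of being read from test.txt, and every observation is assumed to span the same number of lines.

```r
# Sample lines copied from test.txt above, kept in memory so no file is needed.
txt <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000   4411.765         41          1          1          1     1.5357",
  "76   16.75596   17166.67   27177.04        170         41",
  "1926395   4280.878         39          2          2          3     1.5357",
  "192   22.49376   17166.67   27177.04        450         39"
)

# merge_lines() is a hypothetical helper, not part of base R.
# per_record is the assumed number of physical lines per observation.
merge_lines <- function(lines, per_record = 2) {
  # Split the header on any run of whitespace.
  header <- strsplit(trimws(lines[1]), "\\s+")[[1]]
  body <- lines[-1]
  # Guard against a truncated file: the body must hold whole records only.
  stopifnot(length(body) > 0, length(body) %% per_record == 0)
  # Each matrix column holds the lines of one record; paste them together.
  joined <- apply(matrix(body, nrow = per_record), 2, paste, collapse = " ")
  read.table(text = joined, col.names = header)
}

df <- merge_lines(txt)
str(df)
```

The guard fails early on a file with an odd number of data lines, where the original pairing would silently paste an `NA` onto the last row.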

Answer 1 (score: 1)

I read in your sample data, and it looks like this...

   tsales      sales   margin     nown nfull npart   naux hoursw hourspw inv1 inv2 ssize start
1  750000 4411.76500    41.00     1.00     1     1 1.5357     NA      NA   NA   NA    NA    NA
2      76   16.75596 17166.67 27177.04   170    41     NA     NA      NA   NA   NA    NA    NA
3 1926395 4280.87800    39.00     2.00     2     3 1.5357     NA      NA   NA   NA    NA    NA
4     192   22.49376 17166.67 27177.04   450    39     NA     NA      NA   NA   NA    NA    NA

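Because the two kinds of rows alternate and the short ones have fewer columns, we can handle this with straightforward code: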

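# read the file; short rows are padded with NA
# (assumption: for the space-delimited file in the question,
#  read.table("mydata.csv", header = TRUE, fill = TRUE) would be needed instead)
Data <- read.csv("mydata.csv")
# odd rows: the first 7 values of each observation (naux is present)
firstData <- Data[!is.na(Data$naux), ]
# even rows: the remaining 6 values, stored in the first 6 columns
secondData <- Data[is.na(Data$naux), ]
# copy the even-row values into the empty columns of the odd rows
firstData$hoursw <- secondData$tsales
firstData$hourspw <- secondData$sales
firstData$inv1 <- secondData$margin
firstData$inv2 <- secondData$nown
firstData$ssize <- secondData$nfull
firstData$start <- secondData$npart
Data <- firstData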

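The data is split into two parts: odd rows and even rows. The empty (NA) columns of the odd rows are then filled with the correct values taken from the even rows. Hope this helps!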

The final output is

> firstData
   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
3 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39

> secondData
  tsales    sales   margin     nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2     76 16.75596 17166.67 27177.04   170    41   NA     NA      NA   NA   NA    NA    NA
4    192 22.49376 17166.67 27177.04   450    39   NA     NA      NA   NA   NA    NA    NA

> Data
   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
3 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39
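
The six column-by-column assignments can also be done by position in one step. This is a minimal sketch, not the answerer's code: it assumes the space-delimited sample from the question, reads it with `read.table(fill = TRUE)`, and assumes the first physical line of each observation always carries 7 values (`n_first` is a name introduced here for illustration).

```r
# Sample data from the question, kept in memory so no file is needed.
txt <- "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000   4411.765         41          1          1          1     1.5357
76   16.75596   17166.67   27177.04        170         41
1926395   4280.878         39          2          2          3     1.5357
192   22.49376   17166.67   27177.04        450         39"

# fill = TRUE pads the short rows with NA, giving the 4-row table shown earlier.
raw <- read.table(text = txt, header = TRUE, fill = TRUE)

odd  <- raw[seq(1, nrow(raw), by = 2), ]  # first line of each observation
even <- raw[seq(2, nrow(raw), by = 2), ]  # second line of each observation

# Assumption: the first line of every observation holds 7 values,
# so the remaining columns live in the leading columns of the even rows.
n_first <- 7
n_rest  <- ncol(raw) - n_first
odd[, n_first + seq_len(n_rest)] <- even[, seq_len(n_rest)]

result <- odd
rownames(result) <- NULL  # renumber rows 1, 2 instead of 1, 3
print(result)
```

Copying by position avoids naming each column, at the cost of relying on the 7 + 6 layout staying fixed throughout the file.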

Answer 2 (score: 1)

Here is a data.table solution (I copied your example into the file dfTest.txt). See the comments for an explanation:

library(data.table)
#fill=TRUE fills empty cols due to irregular structure with NAs
dt=fread("dfTest.txt",header = TRUE,sep=" ",fill=TRUE)
#cols to fix
selCols=c("hoursw","hourspw","inv1","inv2","ssize","start")
#cols from which to read
otherCols=colnames(dt)[seq_along(selCols)]
#fill missing cols from leading rows and select every 2nd row afterwards
dt[,c(selCols):=shift(.SD,n=1L,type="lead"),
    .SDcols=otherCols][seq(1,nrow(dt),2),]
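
To make the lead shift easier to follow, here is a base R sketch of the same idea on a toy table. It is an illustration only: `toy` and `lead1()` are hypothetical names introduced here, and `lead1()` mimics what a lead shift by one position does (move every value up one row and pad the end with `NA`).

```r
# A toy table with the same alternating pattern: column b is empty on every
# row, and the value that belongs there sits in column a of the next row.
toy <- data.frame(a = c(10, 1, 20, 2), b = c(NA, NA, NA, NA))

# lead1() is a hypothetical helper: shift a vector up by one position
# and pad the end with NA.
lead1 <- function(x) c(x[-1], NA)

toy$b <- lead1(toy$a)                      # b[i] now holds a[i + 1]
fixed <- toy[seq(1, nrow(toy), by = 2), ]  # keep only the odd rows
print(fixed)
```

After the shift, each odd row carries the values of the row below it, so dropping the even rows leaves one complete observation per row.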