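I am trying to read a space-delimited file in which each observation is broken across lines by a newline. Is there a way to make read.table or fread keep scanning values until a full row is complete?
The header and the first two rows of the dataset look like this: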
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000 4411.765 41 1 1 1 1.5357
76 16.75596 17166.67 27177.04 170 41
1926395 4280.878 39 2 2 3 1.5357
192 22.49376 17166.67 27177.04 450 39
Answer 0 (score: 1)
Since each row of the final data is split across exactly 2 lines in the input, you can try this:
#read file
txt <- readLines("test.txt")
#extract header and remove it from data
df_header <- strsplit(txt[1], split=" ")[[1]]
txt <- txt[-1]
#merge every 2 consecutive lines into one to form a row of the final data frame
idx <- seq(1, length(txt), by=2)
txt[idx] <- paste(txt[idx], txt[idx+1])
txt <- txt[-(idx+1)]
#final data
df <- read.table(text=txt, col.names=df_header)
The output is:
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
2 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
Sample data: test.txt contains
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000 4411.765 41 1 1 1 1.5357
76 16.75596 17166.67 27177.04 170 41
1926395 4280.878 39 2 2 3 1.5357
192 22.49376 17166.67 27177.04 450 39
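As an alternative to pasting pairs of lines together, the values can be read as one long stream and reshaped. This is only a sketch, assuming every column is numeric and every observation has exactly as many values as there are header names; the names `path`, `header`, and `vals` are illustrative and not from the original answer. It relies on scan() reading whitespace-separated values without regard to where the lines break.

```r
# Write the sample input to a temporary file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000 4411.765 41 1 1 1 1.5357",
  "76 16.75596 17166.67 27177.04 170 41",
  "1926395 4280.878 39 2 2 3 1.5357",
  "192 22.49376 17166.67 27177.04 450 39"
), path)

# Column names come from the first line only
header <- scan(path, what = "", nlines = 1, quiet = TRUE)

# Read every remaining value as numeric, ignoring line breaks
vals <- scan(path, skip = 1, quiet = TRUE)

# Assumption: the value count is a multiple of the number of columns
stopifnot(length(vals) %% length(header) == 0)

# Every length(header) consecutive values form one observation
df <- as.data.frame(matrix(vals, ncol = length(header), byrow = TRUE))
names(df) <- header
print(df)
```

Unlike the line-pairing approach, this does not depend on each observation being split into exactly 2 lines, but it would need adjusting if any column held non-numeric data.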
Answer 1 (score: 1)
When I read in your sample data, it looks like this...
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.76500 41.00 1.00 1 1 1.5357 NA NA NA NA NA NA
2 76 16.75596 17166.67 27177.04 170 41 NA NA NA NA NA NA NA
3 1926395 4280.87800 39.00 2.00 2 3 1.5357 NA NA NA NA NA NA
4 192 22.49376 17166.67 27177.04 450 39 NA NA NA NA NA NA NA
Because the two kinds of rows alternate and there are only a few columns, this is easy to code:
Data=read.csv("mydata.csv")
firstData=Data[!is.na(Data$naux),]
secondData=Data[is.na(Data$naux),]
firstData$hoursw=secondData$tsales
firstData$hourspw=secondData$sales
firstData$inv1=secondData$margin
firstData$inv2=secondData$nown
firstData$ssize=secondData$nfull
firstData$start=secondData$npart
Data=firstData
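The data is split into two parts: the odd rows and the even rows. The missing values in the odd rows are then filled in with the correct values taken from the even rows. Hope this helps!
The final output is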
> firstData
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
3 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
> secondData
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2 76 16.75596 17166.67 27177.04 170 41 NA NA NA NA NA NA NA
4 192 22.49376 17166.67 27177.04 450 39 NA NA NA NA NA NA NA
> Data
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
3 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
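The same odd/even idea can be sketched directly against the space-delimited text, without a CSV copy and without naming each column. This is a hedged variant rather than the answer's own code: it assumes read.table with fill = TRUE pads the short lines with NA (as in the listing above) and that each observation spans exactly 2 lines; `input`, `odd`, and `even` are illustrative names.

```r
# Sample input as a character vector, one element per line of the file
input <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000 4411.765 41 1 1 1 1.5357",
  "76 16.75596 17166.67 27177.04 170 41",
  "1926395 4280.878 39 2 2 3 1.5357",
  "192 22.49376 17166.67 27177.04 450 39"
)

# fill = TRUE pads lines that have fewer fields than the header with NA
Data <- read.table(text = input, header = TRUE, fill = TRUE)

# Odd rows hold the first 7 values, even rows hold the remaining 6
odd  <- Data[seq(1, nrow(Data), by = 2), ]
even <- Data[seq(2, nrow(Data), by = 2), ]

# Move the 6 values of each even row into the last 6 columns of its odd row
n_first <- 7
n_rest  <- ncol(Data) - n_first
odd[, (n_first + 1):ncol(Data)] <- even[, seq_len(n_rest)]

# Renumber the rows 1, 2, ... instead of 1, 3, ...
rownames(odd) <- NULL
print(odd)
```

Using column positions keeps the code short when there are many columns, at the cost of hard-coding how many values sit on the first line of each pair.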
Answer 2 (score: 1)
Here is a data.table solution (I have copied your example into the file dfTest.txt). See the comments for an explanation:
library(data.table)
#fill=TRUE fills empty cols due to irregular structure with NAs
dt=fread("dfTest.txt",header = TRUE,sep=" ",fill=TRUE)
#cols to fix
selCols=c("hoursw","hourspw","inv1","inv2","ssize","start")
#cols from which to read
otherCols=colnames(dt)[seq_along(selCols)]
#fill missing cols from leading rows and select every 2nd row afterwards
dt[,c(selCols):=shift(.SD,n=1L,type="lead"),
.SDcols=otherCols][seq(1,nrow(dt),2),]