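I have a large data frame with a single column. Each entry contains several values separated by spaces, which I need to extract and organize into columns: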
<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 36621.09375 36621.09375 36621.09375 36621.09375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%), Pipistrellus nathusii (77%) Custom=false />
Here is some more information about my data:
'data.frame':39 obs. of 1 variable $ x1: Factor w/ 120 levels "
<double>25.318181818181806</double>",..: 66 67 68 69 70 71 72 73 74 75...
I need something like this:
  call_begin          call_end            maxfrec       minfrec       peakfrec
1 0.59170816000000048 0.60006400000000049 53100.5859375 43334.9609375 48217.7734375
2 0.7636582400000006  0.77135872000000061 53100.5859375 42724.609375  46997.0703125
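I have some ideas for how to do this: first split the column with strsplit, then pull out the numbers with substr, and finally build a table with rbind. I found some threads on related topics, but I could not reproduce them with my data.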
I would appreciate any help. Please let me know if anything is unclear.
Answer 0 (score: 1)
Here is a solution similar to the one you describe. It is more general and does not depend on the number of columns:
text <- '<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375
<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125'
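process_line <- function(line) {
  # Split on spaces and drop the leading "<Call" token
  sp <- strsplit(line, ' ')[[1]][-1]
  # The part before "=" is the column name, the part after it is the value
  cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
  data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
  names(data) <- cn
  data
}
# One row per line of text
t(sapply(strsplit(text, "\n")[[1]], process_line, USE.NAMES = FALSE))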
Begin End MaxFreq MinFreq PeakFreq
[1,] 0.5917082 0.6000640 53100.59 43334.96 48217.77
[2,] 0.7636582 0.7713587 53100.59 42724.61 46997.07
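This assumes that text is a single string rather than a vector with one element per line. If it is already split into lines, replace strsplit(text, "\n")[[1]] with text.
No regular expression is needed, because each field can simply be split on =.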
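For input that is already one call per element, such as a data-frame column, the note above might look like the sketch below. The data frame df and its factor column x1 are assumptions modeled on the str() output in the question, and process_line is repeated unchanged so the snippet runs on its own. It also assumes every field has the form name=number, which holds for the two sample lines but not for the full lines in the question.

```r
# process_line as defined in the answer, repeated so this runs standalone
process_line <- function(line) {
  sp <- strsplit(line, ' ')[[1]][-1]
  cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
  data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
  names(data) <- cn
  data
}

# Hypothetical data frame: one call per row, stored as a factor
df <- data.frame(
  x1 = c(
    "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
    "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
  ),
  stringsAsFactors = TRUE
)

# Convert the factor to character first; no strsplit on "\n" is needed here
calls <- t(sapply(as.character(df$x1), process_line, USE.NAMES = FALSE))
```

The as.character() call matters: strsplit() does not accept a factor, so the column has to be converted before it is split.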
Answer 1 (score: 0)
gsub is my favorite for this.
strList = list("<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375", "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125")
dataExtract <- function(str){
  # Replace the whole tag with just the five captured numbers, separated by spaces
  str = gsub("^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+)", "\\1 \\2 \\3 \\4 \\5", str)
  str = unlist(strsplit(str, " "))
  return(sapply(str, FUN=as.numeric, USE.NAMES=FALSE))
}
#dataExtract(strList[[1]])
# Iterate over strList (not the base function str) and fill the matrix row by row
res = matrix(unlist(lapply(strList, FUN=dataExtract)), ncol=5, byrow=TRUE)
colnames(res) = c("Call Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
Answer 2 (score: 0)
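It all depends on how strictly your data follows the pattern. For the data you provided, you can split on " " and "=" in a single pass and then pick out the relevant columns directly.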
result <- do.call(rbind,lapply(strList,function(s) {strsplit(s,split = "[ =]")[[1]][c(3,5,7,9,11)]}))
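You can then name the columns with colnames() (result is a matrix, so names() would not set the column names). The values are still character strings at this point, so convert them with as.numeric() if you need numbers.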
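If the lines do not follow the pattern strictly, as with the full line in the question (a PeakFreqs list with many values, a Species field containing spaces), fixed positions will not work. One possible alternative is to look up each field by name. This is only a sketch: extract_field is a hypothetical helper, and the first input line is a shortened version of the line from the question.

```r
lines <- c(
  # Shortened version of the full line in the question
  "<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39062.5 Intensity=-14.902734633213136 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%) Custom=false />",
  # Line with only the five fields of interest
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375"
)

keys <- c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")

# Hypothetical helper: return the number that follows " key=" in each line,
# or NA when the field is missing. Requiring "=" right after the key keeps
# "PeakFreq" from matching "PeakFreqs".
extract_field <- function(lines, key) {
  pattern <- paste0(" ", key, "=([-+0-9.eE]+)")
  hits <- regmatches(lines, regexec(pattern, lines))
  vapply(hits,
         function(h) if (length(h) == 2) as.numeric(h[2]) else NA_real_,
         numeric(1))
}

# One numeric column per key, one row per line
calls <- as.data.frame(lapply(setNames(keys, keys),
                              function(k) extract_field(lines, k)))
```

For a factor column like the one in the question, pass as.character(df$x1) as lines, where df stands for your own data frame.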