Thank you very much for your help with this!
I have ~4.5k txt files that look like this:
Simple statistics using MSPA parameters: 8_3_1_1 on input file: 20130815 104359 875 000000 0528 0548_result.tif
MSPA-class [color]: Foreground/data pixels [%] Frequency
============================================================
CORE(s) [green]: -- 0
CORE(m) [green]: 48.43/13.45 1
CORE(l) [green]: -- 0
ISLET [brown]: 3.70/ 1.03 20
PERFORATION [blue]: 0.00/ 0.00 0
EDGE [black]: 30.93/ 8.59 11
LOOP [yellow]: 9.66/ 2.68 6
BRIDGE [red]: 0.00/ 0.00 0
BRANCH [orange]: 7.28/ 2.02 40
Background [grey]: --- /72.22 11
Missing [white]: 0.00 0
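I want to read all the txt files in a directory into R and then rearrange them before merging them.
The values in the txt files can vary, so where there is a 0.00 now, other files may hold a relevant number (so we need those). For the fields that currently show --, it would be good if the script tested whether the field holds -- or a number. If it holds --, it should become NA. True 0.00 values, on the other hand, are meaningful and I need them. The Missing [white] row has only one value; that value should be copied into both columns, Foreground % and Data pixels %.
The general rearrangement I need is to have all the data as columns, with only 1 row per txt file. For each data line in a txt file there should be 3 columns in the output (Foreground %, Data pixels % and Frequency for each color). The name of the row should be the image name mentioned at the start of the file, here: 20130815 104359 875 000000 0528 0548
The rest can be omitted.
The output should look like this:
I am working on this myself in parallel, but I am not sure which direction to take. So any help is very welcome!
Best, Moritz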
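Answer 0 (score: 0)
I copied and pasted the data into a text file and adjusted the spaces so that they are consistent. You may want to do the same, or, if you can attach the text files, they would be easy to work with. You can use pastebin - http://en.wikipedia.org/wiki/Pastebin
First set the working directory as follows: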
setwd("path of your file")
# EDIT: create a single data frame from all files
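split.row.data <- function(x){
  # remove the first run of spaces (the leading indentation in the raw file)
  a1 = sub("( )+(.*)", '\\2', x)
  # everything after the colon, split on single spaces
  b1 = unlist(strsplit(sub("( )+(.*)", '\\2', (strsplit(a1, ":"))[[1]][2]), " "))
  # split the "foreground/data pixels" field on the slash
  c1 = unlist(strsplit(b1[1], "/"))
  if(length(c1) == 1){
    # only one value present: pad the other side with NA
    if(which(b1[1:2] %in% "") == 1){
      c1 = c(NA, c1)
    }else if(which(b1[1:2] %in% "") == 2){
      c1 = c(c1, NA)
    }
  }
  # dash placeholders become NA
  c1[which(c1 %in% c("--", "--- "))] <- NA
  # class and color, the two percentages, then the frequency (last token)
  return(c(unlist(strsplit(strsplit(a1, ":")[[1]][1], " ")),
           c1,
           b1[length(b1)]))
}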
df2 <- data.frame(matrix(nrow = 1, ncol = 6), stringsAsFactors = FALSE)
file_list = list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE)
for (infile in file_list){
  file_data <- readLines(con <- file(infile))
  close(con)
  # image name: the text between "input file:" and ".tif"
  filename = sub("(.*)(input file:)(.*)(\\.tif)", "\\3", file_data[3])
  # the data rows start at line 7 in the files used here
  a2 <- file_data[7:length(file_data)]
  d1 = lapply(a2, function(x) split.row.data(x))
  df1 <- data.frame(matrix(nrow= length(d1), ncol = 5), stringsAsFactors = FALSE)
  for(i in seq_along(d1)){df1[i, ] <- d1[[i]]}
  df1 <- cbind(data.frame(rep(filename, nrow(df1)), stringsAsFactors = FALSE),
               df1)
  colnames(df1) <- colnames(df2)
  df2 <- rbind(df2, df1)
}
# drop the all-NA placeholder row used to initialise df2
df2 <- df2[2:nrow(df2), ]
df2[,4] <- as.numeric(df2[,4])
df2[,5] <- as.numeric(df2[,5])
df2[,6] <- as.numeric(df2[,6])
# strip the square brackets from the color column
e1 = unlist(lapply(df2[,3], function(x) gsub(']', '', x)))
df2[,3] = unlist(lapply(e1, function(x) gsub("[[]", '', x)))
# build the column names from the header line of the last file read
header_names <- unlist(lapply(strsplit(file_data[5], "/"), function(x) strsplit(x, "  ")))
colnames(df2) <- c("filename",
                   strsplit(header_names[1], " ")[[1]][2],
                   "color",
                   header_names[2:length(header_names)])
row.names(df2) <- 1:nrow(df2)
Output:
print(head(df2))
filename MSPA-class color Foreground data pixels [%] Frequency
1 20130815 103739 599 000000 0944 0788 CORE(s) green NA NA 0
2 20130815 103739 599 000000 0944 0788 CORE(m) green 63.46 17.41 1
3 20130815 103739 599 000000 0944 0788 CORE(l) green NA NA 0
4 20130815 103739 599 000000 0944 0788 ISLET brown 0.00 0.00 0
5 20130815 103739 599 000000 0944 0788 PERFORATION blue 0.00 0.00 0
6 20130815 103739 599 000000 0944 0788 EDGE black 35.00 9.60 1
# get the data for "Background" only, from the "MSPA-class" column
df2_background <- df2[which(df2[, "MSPA-class"] %in% "Background"), ]
print(df2_background)
filename MSPA-class color Foreground data pixels [%] Frequency
11 20130815 103739 599 000000 0944 0788 Background grey NA 72.57 1
22 20130815 143233 712 000000 1048 0520 Background grey NA 77.51 1
33 20130902 163929 019 000000 0394 0290 Background grey NA 54.55 6
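The question asks for one row per file, whereas `df2` above is in long format (one row per file and class). A minimal sketch of that last step with base R's `reshape` follows. It uses a small toy data frame rather than `df2` itself; the column names `class`, `foreground`, `data_px` and `frequency` and the file names `img_a` and `img_b` are placeholders chosen for illustration, not names from the code above.

```r
# Toy long-format data shaped like df2 above (simplified, hypothetical names).
long <- data.frame(
  filename   = rep(c("img_a", "img_b"), each = 2),
  class      = rep(c("ISLET", "EDGE"), times = 2),
  foreground = c(3.70, 30.93, 3.14, 34.82),
  data_px    = c(1.03, 8.59, 0.88, 9.75),
  frequency  = c(20, 11, 11, 1),
  stringsAsFactors = FALSE
)

# One row per file; each value column is repeated once per class,
# giving names such as foreground_ISLET and frequency_EDGE.
wide <- reshape(long, direction = "wide", idvar = "filename",
                timevar = "class", sep = "_")

# Use the image name as the row name, as requested in the question.
rownames(wide) <- wide$filename
```

To apply the same idea to `df2`, the color column would probably have to be dropped or folded into the class name first, since `reshape` would otherwise treat it as one more value column.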
Answer 1 (score: 0)
I think this gets it into the format you want, but the example does not match your picture, so I cannot be sure:
(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
  ## assuming the file name is after "file: " until the end of the string
  ## and ends in .tif
  img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
  ## removes each string up to and including the ===== string
  rl <- rl[-(1:grep('==', rl))]
  ## remove leading whitespace
  rl <- gsub('^\\s+', '', rl)
  ## split the remaining lines by larger chunks of whitespace
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
  ## more cleaning, setting attributes, etc
  mat[mat == '--'] <- NA
  mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
  as.data.frame(mat)
})
I created three files from your example and made each one slightly different, to show how this works on a directory containing multiple files:
# [[1]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20130815 104359 875 000000 0528 0548_result <NA> 48.43/13.45 <NA> 3.70/ 1.03 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[2]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20139341 104359 875 000000 0528 0548_result 23 48.43/13.45 23 <NA> 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[3]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20132343 104359 875 000000 0528 0548_result <NA> <NA> <NA> <NA> <NA> 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 <NA> 0.00
Edit
Made some changes to extract all of the information:
(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
res <- lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
  img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
  rl <- rl[-(1:grep('==', rl))]
  rl <- gsub('^\\s+', '', rl)
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
  dat <- as.data.frame(mat, stringsAsFactors = FALSE)
  ## split the "foreground/data pixels" field into two columns
  tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
                      c('Foreground','Data pixels'))
  dat <- cbind(dat[, -2], tmp, image_name = img_name)
  dat[] <- lapply(dat, as.character)
  dat[dat == ''] <- NA
  names(dat)[1:2] <- c('MSPA-class','Frequency')
  ## one wide row per image
  zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'MSPA-class')
  names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
  zzz
})
Here is the result (I just converted it to a long matrix so that it is easier to read; the real result is a very wide data frame, one per file):
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 104359 875 000000 0528 0548_result"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "48.43"
# CORE(m)_Data pixels "13.45"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "20"
# ISLET_Foreground "3.70"
# ISLET_Data pixels "1.03"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "11"
# EDGE_Foreground "30.93"
# EDGE_Data pixels "8.59"
# LOOP_Frequency "6"
# LOOP_Foreground "9.66"
# LOOP_Data pixels "2.68"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "40"
# BRANCH_Foreground "7.28"
# BRANCH_Data pixels "2.02"
# Background_Frequency "11"
# Background_Foreground "NA"
# Background_Data pixels "72.22"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
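The field-level rules from the question (`--` becomes NA, a real 0.00 is kept, and the single value on the Missing row is copied to both columns) can also be written out as a small standalone helper. This is a sketch only: `parse_pct` is a hypothetical name that is not part of the code above, and it assumes the field arrives as one string such as `"48.43/13.45"` or `"--- /72.22"`.

```r
# Hypothetical helper: turn one value field into c(foreground, data_pixels).
parse_pct <- function(field) {
  # split on the slash and drop the padding around each part
  parts <- trimws(strsplit(field, "/", fixed = TRUE)[[1]])
  # a lone value (the Missing row, or a bare "--") applies to both columns
  if (length(parts) == 1) parts <- rep(parts, 2)
  # dash-only placeholders such as "--" or "---" become NA
  parts[grepl("^-+$", parts)] <- NA
  as.numeric(parts)
}

parse_pct("48.43/13.45")  # two numbers
parse_pct("--- /72.22")   # NA for the foreground, 72.22 for the data pixels
parse_pct("--")           # NA in both positions
parse_pct("0.00")         # the single value repeated in both positions
```

Because the dashes are replaced before `as.numeric` is called, the conversion does not emit coercion warnings for placeholder fields.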
With your sample data:
lf <- list.files('~/desktop/data', pattern = '.txt', full.names = TRUE)
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 103704 780 000000 0372 0616"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "54.18"
# CORE(m)_Data pixels "15.16"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "11"
# ISLET_Foreground "3.14"
# ISLET_Data pixels "0.88"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "1"
# EDGE_Foreground "34.82"
# EDGE_Data pixels "9.75"
# LOOP_Frequency "1"
# LOOP_Foreground "4.96"
# LOOP_Data pixels "1.39"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "20"
# BRANCH_Foreground "2.89"
# BRANCH_Data pixels "0.81"
# Background_Frequency "1"
# Background_Foreground "NA"
# Background_Data pixels "72.01"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
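Since `res` is a list with one wide row per file, a final step is still needed to get the single merged table the question asks for. A minimal sketch, assuming every file produces the same set of columns; `res_demo` is a made-up stand-in for `res` with only a few columns, and `img_a` and `img_b` are placeholder image names.

```r
# Stand-ins for two elements of res: one-row, all-character data frames.
res_demo <- list(
  data.frame(image_name = "img_a", ISLET_Frequency = "20",
             ISLET_Foreground = "3.70", CORE_s_Foreground = NA_character_,
             stringsAsFactors = FALSE),
  data.frame(image_name = "img_b", ISLET_Frequency = "11",
             ISLET_Foreground = "3.14", CORE_s_Foreground = NA_character_,
             stringsAsFactors = FALSE)
)

# Stack the per-file rows; this assumes identical column names in every element.
combined <- do.call(rbind, res_demo)

# Convert everything except the image name from character to numeric.
# Real 0.00 values stay 0 and NA stays NA.
num_cols <- setdiff(names(combined), "image_name")
combined[num_cols] <- lapply(combined[num_cols], as.numeric)

# Use the image name as the row name, as requested in the question.
rownames(combined) <- combined$image_name
```

With ~4.5k files, stacking once with `do.call(rbind, ...)` after the `lapply` is likely to be noticeably faster than growing a data frame with `rbind` inside a loop.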