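Update 1
I am linking the actual dataset, because the solutions provided for the sample data did not work for me.
Link: https://app.box.com/s/65j1enr13pi51i44mfrymccklw1artot
Note that LOT is the end-of-row marker.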
-
My data frame looks like this (a single column):
D
2
f
h
k
END_ROW_WORD
k
1
2
END_ROW_WORD
e
g
j
2
k
END_ROW_WORD
I would like to convert it to the following format:
As you can see, a specific word (END_ROW_WORD) marks the end of each row.
Answer 0: (score: 1)
Here is an approach similar to Alejandro's, but using split instead of a for loop:
colstarts <- diff(c(0, which(df == "END_ROW_WORD")))
rows <- split(df[[1]], rep(1:length(colstarts), colstarts))
rows <- lapply(rows, `length<-`, max(lengths(rows)))
as.data.frame(do.call(rbind, rows))
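The snippet above relies on an existing df. As a minimal self-contained sketch of the same split-and-pad idea, the version below assumes the column is a plain character vector (the name x and the intermediate variables are only illustrative) and that the data ends with a marker. If the column is a factor, as in the dput shown in another answer, converting it with as.character first is probably safer, since rbind on factors would return the integer codes.

```r
# Sample column from the question, assumed here to be a character vector
x <- c("D", "2", "f", "h", "k", "END_ROW_WORD",
       "k", "1", "2", "END_ROW_WORD",
       "e", "g", "j", "2", "k", "END_ROW_WORD")

# Positions of the end-of-row markers and the length of each row
ends <- which(x == "END_ROW_WORD")
row_lengths <- diff(c(0, ends))

# One group id per element, repeated once for every element of that row
group <- rep(seq_along(row_lengths), row_lengths)

# Split into rows, then pad the shorter rows with NA up to the longest one
rows <- split(x, group)
rows <- lapply(rows, `length<-`, max(lengths(rows)))

result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
result
```

Rows shorter than the longest one are padded with NA on the right, so the marker does not end up in the same column for every row.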
Answer 1: (score: 1)
A solution without a for loop, but using stringr:
library(stringr)
new_text <- str_c(df$V1, collapse = " ")
new_text <- str_replace_all(new_text, "END_ROW_WORD", "END_ROW_WORD\n")
read.table(text = new_text, fill = TRUE)
# V1 V2 V3 V4 V5 V6
# 1 D 2 f h k END_ROW_WORD
# 2 k 1 2 END_ROW_WORD
# 3 e g j 2 k END_ROW_WORD
Data
df <-
structure(list(V1 = structure(c(3L, 2L, 6L, 8L, 10L, 5L, 10L, 1L, 2L, 5L, 4L, 7L, 9L, 2L, 10L, 5L),
.Label = c("1", "2", "D", "e", "END_ROW_WORD", "f", "g", "h", "j", "k"),
class = "factor")),
.Names = "V1", class = "data.frame", row.names = c(NA, -16L))
Answer 2: (score: 0)
This may not be the best approach, but it works:
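# Assumes `data` is a character vector, e.g. data <- as.character(df$V1)
# Positions of the end-of-row markers
pos_help = which(grepl("END_ROW_WORD", data))
d = list()
for(i in seq_along(pos_help)){
  if(i == 1){
    # First row: from the start up to the first marker
    d[[i]] = data[1:pos_help[1]]
  } else {
    # Later rows: from just after the previous marker up to the current one
    d[[i]] = data[(pos_help[i-1]+1):pos_help[i]]
  }
}
# Pad every row with NA to the length of the longest row, then bind the rows
dataFrame = do.call(rbind, lapply(d, "length<-", max(lengths(d))))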
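Answer 3: (score: 0)
The following first adds a newline character "\n" after each "END_ROW_WORD" marker and then pastes the result into one long string.
It then uses read.table to read the data in from a text connection.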
end <- "END_ROW_WORD"
inx <- c(0, grep(end, dat[[1]]))
s <- NULL
for(i in seq_along(inx)[-1]){
s <- c(s, dat[[1]][(inx[(i - 1)] + 1):inx[i]], "\n")
}
con <- textConnection(paste(s, collapse = " "))
result <- read.table(con, fill = TRUE)
close(con)
result
# V1 V2 V3 V4 V5 V6
#1 D 2 f h k END_ROW_WORD
#2 k 1 2 END_ROW_WORD
#3 e g j 2 k END_ROW_WORD
Data.
dat <-
structure(list(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD",
"k", "1", "2", "END_ROW_WORD", "e", "g", "j", "2", "k", "END_ROW_WORD"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -16L
))
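Edit.
After the OP edited the question, I revised the code to see whether the file could be read into a data.frame correctly. The main difficulty is that the file contains many non-printable characters, and read.table was unable to reach the end of the file.
Credit goes to the accepted answer in read.csv warning 'EOF within quoted string' prevents complete reading of file. I have upvoted both the question and the answer.
Credit must also go to @kath: the idea in that answer of using string replacement to insert newlines as EOL markers is much better than my ugly for loop above. Unlike kath, I use only base R; I don't think it is necessary to load an external package.
Now the revised code.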
# Use this first pattern if AUCTION also marks the end of a row
#pattern <- "(^LOT|^AUCTION)"
pattern <- "(^LOT)"
dat <- readLines("data_.csv")
s <- gsub("[[:cntrl:]]", "", dat)
s <- sub(pattern, "\\1\n", s)
con <- textConnection(paste(s, collapse = "\t"))
result <- read.table(con, sep = "\t", fill = TRUE, quote = "", row.names = NULL)
close(con)
head(result)
tail(result)
str(result)
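To illustrate the two safeguards used in the code above, here is a small sketch on made-up input (the strings below are not taken from the OP's file). It shows that [[:cntrl:]] removes control characters, and that quote = "" makes read.table treat an apostrophe as ordinary text. With the default quote characters, an unbalanced quote can cause the following lines to be swallowed, which is the "EOF within quoted string" situation mentioned above.

```r
# (a) Strip control characters; note that tabs and carriage returns
#     are control characters too, so they are removed as well
dirty <- c("vase\001", "bowl\r")
clean <- gsub("[[:cntrl:]]", "", dirty)
clean

# (b) Tab-separated text with an unbalanced apostrophe in the first row.
#     quote = "" disables quote handling, so the apostrophe is kept as-is
txt <- "1\tJohn's vase\tLOT\n2\tplain bowl\tLOT"
strict <- read.table(text = txt, sep = "\t", quote = "",
                     stringsAsFactors = FALSE)
strict
```

Because the control-character pattern also removes tabs, this cleaning step is only safe when each input line holds a single field, which is what the code above assumes about the file.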
I thought there might be some empty rows, so I checked with the following code.
#
# See if there are any empty rows
#
empty <- apply(result, 1, function(x) nchar(trimws(paste0(x, collapse = ""))) == 0)
sum(empty)
#[1] 0
Answer 4: (score: 0)
No loops, but using map and split... (because why not :p)
library(tidyverse)
df <- tibble(x=c(
"D",
"2",
"f",
"h",
"k",
"END_ROW_WORD",
"k",
"1",
"2",
"END_ROW_WORD",
"e",
"g",
"j",
"2",
"k",
"END_ROW_WORD"
)
)
split(df,cut(1:16,breaks=c(0,which(df == "END_ROW_WORD")))) %>%
map_dfc(~rbind(.x,tibble(x=rep(NA,(6-nrow(.x)))))) %>%
t() %>% as.data.frame()