Say I have this data:
df <- data.frame(
text = c("Treatment1: This text is","on two lines","","Treatment2:This text","has","three lines","","Treatment3: This has one")
)
df
text
1 Treatment1: This text is
2 on two lines
3
4 Treatment2:This text
5 has
6 three lines
7
8 Treatment3: This has one
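How can I parse this text so that each "Treatment" is on its own row, with all the text from the lines below it joined onto that same row?
For example, this is the desired output: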
text
1 Treatment1: This text is on two lines
2 Treatment2: This text has three lines
3 Treatment3: This has one
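Can anyone recommend a way to do this?
Answer 0: (score: 2)
Maybe something like the following.
First, the data in dput format, which is the best format for sharing a dataset in a post.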
df <-
structure(list(text = c("Treatment1: This text is", "on two lines",
"", "Treatment2:This text", "has", "three lines", "", "Treatment3: This has one"
)), .Names = "text", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
Now the base R code.
# Each row matching "treatment" starts a new group; cumsum() numbers the groups.
fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))
result <- do.call(rbind, lapply(split(df, fact), function(x)
trimws(paste(x$text, collapse = " "))))
result <- as.data.frame(result)
names(result) <- "text"
result
# text
#1 Treatment1: This text is on two lines
#2 Treatment2:This text has three lines
#3 Treatment3: This has one
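To make the grouping step easier to follow, here is a minimal sketch that prints the intermediate vector. It rebuilds the question's data so it can run on its own; `starts` is a name introduced here for illustration and is not part of the answer.

```r
# Same example data as in the question.
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)

# TRUE on every row that contains "treatment" (case-insensitive).
starts <- grepl("treatment", df$text, ignore.case = TRUE)

# The running count of TRUEs labels each row with its group number.
fact <- cumsum(starts)
fact
# [1] 1 1 1 2 2 2 2 3
```

Rows that share a number are pasted together. If a continuation line could itself contain the word "treatment", anchoring the pattern as "^treatment" would probably be safer.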
Edit
As Rich Scriven points out in the comments, tapply can greatly simplify the code above. (I didn't see it; sometimes I overcomplicate things.)
result2 <- data.frame(
text = tapply(df$text, fact, function(x) trimws(paste(x, collapse = " ")))
)
all.equal(result, result2)
#[1] "Component “text”: 'current' is not a factor"
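The message above points to a difference in column type rather than in content: one `text` column is a factor and the other is not. That is what the pre-4.0.0 default of `stringsAsFactors = TRUE` would produce. A sketch that compares only the values, assuming it is acceptable to treat both columns as plain character vectors, could look like this (`collapse_lines` is a hypothetical helper name):

```r
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)
fact <- cumsum(grepl("treatment", df$text, ignore.case = TRUE))

# Hypothetical helper: join the lines of one group and trim the ends.
collapse_lines <- function(x) trimws(paste(x, collapse = " "))

# First approach: split + lapply + rbind.
result <- as.data.frame(
  do.call(rbind, lapply(split(df$text, fact), collapse_lines)),
  stringsAsFactors = FALSE
)
names(result) <- "text"

# Second approach: tapply.
result2 <- data.frame(
  text = tapply(df$text, fact, collapse_lines),
  stringsAsFactors = FALSE
)

# Compare the values only, ignoring names and factor/character differences.
same_text <- identical(unname(as.character(result$text)),
                       unname(as.character(result2$text)))
stopifnot(same_text)
```

Passing `stringsAsFactors = FALSE` explicitly keeps the behaviour the same on older and newer R versions.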
Answer 1: (score: 0)
# Join all rows into one string, mark each treatment boundary, then split on the marker.
x <- gsub("\\s+Treatment", "*BREAK*Treatment",
paste(df[[1]], collapse = " "))
data.frame(text = unlist(strsplit(x, "\\*BREAK\\*")))
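Note that the desired output in the question shows "Treatment2: This text" with a space after the colon, while the input has "Treatment2:This text". A minimal sketch of one way to normalize that, assuming the first colon on each row is always the one that follows the treatment label (`joined` and `pieces` are names introduced here for illustration):

```r
df <- data.frame(
  text = c("Treatment1: This text is", "on two lines", "",
           "Treatment2:This text", "has", "three lines", "",
           "Treatment3: This has one"),
  stringsAsFactors = FALSE
)

# Join everything, mark the treatment boundaries, and split on the marker.
joined <- paste(df$text, collapse = " ")
pieces <- strsplit(gsub("\\s+Treatment", "*BREAK*Treatment", joined),
                   "\\*BREAK\\*")[[1]]

# Replace the first colon and any whitespace after it with ": ".
out <- data.frame(text = sub(":\\s*", ": ", pieces), stringsAsFactors = FALSE)
out
```

The same `sub()` call could be applied to the `text` column produced by the other answer.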