I have a data frame like the one below:
df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"),
Answer = c("Yes", "", "No", ""))
           Question Answer
1 This is the start    Yes
2     of a question
3  This is a second     No
4          question
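This is dummy data; the real data was extracted from a PDF with tabulizer. Wherever the source document has a line break inside a Question, that question ends up split across several rows. How can I join the pieces back together, using an empty Answer as the condition?
The desired result is simply: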
                         Question Answer
1 This is the start of a question    Yes
2       This is a second question     No
The logic is simple: if Answer[x] is empty, append Question[x] to Question[x-1] and delete row x.
Answer 0 (score: 3)
This can no doubt be improved, but if you are happy to use the tidyverse, perhaps something like this would work?
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
fill(id) %>% group_by(id) %>%
summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))
#> # A tibble: 2 x 3
#> id Question Answer
#> <int> <chr> <fctr>
#> 1 1 This is the start of a question Yes
#> 2 3 This is a second question No
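To see how the grouping key is built, the first two steps of the pipeline can be run on their own. This is a minimal sketch, assuming dplyr and tidyr are installed. The name `filled` is only illustrative, and the data frame is rebuilt with stringsAsFactors = FALSE so that the block runs standalone.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, with character columns
df1 <- data.frame(
  Question = c("This is the start", "of a question",
               "This is a second", "question"),
  Answer = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE
)

# Rows with an Answer get their own row number as id; continuation rows get NA,
# which fill() then replaces with the id of the nearest row above
filled <- df1 %>%
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id)

print(filled$id)
#> [1] 1 1 3 3
```

Every row of a split question now carries the id of the row where that question started, which is what group_by(id) relies on.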
Answer 1 (score: 1)
If I follow your logic, the following should do it:
# test data
dff <- data.frame(Question=c("This is the start",
"of a question",
"This is a second",
"question",
"This is a third",
"question",
"and more space",
"yet even more space",
"This is actually another question"),
Answer = c("Yes",
"",
"No",
"",
"Yes",
"",
"",
"",
"No"),
stringsAsFactors = FALSE)
# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))
# Question Answer
# 1 This is the start of a question Yes
# 2 This is a second question No
# 3 This is a third question and more space yet even more space Yes
# 4 This is actually another question No
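The idea is to apply cumsum to the expression nchar(dff$Answer)>0. Each non-empty Answer increments the running total, so every row of a given question gets the same number, which gives a grouping vector to pass to split. After splitting on that vector, each piece is turned into a one-row data frame by pasting together the values in the Question column and taking the first value of the Answer column. Finally, rbind stacks the resulting data frames.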
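The grouping vector described above can be inspected by itself. A small base R sketch, using only the Answer values from the test data; the names `answer`, `starts` and `grp` are illustrative:

```r
# Answer column from the test data above
answer <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE wherever a new question starts (non-empty Answer)
starts <- nchar(answer) > 0

# Running count of question starts: one group number per row
grp <- cumsum(starts)
print(grp)
# [1] 1 1 2 2 3 3 3 3 4

# Number of rows that split() would place in each group
print(lengths(split(answer, grp)))
# 1 2 3 4
# 2 2 4 1
```

Rows 5 to 8 all share group 3, which is why the third question is reassembled from four rows.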
I hope this helps.
Answer 2 (score: 0)
Another (very similar) approach using dplyr:
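library(dplyr)
# Note: lag() looks back one row only, so this assumes every question
# spans exactly two rows (one row with an Answer, one continuation row).
df1 %>% mutate(id = cumsum(!Answer %in% c('Yes', 'No')),
               Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
               A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
  filter(Q2 != "") %>%
  select(id, Question = Q2, Answer = A2)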
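Because lag() only reaches one row back, the code above pairs each empty-Answer row with the single row before it. If a question can spill over more than two rows, one possible variant is to group on a running count of non-empty answers instead. This is a sketch rather than the answerer's method; it assumes dplyr is installed, and `df_multi` is made-up data with an extra continuation row added for illustration.

```r
library(dplyr)

# Hypothetical data: the first question spans three rows
df_multi <- data.frame(
  Question = c("This is the start", "of a question", "and one more line",
               "This is a second", "question"),
  Answer = c("Yes", "", "", "No", ""),
  stringsAsFactors = FALSE
)

# Each non-empty Answer starts a new group; continuation rows inherit its id
res <- df_multi %>%
  group_by(id = cumsum(Answer != "")) %>%
  summarise(Question = paste(Question, collapse = " "),
            Answer = first(Answer))

print(res$Question)
#> [1] "This is the start of a question and one more line"
#> [2] "This is a second question"
```

Unlike the lag-based version, this also keeps questions that occupy a single row.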