Question

I have a data frame that looks like this:

df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"), 
  Answer = c("Yes", "", "No", ""))

           Question Answer
1 This is the start    Yes
2     of a question       
3  This is a second     No
4          question

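This is dummy data; the real data is extracted from a PDF with tabulizer. Whenever a Question contains a line break in the source document, that question is split across multiple rows. How can I join the pieces back together, using an empty Answer as the condition?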

The desired result is simply:

                     Question     Answer
1 This is the start of a question    Yes
2       This is a second question     No

The logic is simple: if Answer[x] is empty, append Question[x] to Question[x-1] and delete row x.

Answer 1

This could no doubt be improved, but if you're happy to use the tidyverse, perhaps something like this would work?

library(dplyr)
library(tidyr)
library(stringr)

df1 %>% 
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id) %>% group_by(id) %>%
  summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))

#> # A tibble: 2 x 3
#>      id                        Question Answer
#>   <int>                           <chr> <fctr>
#> 1     1 This is the start of a question    Yes
#> 2     3       This is a second question     No
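
To see why the grouping works, it may help to look at the intermediate id column before the summarise step. The following is only a sketch: it rebuilds df1 with stringsAsFactors = FALSE (my assumption, to keep the example self-contained) and stores the intermediate result in a name of my own, ids.

```r
library(dplyr)
library(tidyr)

# Same dummy data as in the question, kept as character columns
df1 <- data.frame(Question = c("This is the start", "of a question",
                               "This is a second", "question"),
                  Answer = c("Yes", "", "No", ""),
                  stringsAsFactors = FALSE)

# Rows with an answer get their own row number as id; continuation rows get NA,
# and fill() then carries the last non-missing id downwards
ids <- df1 %>%
  mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
  fill(id)

print(ids$id)
#> [1] 1 1 3 3
```

Each continuation row ends up with the id of the question it belongs to, which is what group_by(id) relies on.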

Answer 2

If I follow your logic, the following should do it:

# test data
dff <- data.frame(Question=c("This is the start",
                             "of a question",
                             "This is a second",
                             "question",
                             "This is a third",
                             "question",
                             "and more space",
                             "yet even more space",
                             "This is actually another question"),
                  Answer = c("Yes",
                             "",
                             "No",
                             "",
                             "Yes",
                             "",
                             "",
                             "",
                             "No"),
                  stringsAsFactors = FALSE)


# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
  data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))


#                                                        Question Answer
# 1                             This is the start of a question    Yes
# 2                                   This is a second question     No
# 3 This is a third question and more space yet even more space    Yes
# 4                           This is actually another question     No

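The idea is to apply cumsum to the expression nchar(dff$Answer) > 0, which creates a grouping vector to use with the split function. After splitting on that vector, each piece is a smaller data frame; for each one you build a one-row data frame by concatenating the values in the Question column and taking the first value of the Answer column. Finally, you rbind the resulting data frames together.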
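
To make the grouping step concrete, here is a small base R sketch of what the grouping vector looks like for the test data. The names starts and grp are mine, introduced only for illustration.

```r
# Same Answer column as in the test data above
Answer <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE wherever a new question starts, i.e. the Answer is non-empty
starts <- nchar(Answer) > 0

# Running count of starts: each row gets the index of the question it belongs to
grp <- cumsum(starts)
print(grp)
#> [1] 1 1 2 2 3 3 3 3 4
```

Rows that share a value in grp are the pieces of one question, so split() hands each question to the function as its own small data frame.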

I hope this helps.

Answer 3

...another (very similar) approach using dplyr:

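library(dplyr)

# Note: lag() looks back one row only, so this assumes every question
# spans exactly two rows, as in df1.
df1 %>%
  mutate(id = cumsum(!Answer %in% c("Yes", "No")),  # running count of continuation rows
         # on a continuation row, prepend the previous row's text and take its answer
         Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
         A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
  filter(Q2 != "") %>%                              # keep only the merged rows
  select(id, Question = Q2, Answer = A2)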

R: How do I join a string that has been split across multiple rows?
