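I have a data frame extracted from a PDF in which some text that should be on a single line is now spread across a varying number of rows, like this: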
df_missing = data.frame(group = c("East","","","West","","",""),
order = c("this","is supposed to be","one line","this","is supposed to be","one line","too"))
How can I fix the data frame so that the split lines are collapsed, like this?
df_correct = data.frame(group = c("East","West"), order = c("this is supposed to be one line", "this is supposed to be one line too"))
Answer 0 (score: 1)
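We can do this in several ways. One way is to create a grouping variable from the cumulative sum of a logical vector that flags the non-blank elements of 'group', and then summarise by pasting together the elements of 'order' within each group:
library(dplyr)
df_missing %>%
group_by(group1 = cumsum(group != "")) %>%
summarise(group = first(group), order = paste(order, collapse= ' ')) %>%
select(-group1)
# A tibble: 2 x 2
# group order
# <fct> <chr>
#1 East this is supposed to be one line
#2 West this is supposed to be one line too
Alternatively, instead of creating a new grouping column, use the cumsum as an index to populate the unique non-blank elements of 'group':
df_missing %>%
group_by(group = unique(group[group!=""])[cumsum(group != "")]) %>%
summarise(order = paste(order, collapse=' '))
Another option is to change the blanks to NA, fill each NA with the preceding non-NA value, group by 'group', and paste 'order' as above.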
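The NA-and-fill option is described above only in words, so here is a minimal sketch of how it might look. It assumes the tidyr package is available for fill(), uses dplyr's na_if() to turn blanks into NA, and builds the data frame with stringsAsFactors = FALSE so that 'group' is a character column; the name df_filled is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, with character (not factor) columns (assumption)
df_missing <- data.frame(
  group = c("East", "", "", "West", "", "", ""),
  order = c("this", "is supposed to be", "one line",
            "this", "is supposed to be", "one line", "too"),
  stringsAsFactors = FALSE
)

df_filled <- df_missing %>%
  mutate(group = na_if(group, "")) %>%   # blanks become NA
  fill(group) %>%                        # carry the last non-NA value downward
  group_by(group) %>%
  summarise(order = paste(order, collapse = " "))

df_filled
```

Note that grouping by the filled 'group' values would merge separate blocks that happen to share the same name, whereas the cumsum-based grouping keeps each block distinct.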
Answer 1 (score: 0)
A similar idea to @akrun's answer, as a data.table solution:
library(data.table)
setDT(df_missing)[,.(group=group[1], order = paste(order, collapse= ' ')),by=cumsum(group != "")][,-1]
# group order
#1: East this is supposed to be one line
#2: West this is supposed to be one line too