Question

I have a data frame like this, built from text extracted from a PDF with pdftools::pdf_text:

page_id <- c("7", "7", "7", "8", "8")
element_id <- c("1", "2", "3", "1", "2")
text <- c("One morning,", "when Gregor Samsa woke from troubled dreams,", "he found himself transformed in his bed into a horrible vermin.", "He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,", "slightly domed and divided by arches into stiff sections.")
sample_data <- data.frame(page_id, element_id, text, stringsAsFactors = FALSE)

  page_id element_id                                                                                             text
1       7          1                                                                                     One morning,
2       7          2                                                     when Gregor Samsa woke from troubled dreams,
3       7          3                                  he found himself transformed in his bed into a horrible vermin.
4       8          1 He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,
5       8          2                                        slightly domed and divided by arches into stiff sections.

The problem is that for later text processing I need a data frame with two columns: page_id and the full content of each page (text). I split the data frame with: splitted_sampledata <- split(sample_data, sample_data$page_id)

$`7`
  page_id element_id                                                            text
1       7          1                                                    One morning,
2       7          2                    when Gregor Samsa woke from troubled dreams,
3       7          3 he found himself transformed in his bed into a horrible vermin.

$`8`
  page_id element_id                                                                                             text
4       8          1 He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,
5       8          2                                        slightly domed and divided by arches into stiff sections.

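However, this leaves me with a list of data frames, which is not what I wanted. To get the desired result, I have to concatenate the character vectors in the text column, right? How can I get a data frame in which each row holds one "mini document"? Any help is greatly appreciated!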
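Continuing from the split list, one possible approach is to collapse the text column of each piece with paste(..., collapse = " ") and rebuild a data frame from the result. This is only a sketch in base R: the names page_text and pages are made up for illustration, and a single space is assumed to be the right separator between elements.

```r
# Sample data as in the question
sample_data <- data.frame(
  page_id = c("7", "7", "7", "8", "8"),
  element_id = c("1", "2", "3", "1", "2"),
  text = c("One morning,",
           "when Gregor Samsa woke from troubled dreams,",
           "he found himself transformed in his bed into a horrible vermin.",
           "He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,",
           "slightly domed and divided by arches into stiff sections."),
  stringsAsFactors = FALSE
)

splitted_sampledata <- split(sample_data, sample_data$page_id)

# Collapse each page's text elements into one string (hypothetical name: page_text)
page_text <- vapply(
  splitted_sampledata,
  function(d) paste(d$text, collapse = " "),
  character(1)
)

# Rebuild a two-column data frame; the list names are the page ids
pages <- data.frame(
  page_id = names(page_text),
  text = unname(page_text),
  stringsAsFactors = FALSE
)
```

Because page_id is a character vector, split orders the pages as strings, so a page "10" would come before page "7". Converting page_id to integer first would avoid that if numeric order matters.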

Expected output:

  page_id                                                                                                                                                       text
1       7                                  One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
2       8 He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
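A table of this shape can probably be produced without the intermediate split, by grouping on page_id and pasting the text values together. The following is a minimal sketch using base R's aggregate, not a definitive answer. It assumes the rows are already in reading order within each page and that a single space is the desired separator; the name result is chosen here for illustration.

```r
# Sample data as in the question
sample_data <- data.frame(
  page_id = c("7", "7", "7", "8", "8"),
  element_id = c("1", "2", "3", "1", "2"),
  text = c("One morning,",
           "when Gregor Samsa woke from troubled dreams,",
           "he found himself transformed in his bed into a horrible vermin.",
           "He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,",
           "slightly domed and divided by arches into stiff sections."),
  stringsAsFactors = FALSE
)

# Group by page_id and join each group's text values with a space;
# collapse = " " is passed through to paste()
result <- aggregate(text ~ page_id, data = sample_data, FUN = paste, collapse = " ")
```

If the elements might be out of order, sorting sample_data by page_id and element_id (converted to integer) before aggregating would be a sensible extra step.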

How to split a data frame and concatenate char vectors?

0 answers: