I have a data frame like the one below, built from text extracted from a PDF with pdftools::pdf_text:
page_id <- c("7", "7", "7", "8", "8")
element_id <- c("1", "2", "3", "1", "2")
text <- c("One morning,", "when Gregor Samsa woke from troubled dreams,", "he found himself transformed in his bed into a horrible vermin.", "He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,", "slightly domed and divided by arches into stiff sections.")
sample_data <- data.frame(page_id, element_id, text)
  page_id element_id text
1 7       1          One morning,
2 7       2          when Gregor Samsa woke from troubled dreams,
3 7       3          he found himself transformed in his bed into a horrible vermin.
4 8       1          He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,
5 8       2          slightly domed and divided by arches into stiff sections.
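The problem is that, for later text processing, I need a data frame with just two columns: page_id and the full content of each page (text). I split the data frame like this:
splitted_sampledata <- split(sample_data, sample_data$page_id)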
$`7`
  page_id element_id text
1 7       1          One morning,
2 7       2          when Gregor Samsa woke from troubled dreams,
3 7       3          he found himself transformed in his bed into a horrible vermin.

$`8`
  page_id element_id text
4 8       1          He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,
5 8       2          slightly domed and divided by arches into stiff sections.
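However, this leaves me with a list of data frames, which is not what I was after. To get the desired result, I have to concatenate the character vectors in the text column, right? How can I get a single data frame in which each row holds the "mini-document" for one page? Any help is much appreciated!
Expected output: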
  page_id text
1 7       One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
2 8       He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
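One possible way to get there, sketched in base R: group the rows by page_id and paste each page's text fragments together with a single space. This assumes the rows are already in reading order within each page and that a space is the right separator; the name pages is made up for this illustration, and sample_data is rebuilt here only so the snippet runs on its own.

```r
# Rebuild the sample data from the question so the snippet is self-contained.
page_id <- c("7", "7", "7", "8", "8")
element_id <- c("1", "2", "3", "1", "2")
text <- c("One morning,",
          "when Gregor Samsa woke from troubled dreams,",
          "he found himself transformed in his bed into a horrible vermin.",
          "He lay on his armour-like back, and if he lifted his head a little he could see his brown belly,",
          "slightly domed and divided by arches into stiff sections.")
sample_data <- data.frame(page_id, element_id, text, stringsAsFactors = FALSE)

# For every page_id, collapse all text fragments into one string.
# collapse = " " is passed through aggregate() to paste().
# "pages" is a hypothetical name chosen for this sketch.
pages <- aggregate(text ~ page_id, data = sample_data,
                   FUN = paste, collapse = " ")

print(pages)
```

The split() step is not needed with this approach, since aggregate() does the grouping itself. If the fragments might not be in order, sorting sample_data by page_id and a numeric element_id beforehand would be a sensible precaution.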