I tried this code:
library("spacyr")
library("dplyr", warn.conflicts = FALSE)
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41))
df2 <- tidyr::separate_rows(mytext, text)
df3 <- data.frame(text = df2$text, id = df2$id)
dflemma <- spacy_parse(structure(df3$text, names = df3$id),
                       lemma = TRUE, pos = FALSE) %>%
  mutate(id = doc_id) %>%
  group_by(id) %>%
  summarize(body = paste(lemma, collapse = " "))
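The desired output goes from long back to wide format, with one row per ID and the lemmatized text joined by spaces. This is the expected output:
data.frame(text = c("test text", "section 2 send"),
           id = c(32, 41))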
But the code gives this error:
Error in process_document(x, multithread) : Docnames are duplicated.
Answer 0 (score: 2)
Try the following base R solution on your df3:
#Code
dflemma <- aggregate(text~id,df3,function(x) paste(x,collapse = ' '))
Output:
id text
1 32 test text
2 41 section 2 sending
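Answer 1 (score: 1)
The error occurs because you split each text phrase into separate words, so the same id ends up naming several documents. You should not do that. Consider the following code: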
mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
dflemma <-
spacy_parse(structure(mytext$text, names = mytext$id), lemma = TRUE, pos = FALSE) %>%
group_by(id = doc_id) %>%
summarise(text = paste(lemma, collapse = " "))
Output
> dflemma
# A tibble: 2 x 2
id text
<chr> <chr>
1 32 test text
2 41 section 2 send
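To see where the duplication comes from, here is a minimal base R sketch that does not need spaCy. It splits the phrases on single spaces with strsplit, which is an assumption standing in for tidyr::separate_rows, and then inspects the names that would be handed to spacy_parse as document names. The variables words, long and doc_names are illustrative only.

```r
mytext <- data.frame(text = c("test text", "section 2 sending"),
                     id = c(32, 41),
                     stringsAsFactors = FALSE)

# Split each phrase into words (stand-in for tidyr::separate_rows)
words <- strsplit(mytext$text, " ", fixed = TRUE)

# Repeat each id once per word, as the long format does
long <- data.frame(text = unlist(words),
                   id = rep(mytext$id, lengths(words)),
                   stringsAsFactors = FALSE)

# These ids are what would become the document names
doc_names <- as.character(long$id)
print(doc_names)

# TRUE here means at least one id labels more than one row
has_duplicates <- anyDuplicated(doc_names) > 0
print(has_duplicates)
```

Passing the unsplit phrases, as in the code above, keeps one name per document, which is why the error disappears.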
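Update
If you must separate the rows, you need to modify the id column further so that every observation in it is unique. Later, at the group_by stage, you can change these ids back. Consider the following code.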
mytext <- data.frame(text = c("test text", "section 2 sending"), id = c(32,41))
df2 <- tidyr::separate_rows(mytext, text) %>% group_by(id) %>% mutate(id = paste0(id, "-", seq_len(n())))
dflemma <-
spacy_parse(structure(df2$text, names = df2$id), lemma = TRUE, pos = FALSE) %>%
group_by(id = sub("(.+)-(.+)", "\\1", doc_id)) %>%
summarise(text = paste(lemma, collapse = " "))
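The id round trip used in the update can be checked on its own, without spaCy. The sketch below is base R only and assumes a long data frame shaped like df2 before the suffix is added. It uses ave in place of group_by and mutate, and the names long, position, unique_id, restored and result are hypothetical. It does not lemmatize anything; it only shows that the suffixed ids are unique and that stripping the suffix recovers the original grouping.

```r
long <- data.frame(text = c("test", "text", "section", "2", "sending"),
                   id = c(32, 32, 41, 41, 41),
                   stringsAsFactors = FALSE)

# Number the rows within each id: 1, 2 for 32 and 1, 2, 3 for 41
position <- ave(seq_along(long$id), long$id, FUN = seq_along)

# Append the position so that every document name is unique
long$unique_id <- paste0(long$id, "-", position)

# Strip the numeric suffix again to recover the original id
long$restored <- sub("-[0-9]+$", "", long$unique_id)

# Rejoin the words per original id
result <- aggregate(text ~ restored, long, paste, collapse = " ")
print(result)
```

Anchoring the pattern at the end of the string with "-[0-9]+$" is a design choice for this sketch: it removes only the trailing counter, so an original id that itself contained a hyphen would survive intact.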