Question

I have a data.frame with two variables: ID and text. I am using the following text analysis command, which returns a data.frame with 48 columns.

analysis <- textstat_readability(mydata$text,  measure = c("all"), remove_hyphens = TRUE)

How can I add these 48 result columns to mydata as separate columns?

Currently I am using the following:

analysis <- cbind(mydata$ID[1:100000], textstat_readability(mydata$text[1:100000],  measure = c("all"), remove_hyphens = TRUE))

But this takes forever.

Answer 1

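You have 100,000 records with text. Depending on your system and the size of each text record, this can take a while. You could try using more cores to speed things up. Most of quanteda's processes run in parallel, so it is worth a try.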

Try the following to see whether it speeds things up:

library(quanteda)
# use all available cores - 1
quanteda_options(threads = parallel::detectCores() - 1)

analyses <- textstat_readability(mydata$text[1:100000],  measure = c("all"), remove_hyphens = TRUE)

analyses <- cbind(mydata$text[1:100000], analyses)

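I tested with a data.frame filled with 2,000 copies of data_char_sampletext, and doing the cbind in the same call as the readability computation made little difference. But that depends on how large the mydata data.frame already is. It is probably better to do it in two steps.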
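One way to read the two-step suggestion, sketched in base R: compute the scores first, then bind them onto mydata itself, timing each step to see which one dominates. The `analysis` data frame here is a made-up stand-in for the textstat_readability() output (a `document` column plus score columns; `score_a` and `score_b` are hypothetical names), so the snippet runs without quanteda.

```r
# Small example data in the same shape as the question's mydata.
mydata <- data.frame(ID = 1:5,
                     text = c("one", "two words", "three short words",
                              "a", "the last text"),
                     stringsAsFactors = FALSE)
n <- nrow(mydata)

# Step 1: compute the scores on their own and time that step.
# ASSUMPTION: this data.frame only imitates the readability output;
# in real use, replace it with the textstat_readability() call.
t_scores <- system.time(
  analysis <- data.frame(document = paste0("text", seq_len(n)),
                         score_a = nchar(mydata$text),
                         score_b = lengths(strsplit(mydata$text, " ")),
                         stringsAsFactors = FALSE)
)

# Step 2: bind the score columns onto mydata, dropping the document column.
# cbind() pairs rows by position, so both inputs must be in the same order.
t_bind <- system.time(
  mydata_scored <- cbind(mydata,
                         analysis[, names(analysis) != "document", drop = FALSE])
)

print(t_scores)
print(t_bind)
```

If nearly all of the time lands in step 1, the cbind is not the bottleneck and only the readability call is worth optimising.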

Answer 2

To be honest, I am not sure why your approach takes so long to finish, but I think the right way to do it is:

# (0.) Load the package and make a random sample dataset (usually this should be
# provided in the question, just saying):

library(quanteda)
mydata <- data.frame(ID = 1:100,
                     text = stringi::stri_rand_strings(
                       n = 100, 
                       length = runif(100, min=1, max=100), 
                       pattern = "[A-Za-z0-9]"),
                     stringsAsFactors = FALSE)

# 1. Make a quanteda corpus, where the ID is stored alongside the text column:

mydata_corpus <- corpus(mydata, docid_field = "ID", text_field = "text")

# 2. Then run the readability command:

analysis <- textstat_readability(mydata_corpus,  measure = c("all"), remove_hyphens = TRUE)

# 3. Now you can either keep this, or merge it with your original set based on
# IDs:

mydata_analysis <- merge(mydata, analysis, by.x = "ID", by.y = "document")

This does not require you to use cbind() at all.
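To illustrate why merging on the ID is more robust than binding by position, here is a minimal base R sketch. The `analysis` data frame is again a hypothetical stand-in for the readability output: its `document` column holds the IDs as character strings, its rows are deliberately shuffled, and `score` is a made-up column name.

```r
mydata <- data.frame(ID = 1:4,
                     text = c("first", "second", "third", "fourth"),
                     stringsAsFactors = FALSE)

# ASSUMPTION: stand-in for the readability output, with rows out of order
# and document names stored as character.
analysis <- data.frame(document = as.character(c(3, 1, 4, 2)),
                       score = c(30, 10, 40, 20),
                       stringsAsFactors = FALSE)

# merge() matches rows on the key columns, not on their position.
mydata_analysis <- merge(mydata, analysis, by.x = "ID", by.y = "document")

# Sanity check: every original row should have found exactly one match.
stopifnot(nrow(mydata_analysis) == nrow(mydata))
print(mydata_analysis)
```

A cbind() on the same shuffled input would silently attach each score to the wrong ID, which is why the row-count check after the merge is worth keeping.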
