Question

This question is an elaboration of one I asked earlier about repeating functions on sequentially-labeled dataframes.

In the past, I needed to make minor changes (e.g. changing dates, recoding) to data.tables read into R from a folder.

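Now, however, my goal is a bit more complex: I want to read several text files from a folder, draw a random sample from each of these character vectors, read the random sample into a corpus (using the package tm), and then generate a new data.frame containing a list of words/phrases and their frequencies.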

The code I have developed so far is as follows:

library(tm)     # Corpus, tm_map, TermDocumentMatrix, stopwords
library(RWeka)  # NGramTokenizer, Weka_control

# Finds words or phrases (n-grams of 1 to 5 words)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 5))

# Lists the text files to read in
files <- list.files("~/path/", full.names = TRUE, pattern = "\\.txt$")

out <- lapply(seq_along(files), function(x) {
  df <- scan(files[x], what = "", sep = "\n")         # Read in file, one element per line
  df <- sample(c(df), size = 1500, replace = FALSE)   # Take random sample
  corpus <- Corpus(VectorSource(df))                  # Create corpus
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))  # Create term document matrix
  m <- as.matrix(tdm)
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)          # Create new dataframe with words & their frequencies
})

However, while this function works, I'm not sure how to keep only the data.frame d and discard the rest. Does out contain all of the objects created within lapply?

Answer 1

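The lapply function returns a list containing the values returned by the specified function. In your example, the function returns only the data frame assigned to d, so out will be a list containing only the d data frames. All other objects created by the function (such as tdm, m, and v) are discarded, which seems to be what you want.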

You can access the data frames in out by indexing, as in out[[1]], operate on them with lapply (as in lapply(out, function(d) d$word)), or combine them with do.call('rbind', out).
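As a minimal sketch of the points above, the toy function below stands in for the one in the question (the values are made up for illustration, and no text files or tm calls are involved). It creates an intermediate object v and a data frame d, and only d ends up in out.

```r
# Toy stand-in for the question's function: builds an intermediate
# object, but only the last evaluated value (d) is returned to lapply.
out <- lapply(1:3, function(i) {
  v <- c(a = i, b = i * 2)                    # intermediate object, discarded afterwards
  d <- data.frame(word = names(v), freq = v)  # the data frame we want to keep
  d                                           # explicit return value
})

first    <- out[[1]]                          # index a single data frame
words    <- lapply(out, function(d) d$word)   # pull one column from each data frame
combined <- do.call("rbind", out)             # stack them; all must share the same columns
```

Ending the function with d on its own line makes the return value explicit; the assignment alone also works, as in the question, because its value is the last expression evaluated.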

Answer 2

Thanks, this is what I got with

do.call('rbind', out)

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match

So I used

lapply(seq_along(d.names), 
       function(i,x) {assign(paste0("a",i),x[[i]], envir=.GlobalEnv)},
       x=out)

I wanted to keep the original data frame names, so I did this

lapply(seq_along(d.names), 
   function(i,x) {assign(paste0(d.names[i],i),x[[i]], envir=.GlobalEnv)},
   x=out)

and it worked.
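For comparison, here is a sketch of an alternative that keeps the results in a named list instead of assigning each one into the global environment. d.names is not defined anywhere in this thread, so the sketch assumes the names are derived from the file names; the paths and data frames below are hypothetical placeholders.

```r
# Hypothetical file paths standing in for the output of list.files()
files <- c("~/path/alpha.txt", "~/path/beta.txt")

# Placeholder data frames standing in for the word/frequency tables
out <- lapply(seq_along(files), function(i) {
  data.frame(word = c("x", "y"), freq = c(i, i + 1))
})

# Name each list element after its source file (extension dropped)
names(out) <- tools::file_path_sans_ext(basename(files))

alpha_df <- out[["alpha"]]   # look up a data frame by name
beta_df  <- out$beta         # equivalent shorthand
```

A named list keeps the data frames together, so later steps can still loop over them with lapply, whereas objects created with assign() have to be tracked individually.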

Thanks for your input.

Repeating sequential functions when creating multiple lists/matrices/dataframes in `lapply`