Question

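I am trying to count how many times each word occurs in each row of a data frame, keeping the class and timestamp of that row. This is my data frame: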

library(stringr)

df <- data.frame("Corpus" = c("this is some text", 
                              "here is some more text text",
                              "more food for everyone",
                              "less for no one",
                              "something text here is some more text",
                              "everyone should go home",
                              "more random text",
                              "random text more more more",
                              "plenty of random text",
                              "the final piece of random everyone text"),

                 "Class" = c("X", "Y", "Y", "Y", "Y",
                           "Y", "Y", "Z",
                           "Z", "Z"),

                 "OpenTime" = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
                                "11/15/2015 01:45:00", "08/23/2012 1:23:00",
                                "12/17/2016 11:45:00", "12/16/2016 9:47:00",
                                "04/11/2015 04:23:00", "11/27/2016 12:12:00",
                                "08/25/2015 10:46:00", "09/27/2016 10:46:00"))

I would like to get this result:

Class    OpenTime             Word    Frequency
X        12/01/2016 10:45:00  this    1
X        12/01/2016 10:45:00  is      1
X        12/01/2016 10:45:00  some    1
X        12/01/2016 10:45:00  text    1
Y        11/07/2016 10:32:00  here    1
Y        11/07/2016 10:32:00  is      1
Y        11/07/2016 10:32:00  some    1
Y        11/07/2016 10:32:00  more    1
Y        11/07/2016 10:32:00  text    2
...

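I would love to do all of this in dplyr with group_by, but I have not managed to yet. Instead, this is what I have tried:

splits <- strsplit(as.character(df$Corpus), split = " ")
counts <- lapply(splits, table)
counts.melted <- lapply(counts, melt)  # melt() comes from reshape2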

> counts.melted
[[1]]
  Var1 value
1   is     1
2 some     1
3 text     1
4 this     1

[[2]]
  Var1 value
1 here     1
2   is     1
3 more     1
4 some     1
5 text     2
...

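This gives me the melted view I want.

But how can I tie that list of melted tables back to the original data to produce the desired output above? I tried using rep to repeat the Class values for as many words as there are in each row, but with little success. It would be easy to do all of this in a for loop, but I would much rather use a vectorized approach such as lapply to fill a result data frame like this one:

out.df <- data.frame("RRN" = NULL, "OpenTime" = NULL, "Word" = NULL, "Frequency" = NULL)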


Answer 1

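For anyone who lands here in the future: I was able to vectorize most of the solution to my problem. Unfortunately, I am still looking for a way to use lapply in place of the for loop below, but this does exactly what I want: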

# melt() below comes from the reshape2 package
library(reshape2)

# split each row in the corpus column on spaces
splits <- strsplit(as.character(df$Corpus), split = " ")

# count the number of times each word in a row appears in that row
counts <- lapply(splits, table)

# melt that table to make things more palatable
counts.melted <- lapply(counts, melt)

# the result data frame to which we'll append our results
out.df <- data.frame("Class" = c(), "OpenTime" = c(), 
                     "Word" = c(), "Frequency" = c())

# it would be better to vectorize this, using something like lapply
for (idx in seq_along(counts.melted)) {

  # take the melted table at that index as a data frame
  count.df <- as.data.frame(counts.melted[[idx]])

  # change the column names
  names(count.df) <- c("Word", "Frequency")

  # repeat the Class and time for that row to fill in those columns
  count.df[, 'Class'] <- rep(as.character(df[idx, "Class"]), nrow(count.df))
  count.df[, 'OpenTime'] <- rep(as.character(df[idx, "OpenTime"]), nrow(count.df))

  # append the results
  out.df <- rbind(out.df, count.df)
}
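
One way the remaining for loop could be replaced, offered as a rough sketch and not a tested drop-in: base R's Map() can walk the split words, the Class values and the OpenTime values in parallel, and do.call(rbind, ...) can stack the pieces. The helper name count_row is made up for illustration, the sample data is only the first three rows from the question, and the sketch uses names() and as.vector() on the table in place of melt so that it has no package dependency.

```r
# First three rows of the question's data, kept short for illustration
df <- data.frame(
  Corpus = c("this is some text",
             "here is some more text text",
             "more food for everyone"),
  Class = c("X", "Y", "Y"),
  OpenTime = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
               "11/15/2015 01:45:00"),
  stringsAsFactors = FALSE
)

# Split each row of the corpus column on spaces
splits <- strsplit(as.character(df$Corpus), split = " ")

# count_row is a hypothetical helper: it counts the words of one row and
# attaches that row's Class and OpenTime to every resulting line
count_row <- function(words, cls, time) {
  counts <- table(words)
  data.frame(Class = cls,
             OpenTime = time,
             Word = names(counts),
             Frequency = as.vector(counts),
             stringsAsFactors = FALSE)
}

# Map() iterates over the three inputs in parallel, so no index is needed
pieces <- Map(count_row, splits, df$Class, df$OpenTime)

# Stack the per-row data frames into one long data frame
out.df <- do.call(rbind, pieces)
rownames(out.df) <- NULL

print(out.df)
```

Building the full list first and binding once also avoids growing out.df inside a loop. If dplyr and tidyr are available, tidyr::separate_rows() followed by dplyr::count() is probably closer to the group_by style the question asked about, though that route is not shown here.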

Binding melted table objects back to the original data frame?
