I am trying to count the number of times each word occurs at a given time in a data frame. Here is my data frame:
library(stringr)
df <- data.frame("Corpus" = c("this is some text",
"here is some more text text",
"more food for everyone",
"less for no one",
"something text here is some more text",
"everyone should go home",
"more random text",
"random text more more more",
"plenty of random text",
"the final piece of random everyone text"),
"Class" = c("X", "Y", "Y", "Y", "Y",
"Y", "Y", "Z",
"Z", "Z"),
"OpenTime" = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
"11/15/2015 01:45:00", "08/23/2012 1:23:00",
"12/17/2016 11:45:00", "12/16/2016 9:47:00",
"04/11/2015 04:23:00", "11/27/2016 12:12:00",
"08/25/2015 10:46:00", "09/27/2016 10:46:00"))
I would like to get this result:
Class  OpenTime             Word  Frequency
X      12/01/2016 10:45:00  this  1
X      12/01/2016 10:45:00  is    1
X      12/01/2016 10:45:00  some  1
X      12/01/2016 10:45:00  text  1
Y      11/07/2016 10:32:00  here  1
Y      11/07/2016 10:32:00  is    1
Y      11/07/2016 10:32:00  some  1
Y      11/07/2016 10:32:00  more  1
Y      11/07/2016 10:32:00  text  2
...
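I would love to do all of this with groupby in dplyr, but I haven't managed that yet. Instead, this is what I've tried:
library(reshape2)  # provides melt()
splits <- strsplit(as.character(df$Corpus), split = " ")
counts <- lapply(splits, table)
counts.melted <- lapply(counts, melt)
This gives me the transposed view I want:
> counts.melted
[[1]]
  Var1 value
1   is     1
2 some     1
3 text     1
4 this     1
[[2]]
  Var1 value
1 here     1
2   is     1
3 more     1
4 some     1
5 text     2
...
But how do I tie that list of melted tables back to the original data to produce the desired output above? I tried using rep to repeat the Class values, since each row contains many words, but with little success. It would be easy to do all of this in a for loop that appends to an empty data frame such as
out.df <- data.frame("RRN" = NULL, "OpenTime" = NULL,
                     "Word" = NULL, "Frequency" = NULL)
but I would much rather do this with a vectorized method like lapply.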
Answer 0 (score: 0)
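For those who come here in the future: I was able to vectorize most of the solution to my problem. Unfortunately, I am still looking for a way to use lapply in place of the for loop below, but this does exactly what I want: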
library(reshape2)  # provides melt()

# split each row in the Corpus column on spaces
splits <- strsplit(as.character(df$Corpus), split = " ")
# count the number of times each word in a row appears in that row
counts <- lapply(splits, table)
# melt each table into a two-column data frame (Var1, value)
counts.melted <- lapply(counts, melt)
# the result data frame to which we'll append each row's counts
out.df <- data.frame()
# it would be better to vectorize this, using something like lapply
for (idx in seq_along(counts.melted)) {
  # take the melted table at that index (it is already a data frame)
  count.df <- counts.melted[[idx]]
  # change the column names
  names(count.df) <- c("Word", "Frequency")
  # repeat the Class and time for that row to fill in those columns
  count.df[, "Class"] <- rep(as.character(df[idx, "Class"]), nrow(count.df))
  count.df[, "OpenTime"] <- rep(as.character(df[idx, "OpenTime"]), nrow(count.df))
  # append the results
  out.df <- rbind(out.df, count.df)
}
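One possible way to drop the for loop is sketched below. It is not the method above but a swap to base R's Map and do.call(rbind, ...), on the assumption that these count as the kind of vectorized approach being asked for. It also skips melt and reads the words and counts straight from each table. To keep the sketch short, only the first three rows of the question's data are used.

```r
# First three rows of the question's data, rebuilt here so the sketch is self-contained
df <- data.frame(
  Corpus   = c("this is some text",
               "here is some more text text",
               "more food for everyone"),
  Class    = c("X", "Y", "Y"),
  OpenTime = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
               "11/15/2015 01:45:00"),
  stringsAsFactors = FALSE
)

# split each row on spaces and count the words per row, as in the loop version
counts <- lapply(strsplit(df$Corpus, split = " ", fixed = TRUE), table)

# build one small data frame per row; Map walks counts, Class and OpenTime in parallel,
# and data.frame() recycles the single Class/OpenTime value across that row's words
pieces <- Map(function(tab, cls, time) {
  data.frame(Class     = cls,
             OpenTime  = time,
             Word      = names(tab),
             Frequency = as.integer(tab),
             stringsAsFactors = FALSE)
}, counts, df$Class, df$OpenTime)

# bind all per-row pieces in one call instead of growing out.df inside a loop
out.df <- do.call(rbind, pieces)
rownames(out.df) <- NULL
```

Binding once at the end avoids copying out.df on every iteration, which is the main cost of the rbind-in-a-loop pattern. The columns also come out in the order shown in the desired output (Class, OpenTime, Word, Frequency).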