Keeping the top 10% of observations across multiple data frames without losing data in R

Time: 2015-09-12 16:02:11

Tags: r subset dplyr rbind

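I'm having trouble combining multiple data frames with rbind.fill while also subsetting them. My data frames are generally 1 x (6,000 to 60,000): the single row is a given memorial (e.g. the Lincoln Memorial), the columns are every word pair that appears in its Yelp and Trip Advisor reviews, and the values are the number of times each word pair occurs.

I want to reduce each data frame to its top 10% of word pairs, without losing a word pair that falls in the bottom 90% of one data frame but in the top 10% of another.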

four_score = c(60)
seven_years = c(100)
dataframe1 <- data.frame(four_score,seven_years)


seven_years = c(10)
our_fathers = c(40)
dataframe2 <- data.frame(seven_years,our_fathers)

seven_years = c(100, 10)
our_fathers = c(NA, 40)
goal = (data.frame(seven_years,our_fathers))
goal$dfName <- c("dataframe1", "dataframe2")

This is my goal: I take the most frequent word pairs (the top 10%) from each DF (seven_years = 100, our_fathers = 40), but I also fill in seven_years = 10 (in DF2 it falls in the bottom 90%, but because it is in the top 10% of DF1, it gets filled in).

My code so far is quite messy:

library(plyr)        # provides rbind.fill; load before dplyr to limit masking
library(reshape2)
library(dplyr)
library(data.table)
four_score = c(60)
seven_years = c(100)
dataframe1 <- data.frame(four_score,seven_years)
dataframe1 <- data.frame(t(dataframe1))
dataframe1$Words <- row.names(dataframe1)
colnames(dataframe1)[1] <- "Count"
dataframe1 = dataframe1[order(-dataframe1$Count),]
row.names(dataframe1)<- NULL
dfName = "dataframe1"
dataframe1  <-cbind(dataframe1,dfName)
melted_df1 <- melt(dataframe1, id=c("dfName", "Words"), measure="Count", variable.name="test")

seven_years = c(10)
our_fathers = c(40)
dataframe2 <- data.frame(seven_years,our_fathers)
dataframe2 <- data.frame(t(dataframe2))
dataframe2$Words <- row.names(dataframe2)
colnames(dataframe2)[1] <- "Count"
dataframe2 = dataframe2[order(-dataframe2$Count),]
row.names(dataframe2)<- NULL
dfName = "dataframe2"
dataframe2  <-cbind(dataframe2,dfName)
melted_df2 <- melt(dataframe2, id=c("dfName", "Words"), measure="Count", variable.name="test")

merged_melt <- rbind.fill(melted_df1, melted_df2)
merged_melt <- data.table(merged_melt)

so_close <- merged_melt[order(value, decreasing = TRUE), head(.SD, n = ceiling(.N/10)), by = dfName] %>%
  dcast.data.table(dfName ~ value)

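However, this doesn't solve the real problem: finding seven_years = 10 in the second data frame and filling it in. Do I need some kind of %in% step afterwards?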

1 Answer:

Answer 0 (score: 1)

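You need a process that selects the top x% of rows in each dataset and extracts the corresponding word pairs. Then go back to the full data and look up every row that contains one of those word pairs. This way you keep all the information for a word pair that is in the top x% of one dataset but not of the others.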
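The three steps described above can be sketched in base R. This is only an illustration under stated assumptions: the small `df_full` below is a hand-built long-format copy of the question's example data, and the names `top_fraction`, `top_rows`, `top_words` and `kept` are made up for this sketch.

```r
# Hand-built long-format copy of the question's example (an assumption for
# illustration; in practice this would be the rbind of the melted datasets)
df_full <- data.frame(
  dfName = c("dataframe1", "dataframe1", "dataframe2", "dataframe2"),
  Words  = c("seven_years", "four_score", "our_fathers", "seven_years"),
  value  = c(100, 60, 40, 10),
  stringsAsFactors = FALSE
)

top_fraction <- 0.5  # 0.5 suits this tiny example; the question asks for 0.1

# Step 1: within each dataset, keep the top fraction of rows by value
top_rows <- do.call(rbind, lapply(split(df_full, df_full$dfName), function(d) {
  d <- d[order(-d$value), ]
  head(d, ceiling(nrow(d) * top_fraction))
}))

# Step 2: collect the distinct word pairs that made the cut in any dataset
top_words <- unique(top_rows$Words)

# Step 3: go back to the full data and keep every row for those word pairs
kept <- df_full[df_full$Words %in% top_words, ]
```

Here `kept` retains seven_years for both data frames (100 and 10) and our_fathers for dataframe2, while four_score is dropped because it is not in the top fraction of any dataset.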

Once you have created the 2 melted datasets, try:

# combine all your melted datasets
df_full = rbind(melted_df1, melted_df2)


df_full %>%
  group_by(dfName) %>%                                          # for each dataset
  do(head(.[order(-.$value), ], ceiling(nrow(.) * 0.5))) %>%    # get the top 50% after ordering by value (use 0.1 for the top 10%)
  ungroup() %>%
  select(Words) %>%                                             # keep the words you found
  distinct() %>%                                                # keep distinct words (avoid using a word multiple times)
  inner_join(df_full, by = "Words") %>%                         # join back info from initial table
  dcast(dfName ~ Words, value.var = "value")                    # reshape


    #       dfName our_fathers seven_years
    # 1 dataframe1          NA         100
    # 2 dataframe2          40          10
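If the transpose-and-melt preparation feels heavy, the same idea can also be applied to the one-row data frames directly. The following is a rough base R sketch rather than part of the original answer: `keep_top_words` is a hypothetical helper, and it assumes every data frame has exactly one row of numeric counts and that the list is named.

```r
# Hypothetical helper: takes a named list of one-row data frames and returns
# one row per data frame, with a column for every word pair that is in the
# top `fraction` of at least one of them (NA where a word pair is absent).
keep_top_words <- function(dfs, fraction = 0.1) {
  # Turn each one-row data frame into a named numeric vector of counts
  counts <- lapply(dfs, function(d) unlist(d[1, , drop = FALSE]))

  # Names of the top word pairs in each data frame, pooled and de-duplicated
  top_words <- unique(unlist(lapply(counts, function(x) {
    n_keep <- ceiling(length(x) * fraction)
    names(sort(x, decreasing = TRUE))[seq_len(n_keep)]
  })))

  # Look those word pairs up in every data frame; missing ones become NA
  out <- do.call(rbind, lapply(counts, function(x) unname(x[top_words])))
  dimnames(out) <- list(NULL, top_words)

  data.frame(dfName = names(dfs), out, check.names = FALSE)
}

dfs <- list(
  dataframe1 = data.frame(four_score = 60, seven_years = 100),
  dataframe2 = data.frame(seven_years = 10, our_fathers = 40)
)
result <- keep_top_words(dfs, fraction = 0.1)
```

For the example data, `result` has seven_years = 100 and 10 and our_fathers = NA and 40, which matches the goal in the question. Ties at the cutoff are broken by whatever order `sort` returns, so a stricter rule would be needed if ties matter.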