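I'm having trouble using rbind.fill on multiple data frames while subsetting them at the same time. Each data frame is 1 x (6,000 to 60,000): the single row is a given memorial (e.g. the Lincoln Memorial), the columns are every word pair found in its reviews on Yelp and Trip Advisor, and the values are the number of times each word pair occurs.
I want to reduce each data frame to its top 10% of word pairs, without losing a word pair that sits in the bottom 90% of one data frame but ranks in the top 10% of another.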
four_score = c(60)
seven_years = c(100)
dataframe1 <- data.frame(four_score,seven_years)
seven_years = c(10)
our_fathers = c(40)
dataframe2 <- data.frame(seven_years,our_fathers)
seven_years = c(100, 10)
our_fathers = c(NA, 40)
goal = data.frame(seven_years, our_fathers)
goal$dfName <- c("dataframe1", "dataframe2")
This is my goal: take the most frequent word pairs (top 10%) from each DF (seven_years = 100, our_fathers = 40), but also fill in seven_years = 10 (in DF2 it falls in the bottom 90%, but because it is in the top 10% of DF1, it gets filled in).
So far my code is very messy; here it is:
library(reshape2)
library(dplyr)
library(data.table)
four_score = c(60)
seven_years = c(100)
dataframe1 <- data.frame(four_score,seven_years)
dataframe1 <- data.frame(t(dataframe1))
dataframe1$Words <- row.names(dataframe1)
colnames(dataframe1)[1] <- "Count"
dataframe1 = dataframe1[order(-dataframe1$Count),]
row.names(dataframe1)<- NULL
dfName = "dataframe1"
dataframe1 <-cbind(dataframe1,dfName)
melted_df1 <- melt(dataframe1, id=c("dfName", "Words"), measure="Count", variable.name="test")
seven_years = c(10)
our_fathers = c(40)
dataframe2 <- data.frame(seven_years,our_fathers)
dataframe2 <- data.frame(t(dataframe2))
dataframe2$Words <- row.names(dataframe2)
colnames(dataframe2)[1] <- "Count"
dataframe2 = dataframe2[order(-dataframe2$Count),]
row.names(dataframe2)<- NULL
dfName = "dataframe2"
dataframe2 <-cbind(dataframe2,dfName)
melted_df2 <- melt(dataframe2, id=c("dfName", "Words"), measure="Count", variable.name="test")
merged_melt <- plyr::rbind.fill(melted_df1, melted_df2) # rbind.fill lives in plyr, which is not attached above
merged_melt <- data.table(merged_melt)
so_close <- merged_melt[order(value, decreasing = TRUE), head(.SD, n = ceiling(.N/10)), by = dfName] %>%
dcast.data.table(dfName ~ value)
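However, this doesn't solve the real problem: finding seven_years = 10 in the second data frame and filling it in. Do I need some kind of %in% step afterwards?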
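Answer 0 (score: 1)
You need a process that selects the top x% of rows and extracts the corresponding words. Then go back to the full dataset and find the rows that contain those words. That way you keep all the information for words that are in the top x% of one dataset but not of the others.
Once you have created the 2 melted datasets, try: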
# combine all your melted datasets
df_full = rbind(melted_df1, melted_df2)
df_full %>%
  group_by(dfName) %>% # for each dataset
  do(.[order(-.$value),][seq_len(round(nrow(.)*0.5)),]) %>% # get the top 50% after ordering by value (use 0.1 for the top 10%)
  ungroup() %>%
  select(Words) %>% # keep the words you found
  distinct() %>% # keep distinct words (avoid using a word multiple times)
  inner_join(df_full, by="Words") %>% # join back info from initial table
  reshape2::dcast(dfName ~ Words, value.var = "value") # reshape
#       dfName our_fathers seven_years
# 1 dataframe1          NA         100
# 2 dataframe2          40          10
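As a rough alternative sketch (not part of the answer above), the same two-step idea can be written in base R directly on the wide one-row data frames, with %in% doing the lookup. The helper name top_words and its frac argument are hypothetical, introduced only for illustration. frac = 0.5 is used here so that the two-column toy data yields one word per data frame; 0.1 would correspond to the top 10%.

```r
# Hypothetical helper: names of the top `frac` share of columns
# in a one-row data frame of counts.
top_words <- function(df, frac = 0.1) {
  counts <- unlist(df[1, , drop = FALSE])      # named numeric vector of counts
  n_keep <- ceiling(length(counts) * frac)     # keeps at least one word
  names(sort(counts, decreasing = TRUE))[seq_len(n_keep)]
}

# Toy inputs from the question
dfs <- list(
  dataframe1 = data.frame(four_score = 60, seven_years = 100),
  dataframe2 = data.frame(seven_years = 10, our_fathers = 40)
)

# Step 1: union of the words that make the cut in at least one data frame
keep <- unique(unlist(lapply(dfs, top_words, frac = 0.5)))

# Step 2: for every data frame, look up each kept word (NA when absent)
rows <- lapply(dfs, function(df) {
  vals <- rep(NA_real_, length(keep))
  names(vals) <- keep
  present <- keep[keep %in% names(df)]
  vals[present] <- unlist(df[1, present, drop = FALSE])
  vals
})

# One row per data frame, one column per kept word
result <- data.frame(dfName = names(dfs), do.call(rbind, rows))
rownames(result) <- NULL
result
```

Because the kept words are collected across all data frames before the lookup, a word that misses the cut in one data frame is still filled in there as long as it makes the cut somewhere else.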