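I have two datasets. Both are fairly large: the real datasets are roughly 1 million rows by 300 columns. I want to merge the two datasets on the words they have in common. I also want to average each cell that corresponds to the same column and common word, producing a third data.frame. Some sample data is below.
Here is the first dataset. It is the smaller one...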
set.seed(511111)
#first dataset (the smaller one); cbind() with the words makes it a character matrix
df<-matrix(data=rnorm(n=300,mean=10,sd=300),nrow=6,ncol=2)
words<-c("a","by","the","hi","bye","see")
df<-cbind(words,df);colnames(df)=c("y",paste0("V",c(1:2)))
df
     y     V1                  V2
[1,] "a"   "158.979716349289"  "-16.2574951855564"
[2,] "by"  "164.995114380192"  "-68.1726437428752"
[3,] "the" "720.223066121601"  "1054.04351778352"
[4,] "hi"  "-288.629142240942" "537.900385284324"
[5,] "bye" "-581.097490056299" "183.495782507513"
[6,] "see" "-192.129441997881" "-117.187652711745"
Here is the second dataset. It is larger.
#second data.frame with a larger dataset
df2<-matrix(data=rnorm(n=300,mean=0,sd=1),nrow=10,ncol=2)
words2<-c("a","when","by","hi","was","bye","see","how","where","went")
df2<-cbind(words2,df2);colnames(df2)=c("y",paste0("V",c(1:2)))
df2
      y       V1                    V2
 [1,] "a"     "2.55623583381151"    "0.686246827197614"
 [2,] "when"  "-2.19232079339484"   "-0.620807684132454"
 [3,] "by"    "-0.310318599027961"  "-0.456190746859373"
 [4,] "hi"    "-0.0166971880962356" "1.21756976500452"
 [5,] "was"   "1.27945031935845"    "-1.56033115877046"
 [6,] "bye"   "0.169979040969853"   "0.19817006675571"
 [7,] "see"   "2.2791761351847"     "-0.284258324796253"
 [8,] "how"   "1.92863014151405"    "-1.27270442280769"
 [9,] "where" "-1.29927355911528"   "-1.45698273893523"
[10,] "went"  "0.154918778937943"   "-2.03576369295626"
The following are the words common to df and df2...
#common words in df and df2 are
common.words<-c("a","by","hi","bye","see")
common.words
[1] "a" "by" "hi" "bye" "see"
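With around 20,000 common words in the real data, typing the vector by hand is not practical. A minimal sketch, assuming the word columns are plain character vectors like `words` and `words2` above, is to let base R's `intersect()` derive the common words:

```r
# Word vectors copied from the sample data above
words  <- c("a", "by", "the", "hi", "bye", "see")
words2 <- c("a", "when", "by", "hi", "was", "bye", "see", "how", "where", "went")

# intersect() keeps the unique values found in both vectors,
# in the order in which they appear in the first argument
common.words <- intersect(words, words2)
common.words
```

On the matrices built above, the same call would presumably be `intersect(df[, "y"], df2[, "y"])`.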
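I want the third dataset to look like the one below. For each common word, I take the mean of each column. So for column V1, the mean of df[1,2] and df2[1,2] goes into df3 for the word "a". In my real dataset I will be doing this for roughly 20,000 common words. For words that do not match across the two datasets, I would like to either discard them, set them to NA, or keep their values from whichever dataset they appear in without averaging, so that the result is a mix of averaged common words plus the unique words from df and df2. Whichever is easier...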
#what I want the dataset to look like after its finished merging and averaging columns V1 and V2 for common words
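The first value, -200.365, is computed by taking the mean of df[1,2] (-399.988526255518) and df2[1,2] ("-1.47232443999644"); the common word for that row is "a". The second value, 8.64, is computed by taking the mean of df[1,3] (16.9236076090913) and df2[1,3] ("-0.520509732658999"); the common word for that row is also "a".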
numbers<-data.frame(V1=c("-200.365","121.227","91.187","29.125","100.76"),
                    V2=c("8.64","80.558","-138.89","68.11","86.454"))
df3<-cbind(common.words,numbers)
df3
common.words V1 V2
1 a 80.8 -7.79
2 by 82.3 -34.3
3 bye -290. 91.8
4 hi -144. 270.
5 see -94.9 -58.7
I applied your solution to this problem...
df <- data.frame(df)
df2 <- data.frame(df2)
library(dplyr)
#df.list=list(df,df2)
df3<-bind_rows(df,df2) %>%
  mutate_at(vars(starts_with("V")), as.numeric) %>%
  filter(y %in% common.words) %>%
  group_by(y) %>%
  summarise_all(mean)
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
4: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
5: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
6: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
7: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
8: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
9: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
df3
# A tibble: 5 x 3
y V1 V2
<chr> <dbl> <dbl>
1 a 80.8 -7.79
2 by 82.3 -34.3
3 bye -290. 91.8
4 hi -144. 270.
5 see -94.9 -58.7
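Answer 0 (score: 1)
Bind the rows of the two data frames together, convert the value columns to numeric, filter to keep only the rows whose y is in common.words, group_by y, and then compute the mean of each column.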
library(dplyr)
bind_rows(df, df2) %>%
mutate_at(vars(starts_with("V")), as.numeric) %>%
filter(y %in% common.words) %>%
group_by(y) %>%
summarise_all(mean)
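In dplyr 1.0.0 and later, the scoped verbs `mutate_at()` and `summarise_all()` are superseded by `across()`. The same pipeline can be sketched as follows; the inputs `a` and `b` and their values are made up for illustration and are not the question's data, and `stringsAsFactors = FALSE` is set so that the columns stay character on older R versions too:

```r
library(dplyr)

# Hypothetical toy inputs with numbers stored as text, as in the question
a <- data.frame(y = c("a", "by", "the"), V1 = c("1", "3", "5"),
                V2 = c("10", "30", "50"), stringsAsFactors = FALSE)
b <- data.frame(y = c("a", "by", "was"), V1 = c("3", "5", "7"),
                V2 = c("20", "50", "70"), stringsAsFactors = FALSE)

# Words present in both inputs
common <- intersect(a$y, b$y)

res <- bind_rows(a, b) %>%
  mutate(across(starts_with("V"), as.numeric)) %>%  # text -> numeric
  filter(y %in% common) %>%                         # keep common words only
  group_by(y) %>%
  summarise(across(everything(), mean))             # per-word column means
res
```

If the value columns were factors rather than character, `as.numeric()` would return the level codes, so converting through `as.character()` first would be the safer choice.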
We can apply the same logic in base R with aggregate
#rbind both the datasets
df1 <- rbind(df, df2)
#Convert factor numbers to numeric
df1[2:3] <- lapply(df1[2:3], function(x) as.numeric(as.character(x)))
#Filter and aggregate
aggregate(.~y, df1[df1$y %in% common.words, ], mean)
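The question also asks about the words that appear in only one dataset. One possible sketch, assuming the columns are already numeric, is to skip the filter step: a word that occurs once forms a group of one row, so its "mean" is simply its own value. The inputs `a` and `b` below are made-up illustrations, not the question's data.

```r
# Hypothetical toy inputs; "a" is the only shared word
a <- data.frame(y = c("a", "the"), V1 = c(1, 5), V2 = c(10, 50),
                stringsAsFactors = FALSE)
b <- data.frame(y = c("a", "was"), V1 = c(3, 7), V2 = c(20, 70),
                stringsAsFactors = FALSE)

all_rows <- rbind(a, b)

# Option 1: keep every word; unique words pass through unchanged
res_all <- aggregate(. ~ y, all_rows, mean)

# Option 2: keep every word but blank out the ones seen only once
counts <- table(all_rows$y)
once   <- names(counts)[as.vector(counts) < 2]
res_na <- res_all
res_na[res_na$y %in% once, c("V1", "V2")] <- NA

res_all
res_na
```

Discarding the unmatched words instead corresponds to the filtered version shown above.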
Data
df <- data.frame(df)
df2 <- data.frame(df2)