首先,我有两个独立文本来源的功能矩阵和data.frame
个功能。在每一个上,我都执行了不同的文本挖掘方法。现在,我想组合它们,但我知道其中一些列具有相同的名称,如下所示:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
第二组看起来像这样:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
我从this topic获得了此代码,但我是R的新手,我仍然需要一些帮助:
`data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")`
当我运行此代码时,它返回1x6667的矩阵,并且不会将两个数据集中的“cough”(或任何其他列)合并在一起。我糊涂了。你能帮我解决这个问题吗?
答案 0 :(得分:1)
有很多方法可以做到这一点,比如说。使用基数R,data.table
或dplyr
。选择取决于您的数据量,如果您使用非常大的矩阵(通常是自然语言处理和单词表示的情况),您可能需要使用不同的方法来解决您的问题并描述更好(=最快)的解决方案。
我通过dplyr
做了你想做的事。这有点难看,但它确实有效。我只是合并了两个数据帧,然后对两个数据帧中存在的变量使用for
周期:将它们相加(variable.x和variable.y)然后删除em。请注意,我更改了您的列名以获得再现性,但它不会产生任何影响。如果这对你有用,请告诉我。
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common){
# Create a summed variable
df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
# Delete columns with .x and .y suffixes
df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1