R - Merge/combine columns with the same name when some data values equal zero

Time: 2017-07-29 22:02:06

Tags: r merge

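First, I have a feature matrix and a data.frame of features that come from two separate text sources. I applied a different text-mining method to each. Now I want to combine them, but I know that some of the columns have the same name, as shown below: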

> dtm.matrix[1:10,66:70]
       cough nasal sputum yellow intermitt
    1      1     0      0      0         0
    2      1     0      0      0         0
    3      0     0      0      0         0
    4      0     0      0      0         0
    5      0     0      0      0         0
    6      1     0      0      0         0
    7      0     0      0      0         0
    8      0     0      0      0         0
    9      0     0      0      0         0
    10     0     0      0      0         0

> dim(dtm.matrix)
[1] 14300  6543

The second set looks like this:

    > data1.sub[1:10,c(1,37:40)]
   Data number cough coughing up blood dehydration dental abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0
> dim(data1.sub)
[1] 14300   168

I got this code from this topic, but I am new to R and still need some help:

    data1.sub.merged <- dcast.data.table(merge(
      ## melt the first data.frame and set the key as ID and variable
      setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
      ## melt the second data.frame
      melt(as.data.table(dtm.matrix), id.vars = "Data number"),
      ## you'll have 2 value columns...
      all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      (value.x == 0), value.y, value.x)],
      ## This is the reshaping formula
      "Data number" ~ variable, value.var = "value")

When I run this code, it returns a 1x6667 matrix and does not merge "cough" (or any other column) from the two datasets. I am confused. Can you help me solve this?

1 answer:

Answer 0 (score: 1)

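There are many ways to do this, for example with base R, data.table, or dplyr. The choice depends on the size of your data: if you work with very large matrices (which is often the case in natural language processing and word representations), you may need to try different approaches and profile them to find the better (= fastest) solution.

I did what you want with dplyr. It is a bit ugly, but it works. I simply merged the two data frames and then looped over the variables that exist in both: sum the two copies (variable.x and variable.y), then delete them. Note that I changed your column names for reproducibility, but that does not affect anything. Let me know if this works for you.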

df1 <- read.table(text = 
'     cough nasal sputum yellow intermitt
1      1     0      0      0         0
2      1     0      0      0         0
3      0     0      0      0         0
4      0     0      0      0         0
5      0     0      0      0         0
6      1     0      0      0         0
7      0     0      0      0         0
8      0     0      0      0         0
9      0     0      0      0         0
10     0     0      0      0         0')

df2 <- read.table(text = 
'   Data_number cough coughing_up_blood dehydration dental_abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0')

# Check what variables are common
common <- intersect(names(df1),names(df2))

# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))

# Merge dataframes
df <- merge(df1, df2,by = "ID")

# Sum and clean common variables left in merged dataframe
library(dplyr)

for (variable in common) {
  # Exact names of the suffixed copies that merge() created
  suffixed <- paste0(variable, c(".x", ".y"))
  # Create a summed variable (exact names avoid also matching e.g. "cough.other")
  df[[variable]] <- df %>% select(all_of(suffixed)) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-all_of(suffixed))
}

df
   ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1   1     0      0      0         0           1                 0           0              0     1
2   2     0      0      0         0           3                 0           0              0     2
3   3     0      0      0         0           6                 0           0              0     0
4   4     0      0      0         0           8                 0           0              0     0
5   5     0      0      0         0           9                 0           0              0     0
6   6     0      0      0         0          11                 0           0              0     2
7   7     0      0      0         0          12                 0           0              0     0
8   8     0      0      0         0          13                 0           0              0     0
9   9     0      0      0         0          15                 0           0              0     0
10 10     0      0      0         0          16                 0           0              0     1
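
Note that summing produces values of 2 where both sources flag the same term (rows 2 and 6 above). If a 0/1 indicator is wanted instead, which seems closer to the `ifelse` in the question, the same idea can be sketched in base R with `pmax`. This is only a minimal sketch: the toy data frames below are hypothetical, and it assumes that row i of both tables describes the same record, just as the `ID` column above does.

```r
# Hypothetical toy data, much smaller than the real 14300-row tables
df1 <- data.frame(cough = c(1, 1, 0), nasal = c(0, 1, 0))
df2 <- data.frame(Data_number = c(1, 3, 6),
                  cough       = c(0, 1, 0),
                  dehydration = c(0, 0, 1))

# Assumption: the two tables are aligned row by row
stopifnot(nrow(df1) == nrow(df2))

# Columns that occur in both tables
common <- intersect(names(df1), names(df2))

# Start from the columns that occur in only one of the two tables
combined <- cbind(
  df2[setdiff(names(df2), common)],
  df1[setdiff(names(df1), common)]
)

# For shared columns take the element-wise maximum, so 1 + 1 stays 1
for (variable in common) {
  combined[[variable]] <- pmax(df1[[variable]], df2[[variable]])
}

combined
```

Replacing `pmax(a, b)` with `a + b` gives the same sums as the dplyr loop. Because no `merge()` is involved, no `.x`/`.y` columns are created and nothing has to be cleaned up afterwards.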