Question

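I have a feature matrix and a data.frame of features that come from two separate text sources. I applied a different text-mining method to each one. Now I want to combine them, but I know that some columns have the same name in both, as shown below: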

> dtm.matrix[1:10,66:70]
       cough nasal sputum yellow intermitt
    1      1     0      0      0         0
    2      1     0      0      0         0
    3      0     0      0      0         0
    4      0     0      0      0         0
    5      0     0      0      0         0
    6      1     0      0      0         0
    7      0     0      0      0         0
    8      0     0      0      0         0
    9      0     0      0      0         0
    10     0     0      0      0         0

> dim(dtm.matrix)
[1] 14300  6543

The second set looks like this:

    > data1.sub[1:10,c(1,37:40)]
   Data number cough coughing up blood dehydration dental abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0
> dim(data1.sub)
[1] 14300   168

I got this code from this topic, but I am new to R and still need some help:

    data1.sub.merged <- dcast.data.table(merge(
      ## melt the first data.frame and set the key as ID and variable
      setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
      ## melt the second data.frame
      melt(as.data.table(dtm.matrix), id.vars = "Data number"),
      ## you'll have 2 value columns...
      all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      (value.x == 0), value.y, value.x)],
      ## This is the reshaping formula
      "Data number" ~ variable, value.var = "value")

When I run this code, it returns a 1x6667 matrix and does not merge "cough" (or any other column) from the two data sets. I am confused. Can you help me fix this?

Answer 1

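There are many ways to do this, for example with base R, data.table, or dplyr. Which one to choose depends on the size of your data: if you work with very large matrices (as is often the case in natural language processing and word representations), you may need to try different approaches and benchmark them to find the best (= fastest) solution. I did what you want with dplyr. It is a bit ugly, but it works. I simply merge the two data frames and then use a for loop over the variables that exist in both: sum the two copies (variable.x and variable.y) and then drop them. Note that I changed your column names for reproducibility, but that does not affect anything. Let me know if this works for you.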
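The variable.x and variable.y names come from merge(), which by default appends the suffixes .x and .y to non-key columns that share a name. A minimal sketch with two made-up data frames (a and b are illustrative names, not from the question):

```r
# Two made-up data frames that share the key "ID" and the column "cough"
a <- data.frame(ID = 1:3, cough = c(1, 0, 0), nasal = c(0, 1, 0))
b <- data.frame(ID = 1:3, cough = c(0, 1, 0), dehydration = c(0, 0, 1))

# merge() keeps both copies of "cough" and renames them cough.x / cough.y
m <- merge(a, b, by = "ID")
names(m)
# [1] "ID"          "cough.x"     "nasal"       "cough.y"     "dehydration"
```

The full example for the data in the question is: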

df1 <- read.table(text = 
'     cough nasal sputum yellow intermitt
1      1     0      0      0         0
2      1     0      0      0         0
3      0     0      0      0         0
4      0     0      0      0         0
5      0     0      0      0         0
6      1     0      0      0         0
7      0     0      0      0         0
8      0     0      0      0         0
9      0     0      0      0         0
10     0     0      0      0         0')

df2 <- read.table(text = 
'   Data_number cough coughing_up_blood dehydration dental_abscess
1            1     0                 0           0              0
2            3     1                 0           0              0
3            6     0                 0           0              0
4            8     0                 0           0              0
5            9     0                 0           0              0
6           11     1                 0           0              0
7           12     0                 0           0              0
8           13     0                 0           0              0
9           15     0                 0           0              0
10          16     1                 0           0              0')

# Check what variables are common
common <- intersect(names(df1),names(df2))

# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))

# Merge dataframes
df <- merge(df1, df2,by = "ID")

# Sum and clean common variables left in merged dataframe
library(dplyr)

for (variable in common) {
  # The two copies of the shared column that merge() created
  suffixed <- paste0(variable, c(".x", ".y"))
  # Create a summed variable from exactly those two columns
  df[[variable]] <- df %>% select(all_of(suffixed)) %>% rowSums()
  # Delete the columns with .x and .y suffixes
  df <- df %>% select(-all_of(suffixed))
}

df
   ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1   1     0      0      0         0           1                 0           0              0     1
2   2     0      0      0         0           3                 0           0              0     2
3   3     0      0      0         0           6                 0           0              0     0
4   4     0      0      0         0           8                 0           0              0     0
5   5     0      0      0         0           9                 0           0              0     0
6   6     0      0      0         0          11                 0           0              0     2
7   7     0      0      0         0          12                 0           0              0     0
8   8     0      0      0         0          13                 0           0              0     0
9   9     0      0      0         0          15                 0           0              0     0
10 10     0      0      0         0          16                 0           0              0     1
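
Note that summing gives 2 in rows where both sources flag cough. The ifelse in the question instead keeps value.x unless it is 0, which for 0/1 data is the same as taking the row-wise maximum. If that is what you want, here is a minimal base R sketch of that variant. It assumes both tables have the same rows in the same order (the same assumption made above when ID is assigned by row position), and combine_max is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: column-bind two data frames, collapsing columns that
# share a name into a single column holding the row-wise maximum.
combine_max <- function(x, y) {
  stopifnot(nrow(x) == nrow(y))  # rows must already be aligned
  common <- intersect(names(x), names(y))
  # Keep the columns that are unique to each input
  out <- cbind(x[setdiff(names(x), common)], y[setdiff(names(y), common)])
  # For shared columns, keep the larger of the two values in each row
  for (v in common) {
    out[[v]] <- pmax(x[[v]], y[[v]])
  }
  out
}

# Small made-up inputs in the shape of the question's data
x <- data.frame(cough = c(1, 1, 0), nasal = c(0, 0, 1))
y <- data.frame(Data_number = c(1, 3, 6), cough = c(0, 1, 0))
res <- combine_max(x, y)
res$cough
# [1] 1 1 0
```

Replacing pmax(x[[v]], y[[v]]) with x[[v]] + y[[v]] gives the summed counts instead.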

R - Merge/combine columns with the same name when some data values equal zero
