Arranging strings in one data frame according to another data frame

Time: 2016-08-14 19:01:16

Tags: r

I have a data frame like this

df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L, 
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920", 
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2", 
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA, 
-9L))

My second data frame looks like this

df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L, 
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21", 
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89", 
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219", 
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L, 
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3", 
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7", 
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA, 
-16L))

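df2 is derived from df1, but some information was added and some was removed. I want to rebuild df2 so that it is laid out like df1, with the column named To arranged accordingly.

So the output should look like this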

From                                             To
O17403                                          cpsf-2
O16276                                          ugt-61
O16520-2                                          -
O17395                                          aat-3
O17323-2                                          -
C1P641;C1P640;A0A061AD21;G5EEV6                  epi-1
Q22501;A0A061AE05                                pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4              CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920   CELE_R11A8.7; R11A8.7

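This means: O17403 is in df2 and is the only string in its df1 row, so it stays as it is. O16276 is also the only string in its df1 row, so it stays as it is too. O16520-2 is in df1 but not in df2, so its To column gets a hyphen. The rest follow the same pattern until C1P641;C1P640;A0A061AD21;G5EEV6: these are all in the same row of df1 and share the same To, so we keep them together as in df1 and just add epi-1.

Probably the best approach is to use df1 as a template, look up each of its entries in df2 to get their To values, and give a hyphen to the ones that have no match.

This is complicated, and I can't even figure out how to approach it. I would appreciate any help.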

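2 answers:

Answer 0 (score: 1)

To solve this, I split the semicolon-separated strings and wrote a nested for-for-if-if loop.

Here is the logic behind the loop, which runs over a data.frame (tmp) of the split strings:

  1. Fix the data classes (i.e. convert factors to character to avoid conflicting level sets) and append a To column to tmp.

  2. For each row and column of tmp, first check whether the cell holds a valid string that matches an entry in df2$From and so has a corresponding value in df2$To; if not, move on to the next iteration.

  3. If it does, look up the matching value in df2$To and check whether tmp$To already holds that value (if so, move on to the next iteration).

  4. If df2$To supplies a new value, put it in the corresponding cell of tmp$To; if it is not the first match for that row, append it to the earlier matches, separated by a semicolon.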

    # Convert factors to character so comparisons do not hit mismatched level sets
    df1$V1   <- as.character(df1$V1)
    df2$From <- as.character(df2$From)
    df2$To   <- as.character(df2$To)
    
    library(stringr)
    # Split each row on ";" into 5 columns (the longest row has 5 IDs; unused cells become "")
    tmp <- as.data.frame(str_split_fixed(df1$V1, ";", n = 5), stringsAsFactors = FALSE)
    id_cols <- seq_len(ncol(tmp))   # remember the ID columns before To is appended
    
    tmp$To <- NA_character_
    for (j in seq_len(nrow(tmp))) {
      for (i in id_cols) {
        hit <- df2$To[df2$From == tmp[j, i]]
        if (length(hit) != 1) next   # no single match in df2 for this cell: skip it
        if (is.na(tmp$To[j]) || tmp$To[j] == hit) {
          tmp$To[j] <- hit           # first match for this row, or the same value again
        } else {
          tmp$To[j] <- paste(tmp$To[j], hit, sep = ";")   # append a new value
        }
      }
    }
    
    df1 <- data.frame(From=df1$V1, To=tmp$To)
    df1
    
                                                From                   To
    1                                         O17403               cpsf-2
    2                                         O16276               ugt-61
    3                                       O16520-2                 <NA>
    4                                         O17395                aat-3
    5                                       O17323-2                 <NA>
    6                C1P641;C1P640;A0A061AD21;G5EEV6                epi-1
    7                              Q22501;A0A061AE05                pps-1
    8            A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4         CELE_F08G5.3
    9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
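
For comparison, the same matching logic can be written without the nested loop. This is only a sketch in base R, not the answer's own method: `lookup_to` is a hypothetical helper name, the data are compact character copies of the question's df1 and df2, and it assumes each ID appears at most once in df2$From. It writes "-" for unmatched rows, as the question asks, instead of NA.

```r
# Compact character versions of the question's data (same values, no factors).
df1 <- data.frame(V1 = c(
  "O17403", "O16276", "O16520-2", "O17395", "O17323-2",
  "C1P641;C1P640;A0A061AD21;G5EEV6",
  "Q22501;A0A061AE05",
  "A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4",
  "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920"
), stringsAsFactors = FALSE)

df2 <- data.frame(
  From = c("O17403", "O16276", "O17395", "C1P641", "C1P640", "A0A061AD21",
           "G5EEV6", "Q22501", "A0A061AE05", "A0A061AKW6", "Q19219",
           "A0A061AJ82", "Q7JLR4", "A0A061AL89", "A0A061AJK8", "Q21920"),
  To = c("cpsf-2", "ugt-61", "aat-3", "epi-1", "epi-1", "epi-1", "epi-1",
         "pps-1", "pps-1", "CELE_F08G5.3", "CELE_F08G5.3", "CELE_F08G5.3",
         "CELE_F08G5.3", "CELE_R11A8.7", "CELE_R11A8.7", "R11A8.7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map one ";"-separated string to its collapsed To value.
# Assumes every ID occurs at most once in `from`.
lookup_to <- function(ids, from, to, missing = "-") {
  parts <- strsplit(ids, ";", fixed = TRUE)[[1]]
  hits  <- unique(to[match(parts, from)])  # NA where an ID is absent from `from`
  hits  <- hits[!is.na(hits)]
  if (length(hits) == 0) missing else paste(hits, collapse = ";")
}

result <- data.frame(
  From = df1$V1,
  To   = vapply(df1$V1, lookup_to, character(1),
                from = df2$From, to = df2$To, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

Passing `missing = NA_character_` to `lookup_to` would reproduce the NA values shown in the output above.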
    

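Answer 1 (score: 1)

One way to do this is with the splitstackshape package (using cSplit). I convert the factors to strings to keep things simple (and to get rid of warnings).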

library(dplyr)
library(data.table)      # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)

### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))

### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
       "From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
    left_join(df2, by = c(From_temp = 'From')) %>% # Join to 'df2' to get the 'To' column
    group_by(From, rn)                         %>% # Group by original 'From' column.
    summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'), # Create 'To' by joining 'To' Values
              To = ifelse(To=='', '-', To))    %>% # Set empty values to '-'
    ungroup                                    %>%
    arrange(rn)                                %>% # Sort by original row number and
    select(-rn)                                    # remove 'rn' column.

##                                             From                   To
##                                            <chr>                <chr>
## 1                                         O17403               cpsf-2
## 2                                         O16276               ugt-61
## 3                                       O16520-2                    -
## 4                                         O17395                aat-3
## 5                                       O17323-2                    -
## 6                C1P641;C1P640;A0A061AD21;G5EEV6                epi-1
## 7                              Q22501;A0A061AE05                pps-1
## 8            A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4         CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7

There may be a more concise way with dplyr that doesn't need splitstackshape.
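
One possible version without splitstackshape is sketched below. It is an assumption-laden variant, not a tested replacement for the code above: it relies on `tidyr::separate_rows()` in place of `cSplit`, needs dplyr 1.0.0 or later for the `.groups` argument, and uses a hypothetical temporary column named `id`. The data are compact character copies of the question's df1 and df2.

```r
library(dplyr)
library(tidyr)

# Compact character versions of the question's data (same values, no factors).
df1 <- data.frame(V1 = c(
  "O17403", "O16276", "O16520-2", "O17395", "O17323-2",
  "C1P641;C1P640;A0A061AD21;G5EEV6",
  "Q22501;A0A061AE05",
  "A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4",
  "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920"
), stringsAsFactors = FALSE)

df2 <- data.frame(
  From = c("O17403", "O16276", "O17395", "C1P641", "C1P640", "A0A061AD21",
           "G5EEV6", "Q22501", "A0A061AE05", "A0A061AKW6", "Q19219",
           "A0A061AJ82", "Q7JLR4", "A0A061AL89", "A0A061AJK8", "Q21920"),
  To = c("cpsf-2", "ugt-61", "aat-3", "epi-1", "epi-1", "epi-1", "epi-1",
         "pps-1", "pps-1", "CELE_F08G5.3", "CELE_F08G5.3", "CELE_F08G5.3",
         "CELE_F08G5.3", "CELE_R11A8.7", "CELE_R11A8.7", "R11A8.7"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(rn = row_number(), id = V1) %>%      # keep row order; copy V1 for splitting
  separate_rows(id, sep = ";") %>%            # one row per individual ID
  left_join(df2, by = c("id" = "From")) %>%   # attach To where the ID exists in df2
  group_by(rn, From = V1) %>%                 # regroup by the original df1 row
  summarise(To = paste(unique(To[!is.na(To)]), collapse = ";"),
            .groups = "drop") %>%
  mutate(To = ifelse(To == "", "-", To)) %>%  # rows with no match get "-"
  arrange(rn) %>%
  select(From, To)

result
```

The pipeline follows the same split, join, regroup, and collapse steps as the cSplit version; only the splitting function differs.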