I have a data frame like this:
df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L,
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920",
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2",
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))
My second data frame looks like this:
df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L,
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21",
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89",
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219",
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L,
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3",
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7",
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA,
-16L))
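df2 is derived from df1, but some information has been added and some has been removed. I want to rebuild df2 so that it follows the layout of df1, with the To column arranged accordingly, so the output should look like this: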
From To
O17403 cpsf-2
O16276 ugt-61
O16520-2 -
O17395 aat-3
O17323-2 -
C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
Q22501;A0A061AE05 pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
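This means: O17403 is in df2 and is the only string in its row of df1, so it stays as it is. O16276 is also a single string in its row of df1, so it stays as it is too. O16520-2 is in df1 but not in df2, so its entry in the To column is a hyphen. The rest follow the same pattern until C1P641;C1P640;A0A061AD21;G5EEV6: these are all in the same row of df1 and their To values are identical, so we keep them exactly as they appear in df1 and add a single epi-1.
Probably the best way is to use df1 as the template and look up the To value in df2 for each ID; IDs without a match just get a hyphen.
This is complicated and I cannot even think of how to do it. I would appreciate any help.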
Answer 0 (score: 1)
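To solve this, I split the semicolon-delimited strings and wrote a nested for-for-if-if loop.
Here is the logic behind the loop, which runs over the data.frame of split strings (tmp):
Fix the data classes (i.e. convert factors to character to avoid conflicting level sets) and append a To column to tmp.
For every row and column of tmp, first check whether the cell contains a valid string that has a matching value in df2$To; if not, go to the next iteration.
If it does, look up the matching To value in df2 and check whether tmp$To already holds that value (if so, go to the next iteration).
If df2$To provides a new matching value, put it in the corresponding cell of tmp$To; if it is not the first match for that row, append it to the previous matches, separated by a semicolon.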
# Convert factors to character so comparisons do not clash on level sets
df1$V1 <- as.character(df1$V1)
df2$From <- as.character(df2$From)
df2$To <- as.character(df2$To)
library(stringr)
# One column per semicolon-separated ID (at most 5 per row in this data)
tmp <- as.data.frame(str_split_fixed(df1$V1, ";", n = 5), stringsAsFactors = FALSE)
tmp$To <- as.character(NA)
for (j in seq_len(nrow(tmp))) {
  for (i in seq_len(ncol(tmp))) {
    if (length(df2$To[df2$From == tmp[j, i]]) == 0 | is.null(tmp[j, i])) {
      next  # empty cell, or ID not present in df2
    } else if (length(df2$To[df2$From == tmp[j, i]]) == 1 & !is.na(tmp[j, i])) {
      if (is.na(tmp$To[j]) | tmp$To[j] == df2$To[df2$From == tmp[j, i]]) {
        # first match for this row, or the same value again
        tmp$To[j] <- df2$To[df2$From == tmp[j, i]]
      } else {
        # a different value: append it after a semicolon
        tmp$To[j] <- paste(tmp$To[j], ";", df2$To[df2$From == tmp[j, i]], sep = "")
      }
    } else {
      next
    }
  }
}
df1 <- data.frame(From = df1$V1, To = tmp$To)
df1
  From                                           To
1 O17403                                         cpsf-2
2 O16276                                         ugt-61
3 O16520-2                                       <NA>
4 O17395                                         aat-3
5 O17323-2                                       <NA>
6 C1P641;C1P640;A0A061AD21;G5EEV6                epi-1
7 Q22501;A0A061AE05                              pps-1
8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4            CELE_F08G5.3
9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
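The same lookup can probably be written without explicit loops. The following is only a sketch in base R, not part of the original answer: it rebuilds the question's data as plain character columns, and the names lookup, ids and collapse_to are made up for illustration. It assumes every ID occurs at most once in df2$From (true for this data); with duplicated IDs only the first To value would be picked up. Unlike the loop, it writes "-" instead of NA for rows without a match, as the question asked.

```r
# The question's data, rebuilt as character columns (same values, same order)
df1 <- data.frame(
  V1 = c("O17403", "O16276", "O16520-2", "O17395", "O17323-2",
         "C1P641;C1P640;A0A061AD21;G5EEV6",
         "Q22501;A0A061AE05",
         "A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4",
         "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  From = c("O17403", "O16276", "O17395", "C1P641", "C1P640", "A0A061AD21",
           "G5EEV6", "Q22501", "A0A061AE05", "A0A061AKW6", "Q19219",
           "A0A061AJ82", "Q7JLR4", "A0A061AL89", "A0A061AJK8", "Q21920"),
  To = c("cpsf-2", "ugt-61", "aat-3", "epi-1", "epi-1", "epi-1", "epi-1",
         "pps-1", "pps-1", "CELE_F08G5.3", "CELE_F08G5.3", "CELE_F08G5.3",
         "CELE_F08G5.3", "CELE_R11A8.7", "CELE_R11A8.7", "R11A8.7"),
  stringsAsFactors = FALSE
)

# Named vector used as a dictionary: ID -> To value (hypothetical helper name)
lookup <- setNames(df2$To, df2$From)

# One character vector of IDs per row of df1
ids <- strsplit(df1$V1, ";", fixed = TRUE)

# Collapse the distinct To values of one row; "-" when nothing matches
collapse_to <- function(x) {
  hits <- unique(unname(lookup[x]))  # unknown IDs come back as NA
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) "-" else paste(hits, collapse = ";")
}

result <- data.frame(From = df1$V1,
                     To = vapply(ids, collapse_to, character(1)),
                     stringsAsFactors = FALSE)
result
```

Because unique() keeps values in order of first appearance, the To values of a row are listed in the order their IDs occur in df1.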
Answer 1 (score: 1)
One way to do this is with the splitstackshape package (using cSplit). I converted the factors to strings to simplify things (and to get rid of warnings).
library(dplyr)
library(data.table) # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)
### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))
### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
"From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
left_join(df2, by = c(From_temp = 'From')) %>% # Join to 'df2' to get the 'To' column
group_by(From, rn) %>% # Group by original 'From' column.
summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'), # Create 'To' by joining 'To' Values
To = ifelse(To=='', '-', To)) %>% # Set empty values to '-'
ungroup %>%
arrange(rn) %>% # Sort by original row number and
select(-rn) # remove 'rn' column.
## From To
## <chr> <chr>
## 1 O17403 cpsf-2
## 2 O16276 ugt-61
## 3 O16520-2 -
## 4 O17395 aat-3
## 5 O17323-2 -
## 6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
## 7 Q22501;A0A061AE05 pps-1
## 8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
There is probably a more concise dplyr way that does not require splitstackshape.
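One possible version without splitstackshape, offered as a sketch rather than as the answer's own method: it assumes the tidyr package is available and uses its separate_rows() function to get one row per ID, and it needs dplyr 1.0.0 or later for the .groups argument. The data is rebuilt as character columns so the block runs on its own; the column names rn and id are made up for illustration.

```r
library(dplyr)
library(tidyr)

# The question's data, rebuilt as character columns (same values, same order)
df1 <- data.frame(
  V1 = c("O17403", "O16276", "O16520-2", "O17395", "O17323-2",
         "C1P641;C1P640;A0A061AD21;G5EEV6",
         "Q22501;A0A061AE05",
         "A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4",
         "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  From = c("O17403", "O16276", "O17395", "C1P641", "C1P640", "A0A061AD21",
           "G5EEV6", "Q22501", "A0A061AE05", "A0A061AKW6", "Q19219",
           "A0A061AJ82", "Q7JLR4", "A0A061AL89", "A0A061AJK8", "Q21920"),
  To = c("cpsf-2", "ugt-61", "aat-3", "epi-1", "epi-1", "epi-1", "epi-1",
         "pps-1", "pps-1", "CELE_F08G5.3", "CELE_F08G5.3", "CELE_F08G5.3",
         "CELE_F08G5.3", "CELE_R11A8.7", "CELE_R11A8.7", "R11A8.7"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(rn = row_number(), id = V1) %>%   # keep the row order and a copy to split
  separate_rows(id, sep = ";") %>%         # one row per individual ID
  left_join(df2, by = c("id" = "From")) %>%  # unmatched IDs get To = NA
  group_by(rn, From = V1) %>%              # regroup by the original string
  summarise(To = paste(unique(To[!is.na(To)]), collapse = ";"),
            .groups = "drop") %>%
  mutate(To = ifelse(To == "", "-", To)) %>%  # no match at all -> "-"
  arrange(rn) %>%
  select(From, To)
result
```

This keeps the To values in order of first appearance instead of sorting them as the cSplit version does; for this data both orders give the same strings.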