R - How to classify each row of a data frame based on a partial match with another data frame?

Asked: 2018-05-07 21:44:13

Tags: r dataframe stringr

I have two data frames (df1 and df2). Here is df1:

SAMPLE NAMES
1_a
1_b
1_c
2_a
2_b
3_a
4_a
4_b

Here is df2:

ID  GROUP   
1   X
2   X
3   Y
4   Z

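Here is what I want to do: add a new column to df1 that indicates each sample's group, based on a partial match with the ID column of df2. So samples "2_a" and "2_b" from df1 should get the same group as "2" in df2.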

Desired output:

SAMPLE NAMES    GROUP
1_a             X
1_b             X
1_c             X
2_a             X
2_b             X
3_a             Y
4_a             Z
4_b             Z

So far, I have tried using the stringr package and writing a for loop:

for (i in df1[, 1]){
  for (j in df2$ID){
    x <- which(str_detect(i,j))
    class <- df2[j,1]
    df1$group[i] <- class
  }
}

But it keeps giving me this error:

Error in UseMethod("type") : 
  no applicable method for 'type' applied to an object of class "c('integer', 'numeric')"

What am I doing wrong? Also, is there a way to do this with the apply() family of functions instead of a loop?

3 answers:

Answer 0 (score: 1)

Here is a tidyverse option:

library(tidyverse)
df1 %>% 
 separate(., col = SAMPLE.NAMES, into = c('SAMPLE', 'NAMES'), sep = "_", convert = TRUE) %>% 
 left_join(df2, by = c('SAMPLE' = 'ID')) %>% 
 unite(., col = SAMPLE.NAMES, SAMPLE, NAMES)
#  SAMPLE.NAMES GROUP
#1          1_a     X
#2          1_b     X
#3          1_c     X
#4          2_a     X
#5          2_b     X
#6          3_a     Y
#7          4_a     Z
#8          4_b     Z

We first separate the SAMPLE.NAMES column of df1 into 'SAMPLE' and 'NAMES', so that we can left_join df1 and df2 by 'SAMPLE' and 'ID'. In the last line, we unite the columns 'SAMPLE' and 'NAMES' back into 'SAMPLE.NAMES'.

Data

df1 <- structure(list(SAMPLE.NAMES = structure(1:8, .Label = c("1_a",
    "1_b", "1_c", "2_a", "2_b", "3_a", "4_a", "4_b"), class = "factor")),
    .Names = "SAMPLE.NAMES", class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(ID = 1:4, GROUP = structure(c(1L, 1L, 2L, 3L),
    .Label = c("X", "Y", "Z"), class = "factor")),
    .Names = c("ID", "GROUP"), class = "data.frame", row.names = c(NA, -4L))

Answer 1 (score: 0)

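The main reason your for loop does not work is that str_detect() only accepts character strings as input, but you are using it on the ID column of df2, which is a numeric vector. Your for loop has other problems too: in particular, you define an object x that is never used afterwards, so your code never uses the result of str_detect() to select the elements you want.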
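For illustration, here is a minimal sketch of how the loop could be repaired, assuming the column is named SAMPLE.NAMES as in the other answers. The names `pattern` and `hits` are made up for this example. It converts each ID into a character pattern, anchors it to the start of the sample name so that an ID such as 1 cannot match a name beginning with 11, and actually uses the result of str_detect() to fill in the group:

```r
library(stringr)

df1 <- data.frame(SAMPLE.NAMES = c("1_a", "1_b", "1_c", "2_a", "2_b", "3_a", "4_a", "4_b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:4,
                  GROUP = c("X", "X", "Y", "Z"),
                  stringsAsFactors = FALSE)

# Start with an empty group column
df1$GROUP <- NA_character_

for (j in seq_len(nrow(df2))) {
  # paste0() turns the numeric ID into a character pattern such as "^2_"
  pattern <- paste0("^", df2$ID[j], "_")
  # str_detect() is vectorised over the sample names and returns TRUE/FALSE per row
  hits <- str_detect(df1$SAMPLE.NAMES, pattern)
  # Assign the matching group to every row whose name starts with this ID
  df1$GROUP[hits] <- df2$GROUP[j]
}

df1
```

Looping over the rows of df2 rather than over every pair of rows keeps the work to one vectorised str_detect() call per ID.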

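If you would like a more stringr-based solution, here is another option. It uses neither a for loop nor apply() (at least not directly).

It works by using a regular expression to strip the letter suffix from the "SAMPLE.NAMES" column, leaving only the numeric characters, so that each sample can be linked to its numeric ID. After that, we simply join the data frames together and select the columns we want: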

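# Packages used below (both are part of the tidyverse)
library(dplyr)
library(stringr)

# Example dataframes
df1 <- tibble(SAMPLE.NAMES = c("1_a", "1_b", "1_c", "2_a", "2_b", "3_a", "4_a", "4_b"))
df2 <- tibble(ID = c(1,2,3,4),
              GROUP = c("X", "X", "Y", "Z"))

# Drop the "_a"/"_b"/"_c" suffix to get the numeric ID, join on it, then remove the helper column
df1 <- mutate(df1, ID = as.numeric(str_replace_all(SAMPLE.NAMES, "_[abc]", ""))) %>%
       left_join(df2, by = "ID") %>%
       select(-ID)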

# Output:
# A tibble: 8 x 2
  SAMPLE.NAMES GROUP
  <chr>        <chr>
1 1_a          X    
2 1_b          X    
3 1_c          X    
4 2_a          X    
5 2_b          X    
6 3_a          Y    
7 4_a          Z    
8 4_b          Z  

Answer 2 (score: 0)

Just merge on the part of the string before the underscore:
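A minimal base R sketch of that idea, not necessarily the answerer's own code. It assumes the column is named SAMPLE.NAMES and that every sample name has the form ID, underscore, suffix. The name `result` is made up for this example:

```r
df1 <- data.frame(SAMPLE.NAMES = c("1_a", "1_b", "1_c", "2_a", "2_b", "3_a", "4_a", "4_b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = 1:4,
                  GROUP = c("X", "X", "Y", "Z"),
                  stringsAsFactors = FALSE)

# Keep everything before the first underscore and use it as the join key
df1$ID <- as.integer(sub("_.*$", "", df1$SAMPLE.NAMES))

# merge() joins on the shared ID column; keep only the columns of interest
result <- merge(df1, df2, by = "ID")[, c("SAMPLE.NAMES", "GROUP")]

result
```

Note that merge() sorts its result by the key column by default, so if the original row order of df1 matters, a lookup such as `df2$GROUP[match(df1$ID, df2$ID)]` is an alternative that leaves df1 in place.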
