Question

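I'm trying to write code that creates a new column labeling each compound with a number. Some compounds are repeated in my list, and I need the repeats to share the same number, with a letter added to tell them apart. I don't know how to code this. Thanks. For example, what I currently have: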

    Fructose 1
    Maltose 2
    Sucrose 3
    Sucrose 4

What I want:

    Fructose 1
    Maltose 2
    Sucrose 3
    Sucrose 3b

I can't label each compound by hand because the dataset is so large.

Answer 1

Here is a solution to your problem:

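Create a custom format that maps 1 -> b, 2 -> c, and so on.
Sort your dataset by the carbohydrate column.
Use the BY-group variable First.Carbohydrate (first.Carb in the example below), which equals 1 when the row is the first instance of a carbohydrate.
Use a counter to track the number of repeated values.
Convert the repeat count to a letter suffix with the custom format, via the put function: put(counter, customFormat.)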

Reading up on BY-group processing in the DATA step will help improve your SAS skills.

Here is a complete working example:

    data have;
        length Carb $10; 
        input Carb;
        datalines;
    Fructose
    Maltose 
    Sucrose 
    Sucrose 
    Sucrose 
    Pasta   
    Pasta   
    Rice   
    Rice 
    Rice
    Quinoa
    Bread
    ;

    proc format;
        value dupFormat
        1 = 'b'
        2 = 'c'
        3 = 'd'
        ;
    run;

    proc sort data=have;
        by Carb;
    run;

    data want(keep=Carb Number);
        length Carb $10;
        length Number $3;

        set have;
        by Carb;

        /* nCarbs is the number of distinct carbs written so far */
        if _n_=1 then nCarbs = 0;  

        if first.Carb then do;
            nCarbs+1;
            count_dup = 0; /* the number of duplicate records for the current carb */
            Number = left(put(nCarbs,3.)); 
        end;
        else do;
            count_dup+1;
            Number = cats(put(nCarbs,3.), put(count_dup, dupFormat.));
        end;
    run;

    proc print data=want;
    run;

Answer 2

Here is how I would do it in R with the data.table package.

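First, we key (and thereby sort) the data by compound. Then we create our own index and append letters to its duplicates (though I'm not sure how to handle groups with more than 26 members).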

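    library(data.table)
    # Key (and sort) by compound, then number each distinct compound
    setkey(setDT(df), compound)[, indx := as.character(.GRP), by = compound]
    # Flag repeats by compound only and letter them within each compound,
    # starting at "b" so the first row keeps the bare number
    df[duplicated(df, by = "compound"),
       indx := paste0(indx, letters[seq_len(.N) + 1L]), by = compound]
    df
    #    compound number indx
    # 1: Fructose      1    1
    # 2:  Maltose      2    2
    # 3:  Sucrose      3    3
    # 4:  Sucrose      4   3b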
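
On the open point about groups with more than 26 members: one possible convention, which is an assumption here and not something the question specifies, is to continue spreadsheet-style after "z" with "aa", "ab", and so on. The helper name `dup_suffix` below is hypothetical; it is a minimal base R sketch that maps a row's position within its group to a suffix, leaving the first row bare.

```r
# Hypothetical helper: position within a group -> letter suffix.
# 1 -> "", 2 -> "b", ..., 26 -> "z", 27 -> "aa", 28 -> "ab", ...
dup_suffix <- function(k) {
  vapply(k, function(n) {
    if (n == 1L) return("")          # first occurrence keeps the bare number
    out <- character(0)
    while (n > 0L) {                 # bijective base-26 conversion
      r <- (n - 1L) %% 26L
      out <- c(letters[r + 1L], out)
      n <- (n - 1L) %/% 26L
    }
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Example: 28 rows of one compound followed by a second compound
x <- c(rep("Sucrose", 28), "Maltose")
pos <- ave(seq_along(x), x, FUN = seq_along)   # position within each compound
labels <- paste0(match(x, unique(x)), dup_suffix(pos))
labels[c(1, 2, 26, 27, 28, 29)]
# "1"  "1b"  "1z"  "1aa"  "1ab"  "2"
```

Whether "aa" is an acceptable label after "z" depends on how the labels are used downstream, so treat the scheme as a placeholder.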

Answer 3

Using @jaamor's data, you can do this in base R:

    x <- c('Fructose','Maltose','Sucrose','Sucrose')
    x <- c('Fructose','Maltose','Sucrose','Sucrose','Sucrose','Pasta',
           'Pasta','Rice','Rice','Rice','Quinoa','Bread')
    y <- gsub('a', '', letters[ave(seq_along(x), x, FUN = seq_along)])

    data.frame(x = x, y = paste0(cumsum(!duplicated(x)), y))

    #           x  y
    # 1  Fructose  1
    # 2   Maltose  2
    # 3   Sucrose  3
    # 4   Sucrose 3b
    # 5   Sucrose 3c
    # 6     Pasta  4
    # 7     Pasta 4b
    # 8      Rice  5
    # 9      Rice 5b
    # 10     Rice 5c
    # 11   Quinoa  6
    # 12    Bread  7
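
Note that `cumsum(!duplicated(x))` numbers the groups correctly only when repeats sit next to each other, as they do in the data above. If the real data might have repeats scattered through the rows (an assumption, since the question only shows adjacent ones), a variant based on `match()` keeps the original row order and still gives each compound a single number. The column name `compound` and the sample rows are hypothetical.

```r
# Hypothetical data: repeats are NOT adjacent
df <- data.frame(
  compound = c("Sucrose", "Maltose", "Sucrose", "Fructose", "Maltose"),
  stringsAsFactors = FALSE
)

# Group number = order of first appearance, regardless of adjacency
grp <- match(df$compound, unique(df$compound))
# Position of each row within its compound (1 for the first occurrence)
pos <- ave(seq_along(grp), grp, FUN = seq_along)
# First occurrence gets no suffix; later ones get "b", "c", ..., "z"
df$label <- paste0(grp, c("", letters[-1])[pos])
df$label
# "1"  "2"  "1b" "3"  "2b"
```

As written, this sketch covers at most 26 rows per compound; beyond that the suffix lookup returns NA.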

Renaming duplicate variables in SAS or R
