Removing quasi-duplicates from an R data frame

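Time: 2016-08-05 15:26:31

Tags: r dataframe unique

I have a data frame with two columns. The first column is an identification number and the second is a compound name. However, the compounds in the second column are often repeated (different forms of the same compound). I want to remove every duplicate and keep only the simplest form of each compound.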

Here is the data frame:

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   333874                             Citric acid, 4TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    16985                  Epinephrine, (.beta.)-, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative
   332935                       DL-Norepinephrine, 4TMS derivative

Here is its structure:

> str(NISTSpecR)

'data.frame':   154 obs. of  3 variables:
 $ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ...
 $ NIST: chr  "366620" "366765" "342340" "352374" ...
 $ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ...

I would like the final result to look like this:

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative

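There should be only one entry per parent compound (e.g. formic acid, ...), and it needs to be the simplest version (the one with the fewest characters).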

> dput(as.character(NISTSpecR$NAME))

c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative", 
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative", 
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative", 
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative", 
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)", 
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative", 
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative", 
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS", 
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS", 
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative", 
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS", 
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative", 
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative", 
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative", 
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative", 
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative", 
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative", 
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative", 
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS     derivative", 
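"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative", 
"D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative", 
"L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative", 
"L-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative", 
"D-Gluconic acid, 6TMS derivative", "Glycerol monostearate, 2TMS derivative", 
"Glycerol 2-laurate, 2TMS derivative", "Glycerol, 3TMS derivative", 
"Xylitol, 5TMS derivative", "D-Sorbitol, 6TMS derivative", "D-Mannitol, 6TMS derivative", 
"Sucrose, 8TMS derivative", "D-Lactose, (isomer 1), 8TMS derivative", 
"β-D-Lactose, (isomer 1), 8TMS derivative", "D-Lactose, (isomer 2), 8TMS derivative", 
"β-D-Lactose, (isomer 2), 8TMS derivative", "α-D-Lactose, 8TMS derivative", 
".-α-D-Lactose, 8TMS derivative", "β-Lactose, 8TMS derivative", 
"Lactose, 8TMS derivative", "Maltose, 8TMS derivative, isomer 1", 
"Maltose, 8TMS derivative, isomer 2", "Maltose, 8TMS derivative", 
"D-Trehalose, 7TMS derivative", "Melibiose, 8TMS derivative", 
"L-Ornithine, 3TMS derivative", "DL-Ornithine, 3TMS derivative", 
"DL-Ornithine, 4TMS derivative", "L-Ornithine, 4TMS derivative", 
"L-Homoserine, 2TMS derivative", "L-Citrulline, 3TMS derivative", 
"3-Iodo-L-tyrosine, 3TMS derivative", "3-Aminoisobutyric acid, TMS derivative", 
"3-Aminoisobutyric acid, 3TMS derivative", "3-Aminoisobutyric acid, 2TMS derivative", 
"D-Isoleucine, N-acetyl-, TMS derivative", "L-Hydroxyproline, (E)-, 2TMS derivative", 
"L-Hydroxyproline, (E)-, 3TMS derivative", "Hydroxyproline, 3TMS derivative", 
"3-Hydroxyproline, 3TMS derivative", "L-Cystine, 4TMS derivative", 
"Ethanolamine, 3TMS derivative", "Ethanolamine, 2TMS derivative", 
"3-Aminopropanol, TMS derivative", "Putrescine, 4TMS derivative", 
"Histamine, 2TMS derivative", "Histamine, 3TMS derivative", "Dopamine, 4TMS derivative", 
"Dopamine, 3TMS derivative", "5-Hydroxytryptamine, 4TMS derivative", "Tyramine, 3TMS derivative", 
"Tyramine, TMS derivative", "Tyramine, 2TMS derivative", "Phenethylamine, 2TMS derivative", 
"1-Phenylethylamine, TMS derivative", "Phenethylamine, TMS derivative", 
"Biotin, 3TMS derivative", "16.beta.,17.alpha.-Estriol, 3TMS derivative", 
"Estriol, 3TMS derivative", "16.alpha.,17.alpha.-Estriol, 3TMS derivative", 
"16.beta.,17.beta.-Estriol, 3TMS derivative", "Estrone, TMS derivative", 
"16-Estrone, TMS derivative", "Estrone, O-methyloxime, TMS derivative", 
"Equilin, TMS derivative", "Equilenin, (14.beta.)-, TMS derivative", 
"Equilenin, TMS derivative", "2-Hydroxyestradiol, 3TMS derivative", 
"Androsterone, (E)-, TMS derivative", "Dehydroepiandrosterone, (E)-, TMS derivative", 
"5.beta.-Dihydrotestosterone, TMS derivative", "5.alpha.-Dihydrotestosterone, TMS derivative", 
"Testosterone O-methyloxime, TMS derivative", "Testosterone, TMS derivative", 
"Pregnenolone, TMS derivative", "Aldosterone, 2TMS derivative", 
"Aldosterone, N-methoxy-tri-TMS", "Corticosterone, bis(O-methyloxime)", 
"Deoxycholic acid, 2TMS derivative", "Deoxycholic acid, 3TMS derivative", 
"Lithocholic acid, 2TMS derivative", "Cholesterol, TMS derivative", 
"Desmosterol, TMS derivative", "Ergosterol, TMS derivative", 
"Campesterol, TMS derivative", "Fucosterol, TMS derivative", 
"Stigmastanol, TMS derivative", "Stigmasterol, TMS derivative", 
"11-Deoxycortisol, bis(O-methyloxime)", "Melatonin, 2TMS derivative", 
"Epinephrine, 4TMS derivative", "L-Epinephrine, 4TMS derivative", 
"Glycine, 3TMS derivative", "Glycine, TMS derivative", "Glycine, 2TMS derivative", 
"Aspartic acid, 3TMS derivative", "L-Aspartic acid, 3TMS derivative", 
"L-Aspartic acid, 2TMS derivative", "L-Glutamic acid, 3TMS derivative", 
"(-)-Epinephrine, 3TMS derivative", "Epinephrine, (.beta.)-, 3TMS derivative", 
"(-)-Epinephrine, 4TMS derivative", "Norepinephrine, (R)-, 5TMS derivative", 
"DL-Norepinephrine, 4TMS derivative", "Norepinephrine, (R)-, 4TMS derivative", 
"Cycloserine, 3TMS derivative", "Cycloheximide, 2TMS derivative", 
"Chloramphenicol, 2TMS derivative", "Chloramphenicol, 3TMS derivative")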

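Thanks.

2 Answers:

Answer 0: (score: 1)

Based on your edit, I did the following (nist here is the character vector of compound names, presumably as.character(NISTSpecR$NAME)). First, split each name at the commas and extract the word that ends in one of the matching suffixes: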

library(stringr)   # str_split(), str_extract()
library(magrittr)  # %>%
parents <- str_split(nist, ",") %>% 
  lapply(str_extract, "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)")
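
To see what this pattern picks up, here is a small sketch on three names taken from the question. The names nist_demo, pieces and parents_demo are made up for this illustration and are not part of the answer's code.

```r
library(stringr)

# Three names from the question's data; nist_demo is a hypothetical name.
nist_demo <- c("Formic acid, TMS derivative",
               "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
               "1,3-Dihydroxyacetone dimer, 4TMS derivative")

pattern <- "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)"

# Split each name at the commas, then look for a suffix match in every piece.
pieces <- str_split(nist_demo, ",")
parents_demo <- lapply(pieces, str_extract, pattern)

# Pieces without a match come back as NA, so the parent word can sit
# at a different position in each list element.
print(parents_demo)
```

The third name shows why the position has to be tracked: its first comma falls inside "1,3-", so the parent word is found in the second piece rather than the first.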

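Then, because some of these names contain more than one comma, record where the non-NA values occur in each list element (the list extract_indices) and collect those positions into the vector indvec: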

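extract_indices <- parents %>% 
  lapply(function(x) which(!is.na(x)))
# keep the first matching position per name (NA if nothing matched),
# so that indvec[i] always lines up with parents[[i]]
indvec <- sapply(extract_indices, function(x) x[1])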

Then loop over parents and, for each list element, pull out the piece that holds the parent compound:

answer <- sapply(seq_along(parents),
       function(i){
         # the piece of name i at its recorded match position
         parents[[i]][indvec[i]]
       })

answer

  [1] "Formic"                 "Acetic"                 "Acetic"                 "Propanoic"              "Butyric"               
  [6] "Pentanoic"              "Hexanoic"               "Heptanoic"              "Oxalic"                 "Succinic"              
 [11] "Adipic"                 "Pimelic"                "Suberic"                "Citric"                 "Citric"                
 [16] "Citric"                 "Citric"                 "Isocitric"              "Glyoxylic"              "Pyruvic"               
 [21] "Malic"                  "Malic"                  "Malic"                  "Malic"                  "Hydroxybutanoic"       
 [26] "Prostaglandin"          "Prostaglandin"          "Prostaglandin"          "Arabinose"              "Xylose"                
 [31] "Lyxose"                 "Ribose"                 "Glucose"                "Galactose"              "Mannose"               
 [36] "Allose"                 "Allose"                 "Altrose"                "Dihydroxyacetone"       "Dihydroxyacetone"      
 [41] "Fructose"               "Psicose"                "Sedoheptulose"          "Deoxyribose"            "Deoxyribose"           
 [46] "Fucose"                 "Rhamnose"               "Rhamnose"               "glucosamine"            "Gluconic"              
 [51] "Glycerol"               "Glycerol"               "Glycerol"               "Xylitol"                "Sorbitol"              

It goes on like this...
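
As a possible shortcut: the pattern contains only letters, so a match can never span a comma, and applying str_extract to the whole name should return the same first match as the split-and-index route. The following is a minimal sketch under that assumption; nist_demo and answer_demo are illustrative names, not part of the answer above.

```r
library(stringr)

# A few names from the question's data; nist_demo is a hypothetical name.
nist_demo <- c("Citric acid, 4TMS derivative",
               "Citric acid 3TMS",
               "Malic acid, 4-ethyl ester, 2TMS",
               "D-Allose, oxime (isomer 1), 6TMS derivative")

pattern <- "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)"

# str_extract() returns the first match in each string (NA if there is none),
# which gives one parent key per name without splitting at commas.
answer_demo <- str_extract(nist_demo, pattern)
print(answer_demo)
```

Whether this is preferable depends on the data: it is shorter, but it also silently takes the first matching word in names where an earlier word happens to end in one of the suffixes.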

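Now, since you only want the shortest name for each parent (the one with the fewest characters), first count the characters of every name in the original data. Then, for each extracted parent, find all names that share it and pick the shortest of them from the original data.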

nchar_parent <- nchar(nist)  # length of every full name
final <- vector(mode = "character", length(nist))
for(i in seq_along(nist)){
  # positions of all names that share the parent of name i
  temp_matches <- which(answer %in% answer[i])
  shortest <- temp_matches[which.min(nchar_parent[temp_matches])]
  final[i] <- nist[shortest]
}

Your final answer looks like this:

[1] "Formic acid, TMS derivative"                  "Acetic acid, TMS derivative"                 
  [3] "Acetic acid, TMS derivative"                  "Propanoic acid, TMS derivative"              
  [5] "Butyric Acid, TMS derivative"                 "Pentanoic acid, TMS derivative"              
  [7] "Hexanoic acid, TMS derivative"                "Heptanoic acid, TMS derivative"              
  [9] "Oxalic acid, 2TMS derivative"                 "Succinic acid, monoethyl ester-, (TMS)"      
 [11] "Adipic acid, TMS derivative"                  "Pimelic acid, 2TMS derivative"               
 [13] "Suberic acid, 2TMS derivative"                "Citric acid 3TMS"                            
 [15] "Citric acid 3TMS"                             "Citric acid 3TMS"                            
 [17] "Citric acid 3TMS"                             "Isocitric acid lactone, 2TMS derivative"     
 [19] "Glyoxylic acid, di-TMS"                       "Pyruvic acid, TMS derivative"                
 [21] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"                 
 [23] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"       
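
The loop returns one entry per input name, so repeats remain in the result. If the goal is to drop rows from the data frame itself, one possible sketch is to pick, for each parent, the row with the shortest name. The data frame demo and the vector parent below are hand-typed stand-ins for the question's data and for the extracted parents; they are assumptions for illustration only.

```r
# Hypothetical miniature of the question's data frame (rows from the question).
demo <- data.frame(
  NIST = c("366765", "342340", "333513", "16985"),
  NAME = c("2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
           "Acetic acid, TMS derivative",
           "(-)-Epinephrine, 3TMS derivative",
           "Epinephrine, (.beta.)-, 3TMS derivative"),
  stringsAsFactors = FALSE
)

# Parent keys as the extraction step would produce them (typed by hand here).
parent <- c("Acetic", "Acetic", "Epinephrine", "Epinephrine")

# For each parent, find the row whose NAME has the fewest characters.
len  <- nchar(demo$NAME)
keep <- tapply(seq_along(parent), parent,
               function(idx) idx[which.min(len[idx])])

# Subset the data frame, keeping the surviving rows in their original order.
result <- demo[sort(as.vector(keep)), ]
print(result)
```

When two names of the same parent have equal length, which.min keeps the one that appears first.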

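Answer 1: (score: 0)

If you only need the first part of the second column (the text before the first comma), you can split the second column at the commas and keep the first piece of each result as a helper column. Duplicate entries of df can then be removed based on that computed column, and the last instruction (optionally) drops the helper column again.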

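# part of NAME before the first comma, used as the grouping key
df$foo <- sapply(strsplit(as.character(df$NAME), ",", fixed = TRUE), `[`, 1)
df <- df[!duplicated(df$foo), ]  # keep the first row for each key
df$foo <- NULL                   # optionally drop the helper column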
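
Note that duplicated() keeps the first row it meets for each key, which is not necessarily the shortest name. A hedged variant, assuming the shortest name is what is wanted, sorts by name length before removing duplicates. The data frame df_demo and its id column are hypothetical and only serve to show the row order.

```r
# Hypothetical example data; the id column is only there to track row order.
df_demo <- data.frame(
  id = 1:4,
  NAME = c("Citric acid, ethyl ester, tri-TMS",
           "Citric acid, 3TMS derivative",
           "Malic acid, 4-ethyl ester, 2TMS",
           "Malic acid, 2TMS derivative"),
  stringsAsFactors = FALSE
)

# Sort so that the shortest names come first.
sorted <- df_demo[order(nchar(df_demo$NAME)), ]

# Key = text before the first comma.
key <- sapply(strsplit(sorted$NAME, ",", fixed = TRUE), `[`, 1)

# Keep the first (that is, the shortest) row per key, then restore the original order.
result <- sorted[!duplicated(key), ]
result <- result[order(result$id), ]
print(result)
```

A name without a comma, such as "Citric acid 3TMS", forms its own key under this split, so it would not be grouped with the other citric acid entries.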