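Extracting the quantity from product names in R

Date: 2018-06-29 07:58:18

Tags: r stringr

I have a dataset in which one of the columns is "Name". It holds product names that include the product's quantity (size), as shown below.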

Alkabeer Paratha Plain 400 GM
Almarai Fresh Laban Baladi 2 L
Americana Breaded Chicken Burger 1 KG 
Dac Glass Cleaner 4 L
Duru Body Soap Fruity 125 GM - 4 Pcs
Lux Liquid Handwash Soft Touch 250 ML
Lux Liquid Handwash Magical Beauty 250 ML
Lusine Sliced Bread Multi Grain 600 GM
Orinex Containers Bowl 25 Oz - 4 Pcs
Betty Crocker Frosting Vanilla 400 GM
Freshly Microwave Popcorn 3.5 Oz
Gandour Potato Chips 145 Gm  
Galaxy Chocolate Milk 40 GM
Nahool Jumbo Roll Strawberry 75 GM - 6 Pcs
Nestle Sweetened Condensed Milk 397 GM 
Puck Cheese Triangle Value Pack 120 GM - 5 Pcs
Betty Crocker Super Moist Cake Mix Choco Fudge 500 GM

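Some products are packed in crates, for example "Duru Body Soap Fruity 125 GM - 4 Pcs".

I want to extract the quantity and the crate size (0 if the product is not a crate).

The quantity is identified by the units GM, KG, ML, L and Oz; the crate size is identified by Pcs.

Edit:

I would like to add more examples that complicate the approach Onyambu suggested.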

 Signal Complete8 Actions White Toothpaste 120Ml 
 Fresh Plums Red Per KG
 Blemil Plus Baby Milk #2 800 GM
 7Up Drink Can 330 ML
 Lipton Chai Latte 3 In 1 Classic 25.7 Gm - 7 Pcs
 Lusine 6 Burger Buns Plain 400 GM
 Farleys Baby Food 3 Fruits 120 GM
 Clorox Regular + 40% Extra 3.7 L
 Clorox 5 In 1 Disinfectant Cleaner Orange 3 L
 Almarai Cheese 6 Portions 108 GM - 2+1 Pcs
 3 Cow Feta Cheese Low Salt 200 GM
 S-26 Pro Gold Baby Milk #1 900 GM

1 answer:

Answer 0: (score: 1)

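 library(tidyverse)
 dat %>%
   # Lazily skip everything before the first digit and keep the rest of the name
   mutate(s = gsub(".*?(\\d+.*)", "\\1", V1)) %>%
   # Split e.g. "125 GM - 4 Pcs" on " - "; names without a crate part get NA
   separate(s, c("quantity", "crate_size"), " - ", fill = "right") %>%
   # crate_size is a character column, so the replacement is the string "0"
   replace_na(list(crate_size = "0"))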
                                                      V1 quantity crate_size
1                          Alkabeer Paratha Plain 400 GM   400 GM          0
2                         Almarai Fresh Laban Baladi 2 L      2 L          0
3                  Americana Breaded Chicken Burger 1 KG     1 KG          0
4                                  Dac Glass Cleaner 4 L      4 L          0
5                   Duru Body Soap Fruity 125 GM - 4 Pcs   125 GM      4 Pcs
6                  Lux Liquid Handwash Soft Touch 250 ML   250 ML          0
7              Lux Liquid Handwash Magical Beauty 250 ML   250 ML          0
8                 Lusine Sliced Bread Multi Grain 600 GM   600 GM          0
9                   Orinex Containers Bowl 25 Oz - 4 Pcs    25 Oz      4 Pcs
10                 Betty Crocker Frosting Vanilla 400 GM   400 GM          0
11                      Freshly Microwave Popcorn 3.5 Oz   3.5 Oz          0
12                           Gandour Potato Chips 145 Gm   145 Gm          0
13                           Galaxy Chocolate Milk 40 GM    40 GM          0
14            Nahool Jumbo Roll Strawberry 75 GM - 6 Pcs    75 GM      6 Pcs
15                Nestle Sweetened Condensed Milk 397 GM   397 GM          0
16        Puck Cheese Triangle Value Pack 120 GM - 5 Pcs   120 GM      5 Pcs
17 Betty Crocker Super Moist Cake Mix Choco Fudge 500 GM   500 GM          0
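
To make the behaviour of the regular expression concrete, here is a small base R sketch. The vector `x` is made up for illustration: its first element comes from the original list and its second from the question's edit. The pattern keeps everything from the first digit onward, which is the size only when no digit appears earlier in the name.

```r
# Two illustrative names: one from the original list, one from the question's edit.
x <- c("Duru Body Soap Fruity 125 GM - 4 Pcs",
       "Lusine 6 Burger Buns Plain 400 GM")

# ".*?" lazily consumes the text before the first digit;
# "(\\d+.*)" captures from that digit to the end of the string.
rest <- gsub(".*?(\\d+.*)", "\\1", x)
print(rest)
```

For the second name the captured text starts at the "6" in "6 Burger Buns", so this pattern alone probably will not cope with the names added in the edit.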

The same extraction in base R:

read.table(sep = "-", text = gsub(".*?(\\d+.*)", "\\1", dat$V1), fill = TRUE, header = FALSE,
      col.names = c("Quantity", "Crate_Size"), na.strings = "", strip.white = TRUE)
   Quantity Crate_Size
1    400 GM       <NA>
2       2 L       <NA>
3      1 KG       <NA>
4       4 L       <NA>
5    125 GM      4 Pcs
6    250 ML       <NA>
7    250 ML       <NA>
8    600 GM       <NA>
9     25 Oz      4 Pcs
10   400 GM       <NA>
11   3.5 Oz       <NA>
12   145 Gm       <NA>
13    40 GM       <NA>
14    75 GM      6 Pcs
15   397 GM       <NA>
16   120 GM      5 Pcs
17   500 GM       <NA>
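
The names added in the question's edit contain digits before the quantity ("7Up", "3 In 1", "#2"), so taking everything from the first digit no longer isolates the size. One possible workaround, sketched here and not part of the original answer, is to anchor on the unit tokens listed in the question (GM, KG, ML, L, Oz) and on "Pcs". The helper name `extract_size` and its output columns are hypothetical. The sketch assumes that the unit is followed by a word boundary, that `crate_size` should stay a character column (because of values such as "2+1"), and that a name with no number, such as "Fresh Plums Red Per KG", should yield `NA` for the quantity.

```r
# Hypothetical helper (not from the original answer): pull quantity, unit and
# crate size out of product names by anchoring on the known unit tokens.
extract_size <- function(x) {
  # A number (optionally decimal), optional whitespace, then a known unit.
  qty_re   <- "(\\d+(?:\\.\\d+)?)\\s*(GM|KG|ML|L|OZ)\\b"
  # A crate count such as "4" or "2+1", followed by "Pcs".
  crate_re <- "(\\d+(?:\\+\\d+)?)\\s*Pcs\\b"

  # First match per string; names without a match yield character(0).
  qty   <- regmatches(x, regexec(qty_re,   x, ignore.case = TRUE, perl = TRUE))
  crate <- regmatches(x, regexec(crate_re, x, ignore.case = TRUE, perl = TRUE))

  # Indexing past the end of character(0) gives NA, which marks "not found".
  pick <- function(m, i) vapply(m, function(v) v[i], character(1))

  crate_size <- pick(crate, 2)
  crate_size[is.na(crate_size)] <- "0"  # "0" when the product is not a crate

  data.frame(
    name       = x,
    quantity   = as.numeric(pick(qty, 2)),
    unit       = toupper(pick(qty, 3)),
    crate_size = crate_size,
    stringsAsFactors = FALSE
  )
}

# Usage on a few names taken from the question.
examples <- c("Signal Complete8 Actions White Toothpaste 120Ml",
              "Fresh Plums Red Per KG",
              "Lipton Chai Latte 3 In 1 Classic 25.7 Gm - 7 Pcs",
              "Almarai Cheese 6 Portions 108 GM - 2+1 Pcs")
res <- extract_size(examples)
print(res)
```

Splitting the number from the unit is a design choice of this sketch; pasting `quantity` and `unit` back together would reproduce the "400 GM" style used in the output above.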