我正在努力从文本字符串中创建一个新变量。以下是我的数据框的示例:
Brand Pack_Content
1 Dove 4X25 G
2 Snickers 250 G
3 Twix 2X20.7 G
4 Korkunov BULK
我想创建一个名为Grams的数字变量。我已经尝试过使用gsub或单独使用的解决方案,但需要按行划分不同的解决方案(即,有些人需要将Brand Packs与多个包装相乘(即4X25 G))让我感到难过。使用dplyr的溶液是优选的。
Brand Pack_Content Grams
1 Dove 4X25 G 100
2 Snickers 250 G 250
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000
答案 0 :(得分:3)
更新:添加了一些单位提取和转换只是为了它的哎呀
更新2:投入了一些验证步骤(如果没有其他人的话,供我自己参考)应该是原始答案的一部分。一般来说,如果你使用正则表达式来提取值(并且你没有时间详细检查每一行输出),当一些未被考虑的角点输入格式出现时很容易被烧掉
使用data.table
,stringi
以及正则表达式的甜美,神奇:
由于正则表达式很难自行完成,我认为将重点放在使转换步骤可读且清晰定义而不是试图将其全部塞入一系列管道和少量代码行中是一种更安全的选择。可能的。
由于dplyr
不允许逐步操作(无管道)而不在每个表达式后重新分配tibble,我觉得data.table
是这种类型的更优雅和有效的工具数据改变工作。
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
首先,我们将剥离空格并使所有内容都为大写
## Strip out Spaces
DT[,Pack := gsub("[[:space:]]+","",Pack)]
## Make everything Uppercase
DT[,Pack := toupper(Pack)]
在我们使用正则表达式提取值并对它们进行一些数学运算之前,做一些验证步骤可能是谨慎的,以确保我们不会因为意外的极端情况而被烧毁。
## Start off by trusting nothing
DT[,Valid := FALSE]
## Mark Packs that fit formats like "BULK" as valid
DT[Pack %in% c("BULK"),Valid := TRUE]
## Mark Packs that fit formats like "4X20G" or "3.0X3KG" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
## Mark Packs that fit formats like "250G" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
print(DT)
此时:
Brand Pack Valid
1: Dove 4X25G TRUE
2: Snickers 0.250KG TRUE
3: Twix 2X20.7G TRUE
4: Korkunov BULK TRUE
5: Reeses 2.5.5X4G FALSE
6: M&M's 2X3X3G FALSE
请注意,我们只填充满足预定义格式的行的值。
## Extract the first number at the beginning of the "Pack" column followed by an X
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
## Extract last number out of the "Pack" column
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
## Extract the Units
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
print(DT)
现在我们有了以下内容:
Brand Pack Valid Quantity Unit_Weight Units
1: Dove 4X25G TRUE 4 25.00 G
2: Snickers 0.250KG TRUE NA 0.25 KG
3: Twix 2X20.7G TRUE 2 20.70 G
4: Korkunov BULK TRUE NA NA BULK
5: Reeses 2.5.5X4G FALSE NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA
现在我们只需返回并填写没有重量或数量的行,可选择转换单位等,以便我们计算重量。
## Start with a standard conversion factor of 1
DT[Valid == TRUE, Unit_Factor := 1]
## Make some Unit Conversions
DT[Units == "KG", Unit_Factor := 1000]
## Fill in Rows without a quantity with a value of 1
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
## Fill in a weight for Bulk units
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
## And finally, calculate Weight in grams
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
print(DT)
产生最终结果:
Brand Pack Valid Quantity Unit_Weight Units Unit_Factor Grams
1: Dove 4X25G TRUE 4 25.00 G 1 100.0
2: Snickers 0.250KG TRUE 1 0.25 KG 1000 250.0
3: Twix 2X20.7G TRUE 2 20.70 G 1 41.4
4: Korkunov BULK TRUE 1 1000.00 G 1 1000.0
5: Reeses 2.5.5X4G FALSE NA NA NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA NA NA
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
DT[,Pack := gsub("[[:space:]]+","",Pack)]
DT[,Pack := toupper(Pack)]
DT[,Valid := FALSE]
DT[Pack %in% c("BULK"),Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
DT[Valid == TRUE, Unit_Factor := 1]
DT[Units == "KG", Unit_Factor := 1000]
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
我假设你没有包含原始数据所有地方的所有杂乱,脏的细节,所以你可能需要添加更多的步骤来捕获你有磅而不是克的情况(和所有其他角落的情况。)
尽管如此,有了5-7个正则表达式,我认为你可能至少可以覆盖相当数量的潜在案例。
我大部分时间都将this Regex cheatsheet on RStudio's website置于武器之中。
答案 1 :(得分:3)
使用dplyr和tidyr的解决方案。关键是在使用Pack_Content_new
分隔case_when
列之前,替换所有字符串,例如&#34; G&#34;或者&#34; BULK&#34;用&#34;&#34;或有意义的数字。如果您有多个有意义的字符串,例如&#34; BULK&#34;,除了recode
之外,您可能还想使用separate
。在NA
函数之后,我们可以在Number
列中将Grams
替换为1。 Finnaly,我们可以根据Number
和Unit_Weight
中的数字来计算library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate(Pack_Content_new = sub("G$", "", Pack_Content)) %>% # Remove the last G
mutate(Pack_Content_new = recode(Pack_Content_new, # Replace BULK with 1000
`BULK` = "1000")) %>%
separate(Pack_Content_new, into = c("Number", "Unit_Weight"), # Separate the Pack_Content_new column
sep = "X", convert = TRUE,
fill = "left") %>%
replace_na(list(Number = 1)) %>% # Replace NA in Number with 1
mutate(Grams = Number * Unit_Weight) # Calculate the Grams
dat2
# Brand Pack_Content Number Unit_Weight Grams
# 1 Dove 4X25 G 4 25.0 100.0
# 2 Snickers 250 G 1 250.0 250.0
# 3 Twix 2X20.7 G 2 20.7 41.4
# 4 Korkunov BULK 1 1000.0 1000.0
。
dat <- read.table(text = " Brand Pack_Content
1 Dove '4X25 G'
2 Snickers '250 G'
3 Twix '2X20.7 G'
4 Korkunov 'BULK'",
header = TRUE, stringsAsFactors = FALSE)
数据强>
{{1}}
答案 2 :(得分:1)
我知道你需要一个plyr解决方案。你尝试过Base R的所有方法吗?那么这里只是一个小的。希望这有帮助,即使它不是一种普通的方法。
首先,您需要保留数字,并将X
替换为*
。这是通过使用sub
函数完成的。我们还用1000替换不包含数字的那个。然后我们只评估获得的内容:
A=sub("X","*",sub("\\s.*","",dat$Pack_Content))
transform(dat,Grams=sapply(parse(text=replace(A,-grep("\\d",A),1000)),eval))
Brand Pack_Content Grams
1 Dove 4X25 G 100.0
2 Snickers 250 G 250.0
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000.0
使用的数据:
dat=structure(list(Brand = c("Dove", "Snickers", "Twix", "Korkunov"
), Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK")), .Names = c("Brand",
"Pack_Content"), class = "data.frame", row.names = c("1", "2",
"3", "4"))