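I have the following problem. My dataset a contains a badly formatted column that mixes digits, letters and punctuation. I want to split the Unit_Wrong column into two columns, num and text.
Here is dataset a: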
a <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0),
Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"),
stringsAsFactors = FALSE)
My expected result is b:
b <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0),
Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"),
text = c("L", "mL", "ml", "OUNCES", "X", "NO_SIZE"),
num = c("10","25.5","30.5","40","3", ""),
stringsAsFactors = FALSE)
I tried this, but it does not work:
attempt <- a %>%
mutate(text = gsub("[[:digit:]]","", Unit_Wrong)) %>%
mutate(num = str_replace_all(Unit_Wrong, text, ""))
Can you help?
Answer 0: (score: 3)
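library(dplyr)  # provides %>% and mutate()

a %>%
  # [A-z] spans ASCII 65-122, so it also matches [ \ ] ^ _ ` (hence NO_SIZE is kept whole)
  mutate(text = stringr::str_extract(Unit_Wrong, "[A-z]+$")) %>%
  mutate(num = stringr::str_extract(Unit_Wrong, "(\\d\\.?)+") %>% as.numeric())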
Output:
  Measure Unit_Wrong    text  num
1   10000        10L       L 10.0
2    2000     25.5mL      mL 25.5
3   10000    30.5 mL      mL 30.5
4   15000   40OUNCES  OUNCES 40.0
5   40000         3X       X  3.0
6       0    NO_SIZE NO_SIZE   NA
Note:
If your units contain special characters such as "µ", you need to add them inside the character class: use [A-zµ] in place of [A-z], and so on.
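As a minimal base-R sketch of the note above: the `units` vector and the `unit_pattern` name are made up for illustration and are not part of the original answer. It also checks that `[A-z]` accepts a few punctuation characters, so spelling out the class explicitly may be the safer choice when adding extra symbols.

```r
# Hypothetical sample values, including a micro-litre unit (\u00b5 is the micro sign)
units <- c("10L", "25.5mL", "5\u00b5L", "NO_SIZE")

# [A-z] covers ASCII 65-122, which includes [ \ ] ^ _ ` between Z and a
loose  <- grepl("^[A-z]$",    c("L", "_", "^"), perl = TRUE)
strict <- grepl("^[A-Za-z]$", c("L", "_", "^"), perl = TRUE)

# Explicit class: ASCII letters, underscore and the micro sign
unit_pattern <- "[A-Za-z_\u00b5]+$"

# regexpr() locates the trailing unit; regmatches() pulls it out
# (every element matches here, so no values are dropped)
text <- regmatches(units, regexpr(unit_pattern, units, perl = TRUE))
print(text)
```

The underscore is listed explicitly so that values such as NO_SIZE are still captured whole, which is what `[A-z]` does implicitly.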
Answer 1: (score: 1)
Here is a way using gsub:
> text <- gsub("\\d*\\s*\\.*", "", a$Unit_Wrong)
> num <- as.numeric(gsub("\\s*[A-Za-z]*_*", "", a$Unit_Wrong))
> data.frame(a, text, num)
Measure Unit_Wrong text num
1 10000 10L L 10.0
2 2000 25.5mL mL 25.5
3 10000 30.5 mL mL 30.5
4 15000 40OUNCES OUNCES 40.0
5 40000 3X X 3.0
6 0 NO_SIZE NO_SIZE NA
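The two gsub calls can also be folded into one anchored pattern with two capture groups. This is only a sketch under the assumption that the number, when present, always comes first in the string; `split_pattern` is a name chosen here for illustration.

```r
a <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0),
                Unit_Wrong = c("10L", "25.5mL", "30.5 mL", "40OUNCES", "3X", "NO_SIZE"),
                stringsAsFactors = FALSE)

# Group 1: optional leading number; group 2: whatever follows any spaces
split_pattern <- "^([0-9]*\\.?[0-9]*)[[:space:]]*(.*)$"

# Keep only the second group (the unit text)
a$text <- sub(split_pattern, "\\2", a$Unit_Wrong)

# Keep only the first group; an empty string (no leading number) becomes NA
a$num <- as.numeric(sub(split_pattern, "\\1", a$Unit_Wrong))

print(a)
```

On the sample data this gives the same text and num columns as shown above, with NA in num for NO_SIZE.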