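I am working with a data set that needs a lot of cleaning. It has a teacher-name column from which I can't seem to remove the leading white space.
Sample data: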
Data <- structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher",
"Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"),
Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher",
"Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
))
My cleaning attempt:
library(dplyr)

Data_clean <- Data %>%
  mutate(Teacher = as.character(Teacher),
         Teacher = gsub("Please.rate.teacher|.Overall.rating.for.teacher|[:]", "", Teacher),
         Teacher = gsub("[.]", " ", Teacher),
         Teacher = trimws(Teacher),
         Teacher = tolower(Teacher),
         Teacher = tools::toTitleCase(Teacher))
The result still has leading and trailing white space, which also breaks the title case of the second name:
unique(Data_clean$Teacher)
[1] "John Doe " " jane Doe"
The first name still has a trailing space, and the second name has a leading space.
How can I remove them?
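Answer 0 (score: 1)
Here is a fully reproducible example using stringr and str_trim, since I don't know why trimws isn't working for you. The code you posted gives me the same output as the code below: it correctly converts the names to title case and removes the white space.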
data <- structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher",
"Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"),
Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher",
"Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
))
library(tidyverse)
data %>%
  mutate(
    Teacher = Teacher %>%
      str_remove_all("Please.rate.teacher:|.Overall.rating.for.teacher") %>%
      str_replace_all("\\.", " ") %>%
      str_trim() %>%
      str_to_title()
  ) %>%
  `[[`(1) %>%
  unique()
#> [1] "John Doe" "Jane Doe"
Created on 2018-03-15 by the reprex package (v0.2.0).
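Answer 1 (score: 1)
I suspect your data contains non-ASCII white space such as "\u00A0" (a no-break space). By default, the trimws function only removes ASCII white-space characters (space, tab, newline and carriage return).
Try running utf8::utf8_print(unique(Data_clean$Teacher), utf8 = FALSE) to see whether this is the case.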
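If the utf8 package is not installed, a similar check can be sketched in base R. The strings below are hypothetical stand-ins for the output shown in the question, built on the assumption that the stray characters are U+00A0 no-break spaces; with real data, pass unique(Data_clean$Teacher) instead.

```r
# Hypothetical strings mimicking the question's output, assuming the
# leftover white space is U+00A0 (no-break space).
teachers <- c("John Doe\u00A0", "\u00A0jane Doe")

# utf8ToInt() returns the Unicode code points of a single string:
# 32 is an ASCII space, 160 (0xA0) is a no-break space.
code_points <- lapply(teachers, utf8ToInt)

# TRUE for every string that contains a no-break space.
has_nbsp <- vapply(code_points, function(cp) any(cp == 160L), logical(1))
print(has_nbsp)

# With its default settings, trimws() leaves such strings unchanged.
unchanged <- identical(trimws(teachers), teachers)
print(unchanged)
```

Any code point of 160 at the start or end of a string would explain why trimws appears to do nothing.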
To handle non-ASCII white space, replace trimws(x) in your code with
gsub("(^[[:space:]]*)|([[:space:]]*$)", "", x)
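As a hedged alternative, whether [[:space:]] matches a no-break space may depend on the platform, so it can be safer to name the Unicode white-space classes explicitly. The sketch below assumes R 3.6.0 or later, where trimws accepts a whitespace argument, and again uses hypothetical strings containing U+00A0 in place of the real column.

```r
# Hypothetical input, assuming the stray characters are U+00A0.
teachers <- c("John Doe\u00A0", "\u00A0Jane Doe", "  Plain Ascii \t")

# "[\\h\\v]" is a PCRE class for horizontal and vertical white space,
# which covers the no-break space as well as ASCII space and tab.
trimmed <- trimws(teachers, whitespace = "[\\h\\v]")
print(trimmed)

# The same idea written directly with gsub(); perl = TRUE is required
# because \h and \v are PCRE escapes.
trimmed_gsub <- gsub("^[\\h\\v]+|[\\h\\v]+$", "", teachers, perl = TRUE)
print(identical(trimmed, trimmed_gsub))
```

In the pipeline from the question, this would mean swapping trimws(Teacher) for trimws(Teacher, whitespace = "[\\h\\v]").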
Answer 2 (score: 0)
How about this?
Data_clean <- Data %>%
  mutate(Teacher = gsub("Please.rate.teacher|\\s*\\.Overall.rating.for.teacher|:", "", Teacher),
         Teacher = gsub("\\.", " ", Teacher),
         Teacher = trimws(Teacher),
         Teacher = tolower(Teacher),
         Teacher = tools::toTitleCase(Teacher))
unique(Data_clean$Teacher)
#[1] "John Doe" "Jane Doe"
Explanation: the pattern also removes any optional (>= 0) white-space characters that occur before ".Overall.rating..." in Teacher.
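To see what the added \\s* changes, here is a small sketch that applies the question's pattern and this answer's pattern to the first label from the sample data. It assumes the space before ".Overall" is an ordinary ASCII space, as it appears in the posted dput output; the variable names are illustrative only.

```r
# First factor label from the sample data (ASCII space assumed).
label <- "Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher"

# Pattern from the question: the space before ".Overall" is not part
# of the match, so it is left behind as a trailing space.
old <- gsub("Please.rate.teacher|.Overall.rating.for.teacher|[:]", "", label)

# Pattern from this answer: \\s* consumes zero or more white-space
# characters directly before the literal ".Overall...".
new <- gsub("Please.rate.teacher|\\s*\\.Overall.rating.for.teacher|:", "", label)

# nchar() makes the otherwise invisible trailing space visible.
print(c(old = nchar(old), new = nchar(new)))
```

The two results differ by exactly one character, the space that the original pattern leaves at the end of the string.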
Data <- structure(list(Teacher = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Please.rate.teacher:.JOHN.DOE .Overall.rating.for.teacher",
"Please.rate.teacher: Jane.Doe.Overall.rating.for.teacher"), class = "factor"),
Overall_Rating = c(5L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
3L, 5L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L)), .Names = c("Teacher",
"Overall_Rating"), class = "data.frame", row.names = c(NA, -22L
))