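I have 95 separate .csv files in a directory, containing information about public employees in 19 different cities. The datasets mostly share the same structure and mostly list the data in the same format, but there are some outliers I want to normalize. Later in my analysis I want to do some matching on employee names, so normalizing that column is the most important step. The operations I want to perform are:
Put every employee name column in first-name-first, last-name-last order (most datasets are already formatted correctly, with no comma, e.g. [1,1] John H. Doe; the problematic columns look like [1,1] Doe,John H.)
Remove all punctuation
To do this, I created this function: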
library(stringr)
fixNames <- function(x){
  if(length(grep(",", x[1])) > 0){
    names <- as.character(x)
    names <- str_split(names, ",", simplify = TRUE)
    newNames <- paste(names[,2], names[,1], sep = " ")
    newNames <- tolower(newNames)
    newNames <- gsub("[.]", "", names)
    newNames <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", names)
    newNames <- gsub("\\s{2,}", " ", names)
    newNames <- gsub("*\\bii\\b", "", names)
    newNames <- gsub("*\\biii\\b", "", names)
    newNames <- gsub("*\\bjr\\b", "", names)
    newNames <- trimws(newNames, which="both")
    x <- newNames
  } else{
    names <- as.character(x)
    newNames <- tolower(names)
    newNames <- gsub("[.]", "", names)
    newNames <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", names)
    newNames <- gsub("\\s{2,}", " ", names)
    newNames <- gsub("*\\bii\\b", "", names)
    newNames <- gsub("*\\biii\\b", "", names)
    newNames <- gsub("*\\bjr\\b", "", names)
    newNames <- trimws(newNames, which="both")
  }
  x <- newNames
}
Then I apply the function with a series of for loops:
## rename all employee.name columns and make them character vectors
for(i in 1:length(files)){
  ifelse(exists(files[i]) == TRUE, dinger <- get(files[i], envir = .GlobalEnv), dinger <- data.frame(matrix(NA, nrow = 0, ncol = 11)))
  dinger[,1] <- as.character(dinger[,1])
  colnames(dinger) <- c("employee.name","job.title", "base.pay", "overtime.pay", "other.pay", "total.benefits", "total.pay","total.pay.benefits", "year", "notes", "jurisdiction.name")
  dinger <- dinger[,c(1:11)]
  assign(files[i], dinger)
}
## fix names
for(i in 1:length(files)){
  index <- files[i]
  dinger <- get(index, envir = .GlobalEnv)
  names <- as.data.frame(as.character(dinger$employee.name))
  dinger$employee.name <- apply(names, 2, fixNames)
  assign(files[i], dinger)
}
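When I do this, it throws an error like this:
Error in `$<-.data.frame`(`*tmp*`, "employee.name", value = c("Till", :
  replacement has 384 rows, data has 192
I understand this to mean that after the function splits the names and reorders them, the two halves are not being put back together. The question is, why? Does it have something to do with the for loop?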
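For reference, the doubling seems to be reproducible without any loop. The following is a minimal sketch on a hypothetical two-name vector (the sample names and the variables `parts`, `n_from_matrix` and `n_from_vector` are made up for illustration and are not taken from the real files). It compares the length of the result when `gsub()` is handed the split matrix with the length when it is handed the re-joined vector:

```r
library(stringr)

# Hypothetical sample data standing in for one employee.name column
x <- c("Doe, John H.", "Till, Ann")

# Split on the comma: one row per name, one column per piece
parts <- str_split(x, ",", simplify = TRUE)

# Re-join as "first last": one element per input name
newNames <- paste(parts[, 2], parts[, 1], sep = " ")

# gsub() on the split matrix yields one element per matrix cell
n_from_matrix <- length(gsub("[.]", "", parts))

# gsub() on the re-joined vector yields one element per name
n_from_vector <- length(gsub("[.]", "", newNames))

print(c(input = length(x), from_matrix = n_from_matrix, from_vector = n_from_vector))
```

If this mirrors what happens inside fixNames, the result is twice as long as the column whenever a cleaning step is given the split `names` object instead of the running `newNames` value, which would match 384 rows against 192.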