I have a very long list of strings, such as this machine-readable example:
A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))
So it looks like this:
> A
[[1]]
[1] "Biology"
[2] "Cell Biology"
[3] "Art"
[4] "Humanities, Multidisciplinary; Psychology, Experimental"
[5] "Astronomy & Astrophysics; Physics, Particles & Fields"
[6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"
[7] "Geriatrics & Gerontology"
[8] "Gerontology"
[9] "Management"
[10] "Operations Research & Management Science"
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"
I want to recode these terms and remove duplicates to get this result:
[1] "Science"
[2] "Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science"
[6] "Social Sciences; Science"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Science"
[11] "Science"
[12] "Social Sciences; Science"
So far, I have only come up with this:
stringedit <- function(A)
{
A <-gsub("Biology", "Science", A)
A <-gsub("Cell Biology", "Science", A)
A <-gsub("Art", "Arts & Humanities", A)
A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)
A <-gsub("Psychology, Experimental", "Social Sciences", A)
A <-gsub("Astronomy & Astrophysics", "Science", A)
A <-gsub("Physics, Particles & Fields", "Science", A)
A <-gsub("Economics", "Social Sciences", A)
A <-gsub("Mathematics", "Science", A)
A <-gsub("Mathematics, Applied", "Science", A)
A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)
A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)
A <-gsub("Geriatrics & Gerontology", "Science", A)
A <-gsub("Gerontology", "Social Sciences", A)
A <-gsub("Management", "Social Sciences", A)
A <-gsub("Operations Research & Management Science", "Science", A)
A <-gsub("Computer Science, Artificial Intelligence", "Science", A)
A <-gsub("Computer Science, Information Systems", "Science", A)
A <-gsub("Engineering, Electrical & Electronic", "Science", A)
A <-gsub("Statistics & Probability", "Science", A)
}
B <- lapply(A, stringedit)
But it does not work properly:
> B
[[1]]
[1] "Science"
[2] "Cell Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science; Science"
[6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Operations Research & Social Sciences Science"
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"
How can I get the correct output shown above?
Thank you very much in advance for your consideration!
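Answer 0 (score: 5)
I find the simplest approach is to use a two-column data.frame as a lookup, with one column for the course name and one for its category. Here is an example: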
course.categories <- data.frame(
Course =
c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology",
"Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics",
"Mathematics, Applied", "Mathematics, Interdisciplinary Applications",
"Geriatrics & Gerontology", "Operations Research & Management Science",
"Computer Science, Artificial Intelligence",
"Computer Science, Information Systems",
"Engineering, Electrical & Electronic", "Statistics & Probability",
"Psychology, Experimental", "Economics",
"Social Sciences, Mathematical Methods",
"Gerontology", "Management"),
Category =
c("Arts & Humanities", "Arts & Humanities", "Science", "Science",
"Science", "Science", "Science", "Science", "Science", "Science",
"Science", "Science", "Science", "Science", "Science", "Social Sciences",
"Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences"))
Then, assuming A is the list from your question:
sapply(strsplit(unlist(A), "; "),
function(x)
paste(unique(course.categories[match(x, course.categories[["Course"]]),
"Category"]),
collapse = "; "))
# [1] "Science" "Science"
# [3] "Arts & Humanities" "Arts & Humanities; Social Sciences"
# [5] "Science" "Social Sciences; Science"
# [7] "Science" "Social Sciences"
# [9] "Social Sciences" "Science"
# [11] "Science" "Social Sciences; Science"
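match compares the values from A against the course names in the course.categories dataset and reports which rows the matches occur on; those row indices are used to extract the category each course belongs to. Then unique ensures we keep each category only once, and paste puts things back together.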
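To make the match / unique / paste chain concrete, here is a minimal walkthrough for a single compound entry. The three-row `lookup` table and the variable names are hypothetical, introduced only for this illustration; the real table would be course.categories from above.

```r
# Hypothetical miniature lookup table, just for this walkthrough
lookup <- data.frame(
  Course   = c("Economics",
               "Mathematics, Interdisciplinary Applications",
               "Social Sciences, Mathematical Methods"),
  Category = c("Social Sciences", "Science", "Social Sciences"),
  stringsAsFactors = FALSE
)

entry <- paste("Economics",
               "Mathematics, Interdisciplinary Applications",
               "Social Sciences, Mathematical Methods",
               sep = "; ")

parts  <- strsplit(entry, "; ")[[1]]    # split the entry into three course names
rows   <- match(parts, lookup$Course)   # row index of each course in the lookup
cats   <- lookup$Category[rows]         # category per course, duplicates included
result <- paste(unique(cats), collapse = "; ")  # drop duplicates, then rejoin
result
# [1] "Social Sciences; Science"
```

A course name that is missing from the lookup would yield NA from match, so the result would contain "NA"; checking anyNA(rows) before pasting is one way to catch that.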
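Answer 1 (score: 4)
Let me start with an example. You have the string "Cell Biology". The first substitution, A <- gsub("Biology", "Science", A), turns it into "Cell Science". After that, the pattern "Cell Biology" no longer matches anything, so no further substitution takes place.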
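The effect is easy to reproduce in isolation. The short sketch below uses only base gsub; the variable names step1, step2 and art are made up for the illustration.

```r
x <- "Cell Biology"

# gsub replaces every occurrence of the pattern, even inside a longer term
step1 <- gsub("Biology", "Science", x)
step1
# [1] "Cell Science"

# The longer pattern can no longer match, so the string is left unchanged
step2 <- gsub("Cell Biology", "Science", step1)
step2
# [1] "Cell Science"

# The same thing happens when a short pattern occurs inside another word
art <- gsub("Art", "Arts & Humanities",
            "Computer Science, Artificial Intelligence")
art
# [1] "Computer Science, Arts & Humanitiesificial Intelligence"
```

This is why the order of the gsub calls matters and why unanchored patterns damage longer terms.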
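Since you are not really using regular expressions, I would rather use a kind of hash (a named character vector) to do the replacement: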
myhash <- c( "Science", "Science", "Arts & Humanities", "Arts & Humanities", "Social Sciences",
"Science", "Science", "Social Sciences", "Science", "Science", "Science", "Social Sciences",
"Science", "Social Sciences", "Social Sciences", "Science", "Science", "Science", "Science",
"Science" )
names( myhash ) <- c( "Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary",
"Psychology, Experimental", "Astronomy & Astrophysics", "Physics, Particles & Fields", "Economics",
"Mathematics", "Mathematics, Applied", "Mathematics, Interdisciplinary Applications",
"Social Sciences, Mathematical Methods", "Geriatrics & Gerontology", "Gerontology", "Management",
"Operations Research & Management Science", "Computer Science, Artificial Intelligence",
"Computer Science, Information Systems", "Engineering, Electrical & Electronic",
"Statistics & Probability" )
Now, given a string such as "Biology", you can quickly look up its category:
myhash[ "Biology" ]
I am not sure why you use a list rather than a character vector, so I will simplify your case:
A <- c("Biology","Cell Biology","Art",
"Humanities, Multidisciplinary; Psychology, Experimental",
"Astronomy & Astrophysics; Physics, Particles & Fields",
"Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
"Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science",
"Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
"Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")
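The hash lookup does not work for compound strings (those containing ";"). You can split them with strsplit. Then use unique to avoid duplicates and the paste function to put the pieces back together.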
stringedit <- function( x ) {
# first, split into subterms
a.all <- unlist( strsplit( x, "; *" ) ) ;
paste( unique( myhash[ a.all ] ), collapse= "; " )
}
unlist( lapply( A, stringedit ) )
The result is as desired:
[1] "Science" "Science" "Arts & Humanities" "Arts & Humanities; Social Sciences"
[5] "Science" "Social Sciences; Science" "Science" "Social Sciences"
[9] "Social Sciences" "Science" "Science" "Social Sciences; Science"
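Of course, you can instead call the *apply functions several times:
a.spl <- strsplit( A, "; *" )                                  # split each entry into subterms
a.spl <- lapply( a.spl, function( x ) unique( myhash[ x ] ) )  # look up categories, drop duplicates
sapply( a.spl, paste, collapse = "; " )                        # put each entry back together
This is neither more nor less efficient than the previous code.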
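Yes, you could achieve the same thing with regular expressions, but you would first have to split the strings and then use anchored patterns such as ^Biology$ to make sure they match "Biology" and not "Cell Biology" (unless you want a construct like ".*Biology"). In the end you would still have to get rid of the duplicates, and all of this is, in my opinion, (i) more verbose (and therefore more error-prone) and (ii) not worth the effort.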
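For comparison, here is a small sketch of what the anchored-regex route would look like on terms that have already been split. The vectors `terms`, `once` and `twice` are hypothetical names used only here.

```r
terms <- c("Biology", "Cell Biology")

# ^ and $ anchor the pattern to the whole string,
# so "Cell Biology" is not touched by the "Biology" rule
once <- gsub("^Biology$", "Science", terms)
once
# [1] "Science"      "Cell Biology"

# Every term still needs its own anchored call
twice <- gsub("^Cell Biology$", "Science", once)
twice
# [1] "Science" "Science"
```

One call per term, plus the splitting and de-duplication steps, is the verbosity the paragraph above refers to.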
Answer 2 (score: 2)
How about using switch?
science.category <- function(science){
switch(science,
"Biology" =,
"Cell Biology" =,
"Astronomy & Astrophysics" =,
"Physics, Particles & Fields" =,
"Mathematics" =,
"Mathematics, Applied" =,
"Mathematics, Interdisciplinary Applications" =,
"Geriatrics & Gerontology" =,
"Operations Research & Management Science" =,
"Computer Science, Artificial Intelligence" =,
"Computer Science, Information Systems" =,
"Engineering, Electrical & Electronic" =,
"Statistics & Probability" = "Science",
"Art" =,
"Humanities, Multidisciplinary" = "Arts & Humanities",
"Psychology, Experimental" =,
"Economics" =,
"Social Sciences, Mathematical Methods" =,
"Gerontology" =,
"Management" = "Social Sciences",
NA
)
}
a <- unlist(lapply(A, strsplit, split = " *; *"), recursive = FALSE)
a1 <- lapply(a, function(x) unique(sapply(x, science.category)))
sapply(a1, paste, collapse = "; ")
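Of course, this works only as long as the strings passed to switch match exactly. One mismatch and you will end up with NA. For more advanced use, you should write your own wrapper around the grep family of functions, or even agrep (handle with care).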
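As one possible shape for such a wrapper, the sketch below tries an exact lookup first and falls back to agrep only when the fuzzy match is unambiguous. The function name fuzzy_category, the four-entry `lookup` vector and the max.distance of 1 are all assumptions made for this illustration, not part of the answer above.

```r
# Hypothetical miniature lookup (named character vector)
lookup <- c("Biology"      = "Science",
            "Cell Biology" = "Science",
            "Art"          = "Arts & Humanities",
            "Gerontology"  = "Social Sciences")

# Exact match first; otherwise accept a fuzzy match only if exactly
# one known term lies within max.distance edits of the input.
fuzzy_category <- function(term, max.distance = 1) {
  if (term %in% names(lookup)) return(unname(lookup[term]))
  hits <- agrep(term, names(lookup), max.distance = max.distance,
                value = TRUE, fixed = TRUE)
  if (length(hits) == 1) unname(lookup[hits]) else NA_character_
}

fuzzy_category("Biology")      # exact match
fuzzy_category("Cell Biolgy")  # typo, resolved through agrep
fuzzy_category("Chemistry")    # unknown term, stays NA
```

Returning NA for ambiguous or missing matches keeps the failure visible instead of silently guessing, which seems to be the spirit of the "handle with care" remark; note that agrep matches the pattern approximately against substrings, so short inputs can hit several terms at once.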