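I have two lists (more precisely, atomic character vectors) that I want to compare using regular expressions in order to produce a subset of one of them. I could use a 'for' loop, but is there simpler code? The following example illustrates my case: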
# list of unique cities
city <- c('Berlin', 'Perth', 'Oslo')
# list of city-months, like 'New York-Dec'
temp <- c('Berlin-Jan', 'Delhi-Jan', 'Lima-Feb', 'Perth-Feb', 'Oslo-Jan')
# need sub-set of 'temp' for only 'Jan' month for only the items in 'city' list:
# 'Berlin-Jan', 'Oslo-Jan'
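Additional note: in the real case I need code for, the values equivalent to the 'month' part are more complex. They are random alphanumeric strings, and only the first two characters carry the information I am interested in (they must be '01').
Real case added: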
# equivalent of 'city' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
# equivalent of 'temp' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-[\d]{2}[0-9A-Z]+
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')
# sub-set wanted (must have '01' after the 'patient' ID part)
# 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76'
Answer 0 (score: 4)
Something like this?
temp <- temp[grepl("Jan", temp)]
temp[sapply(strsplit(temp, "-"), "[[", 1) %in% city]
# [1] "Berlin-Jan" "Oslo-Jan"
Better still, borrowing the idea from @agstudy:
temp[temp %in% paste0(city, "-Jan")]
# [1] "Berlin-Jan" "Oslo-Jan"
Edit: How about this?
sample[gsub("(.*-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"
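Since the question asks for a regular expression, the same filter can also be written as a single anchored pattern passed to grepl. This is only a minimal sketch: it assumes the patient IDs contain no regex metacharacters (true for the TCGA-style IDs shown), and the names pattern and wanted are purely illustrative.

```r
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# Build "^(id1|id2|...)-01": a sample must start with a known patient ID
# that is immediately followed by "-01"
pattern <- paste0("^(", paste(patient, collapse = "|"), ")-01")
wanted <- sample[grepl(pattern, sample)]
wanted
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"
```

The ^ anchor keeps a patient ID from matching in the middle of a longer string.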
Answer 1 (score: 3)
Here is a solution, along the lines of the others, that meets your new requirements:
sample[na.omit(pmatch(paste0(patient, '-01'), sample))]
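One caveat worth keeping in mind with pmatch: it returns NA when a prefix partially matches more than one element, so a patient with several '01' samples would be dropped entirely. The sketch below uses hypothetical data (sample2 is made up for illustration) and shows a fixed-width alternative, which assumes every patient ID is exactly 12 characters long, as in the example.

```r
patient <- c('TCGA-43-4897', 'TCGA-65-4897')
# Hypothetical data: the first patient has two '01' samples
sample2 <- c('TCGA-43-4897-01A159', 'TCGA-43-4897-01B200', 'TCGA-65-4897-01T76')

# pmatch gives NA for the first patient (ambiguous partial match)
# and 3 for the second (unique partial match)
idx <- pmatch(paste0(patient, '-01'), sample2)
idx

# Alternative: compare the first 15 characters (12-character ID plus "-01"),
# which keeps every '01' sample of every listed patient
kept <- sample2[substr(sample2, 1, 15) %in% paste0(patient, '-01')]
kept
```

If each patient is guaranteed to have at most one '01' sample, the pmatch version is fine as it stands.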
Answer 2 (score: 2)
You can use gsub:
x <- gsub(paste(paste(city,collapse='-Jan|'),'-Jan',sep=''),1,temp)
temp[x==1]
# [1] "Berlin-Jan" "Oslo-Jan"
The pattern here is:
"Berlin-Jan|Perth-Jan|Oslo-Jan"
Answer 3 (score: 1)
Here is a solution that uses two partial (approximate) string matches...
temp[agrep("Jan",temp)[which(agrep("Jan",temp) %in% sapply(city, agrep, x=temp))]]
# [1] "Berlin-Jan" "Oslo-Jan"
Just for fun...
fun <- function(x,y,pattern) y[agrep(pattern,y)[which(agrep(pattern,y) %in% sapply(x, agrep, x=y))]]
# x is the vector of values to filter by (here, city)
# y is the vector to be filtered (here, temp)
# pattern is the quoted pattern you're filtering on
fun(city, temp, "Jan")
# [1] "Berlin-Jan" "Oslo-Jan"