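I am trying to write code that parses a single column containing several pieces of information. For example, say I have the following data frame called df: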
ids info
1 101 red;circle
2 103 circle;blue
3 122 red;green
4 102 circle;red;green
5 213 blue
6 170 red;blue
When I run table(df), I get the following:
table(df)
info
ids blue circle;blue circle;red;green red;blue red;circle
101 0 0 0 0 1
102 0 0 1 0 0
103 0 1 0 0 0
122 0 0 0 0 0
170 0 0 0 1 0
213 1 0 0 0 0
info
ids red;green
101 0
102 0
103 0
122 1
170 0
213 0
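What I want to do is (1) split the info column into two columns, one for shape and one for color, and (2) label any ID that has more than one color as "Multicolored". So I wrote the following: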
df$shape <- as.character(df$info)
for (i in 1:dim(df)[1]){
if (grepl("circle",df$info[i])==TRUE) {
df$shape[i] <- "circle"
} else if (grepl("circle",df$info[i])==FALSE) {
df$shape[i]<-NA}
}
for (i in 1:dim(df)[1]){
if (grepl(";",df$info[i])==TRUE) {
df$info[i] <- "Multicolored"
} else {df$info[i]<-df$info[i]}
}
From this code I get the output:
df
ids info shape
1 101 Multicolored circle
2 103 Multicolored circle
3 122 Multicolored <NA>
4 102 Multicolored circle
5 213 blue <NA>
6 170 Multicolored <NA>
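As my code is written, it says that an instance like 101 red;circle is multicolored when in fact it is not: it is just red and a circle. What is the proper way to parse this data when "circle" can appear at the beginning, middle, or end of the info column? Any and all suggestions are welcome, thanks!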
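Answer 0 (score: 1)
For this type of problem, it probably makes sense to split the string on ; and then work with the resulting vector of strings. For example,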
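# Split each info string on ";" (as.character() guards against a factor column)
mystrings <- strsplit(as.character(df$info), ";")
# Return the single element of x found in s, `none` if there is no match,
# and `multiple` if there is more than one match
getStrings <- function(x, s, none = NA_character_, multiple = "Multicolored") {
  hits <- x[x %in% s]
  if (length(hits) == 0) none else if (length(hits) == 1) hits else multiple
}
df$shape <- sapply(mystrings, FUN = getStrings, s = c("circle"))
df$color <- sapply(mystrings, FUN = getStrings, s = c("red", "green", "blue"))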
Personally, I find this approach easier than trying to use pure regular expressions and if statements.
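For reference, a self-contained run of the split-then-match idea on the question's sample data might look like the sketch below. The data frame is rebuilt by hand, `pick_one` is a hypothetical helper name used only for this illustration, and the "Multiple shapes" label is an assumption (the question only defines a label for colors).

```r
# Rebuild the sample data from the question (info as character, not factor)
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

# One character vector of tokens per row
tokens <- strsplit(df$info, ";", fixed = TRUE)

# pick_one() is a hypothetical helper: the single token found in `s`,
# NA if none is found, or the `multiple` label if several are found
pick_one <- function(x, s, multiple) {
  hits <- x[x %in% s]
  if (length(hits) == 0) {
    NA_character_
  } else if (length(hits) == 1) {
    hits
  } else {
    multiple
  }
}

df$shape <- vapply(tokens, pick_one, character(1),
                   s = "circle", multiple = "Multiple shapes")
df$color <- vapply(tokens, pick_one, character(1),
                   s = c("red", "green", "blue"), multiple = "Multicolored")

df[, c("ids", "shape", "color")]
#   ids  shape        color
# 1 101 circle          red
# 2 103 circle         blue
# 3 122   <NA> Multicolored
# 4 102 circle Multicolored
# 5 213   <NA>         blue
# 6 170   <NA> Multicolored
```

Because each token is compared as a whole string, the position of "circle" within the info value does not matter, and a separator between a color and a shape no longer triggers the "Multicolored" label.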
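Answer 1 (score: 0)
I like @farnsy's answer, but I wanted to post my solution, which is basically similar but does not require you to specify the colours (it assumes that everything that is not a shape is a colour).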
# Load the data
df <- read.table(textConnection('ids info
1 101 red;circle
2 103 circle;blue
3 122 red;green
4 102 circle;red;green
5 213 blue
6 170 red;blue'),stringsAsFactors=FALSE)
# Split your column.
split.col <- strsplit(df$info,';')
# Specify which words are considered shapes.
shapes <- c('circle') # Could include more
# Find which rows had shapes.
df$shape <- sapply(split.col, function(x) x[x %in% shapes][1]) # Only select the first shape found (NA if none)
# The rest must be colours, count them.
num.colours <- sapply(split.col, function(x) length(setdiff(x, shapes)))
df$multicoloured <- num.colours > 1
df
# ids info shape multicoloured
# 1 101 red;circle circle FALSE
# 2 103 circle;blue circle FALSE
# 3 122 red;green <NA> TRUE
# 4 102 circle;red;green circle TRUE
# 5 213 blue <NA> FALSE
# 6 170 red;blue <NA> TRUE
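If a single colour column is wanted instead of a logical flag, the same "everything that is not a shape is a colour" assumption can be carried one step further. The following is only a sketch: the sample values are retyped from the question, and the variable names `parts` and `colour` are made up for this illustration.

```r
# Sample values retyped from the question's info column
info <- c("red;circle", "circle;blue", "red;green",
          "circle;red;green", "blue", "red;blue")

shapes <- c("circle")                      # assumed list of known shapes
parts  <- strsplit(info, ";", fixed = TRUE)

colour <- vapply(parts, function(x) {
  cols <- setdiff(x, shapes)               # tokens that are not shapes
  if (length(cols) > 1) {
    "Multicolored"
  } else if (length(cols) == 1) {
    cols
  } else {
    NA_character_                          # row contained only shapes
  }
}, character(1))

colour
# [1] "red"          "blue"         "Multicolored" "Multicolored" "blue"
# [6] "Multicolored"
```

Note that setdiff() drops duplicates, so a value such as "red;red" would count as one colour here.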
Answer 2 (score: 0)
You could also try:
pat1 <- paste0(c("red","blue", "green"), collapse="|")
shape1 <- gsub(paste(pat1, ";", sep="|"), "", df$info)
shape1[shape1==''] <- NA
df[,c("info", "shape")] <- as.data.frame(do.call(rbind,
Map(`c`, lapply(regmatches(df$info, gregexpr(pat1, df$info)), function(x) {
if(length(x)>1) "Multicolored" else x}), shape1)), stringsAsFactors=FALSE)
df
# ids info shape
#1 101 red circle
#2 103 blue circle
#3 122 Multicolored <NA>
#4 102 Multicolored circle
#5 213 blue <NA>
#6 170 Multicolored <NA>
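To make the regular-expression version easier to follow, the intermediate values can be inspected one at a time. This is just a sketch on the question's sample values, retyped here as a plain character vector named `info`; the name `colour_hits` is made up for this illustration.

```r
# Sample values retyped from the question's info column
info <- c("red;circle", "circle;blue", "red;green",
          "circle;red;green", "blue", "red;blue")

# Alternation pattern that matches any of the listed colours
pat1 <- paste0(c("red", "blue", "green"), collapse = "|")
pat1
# [1] "red|blue|green"

# Deleting every colour and every ";" leaves only the shape text
shape1 <- gsub(paste(pat1, ";", sep = "|"), "", info)
shape1
# [1] "circle" "circle" ""       "circle" ""       ""

# gregexpr() locates every colour match; regmatches() extracts them per row
colour_hits <- regmatches(info, gregexpr(pat1, info))
lengths(colour_hits)
# [1] 1 1 2 2 1 2
```

Rows with more than one match are the ones relabelled "Multicolored". Because the pattern matches substrings, a hypothetical value such as "darkred" would also count as a colour match, which the split-on-";" approaches avoid.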