Parsing data with multiple strings in R

Time: 2014-08-22 16:46:32

Tags: r grepl

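I'm trying to write code that parses a single column containing several pieces of information. For example, say I have the following data frame called df: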

  ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue

Running table(df) gives the following:

    table(df)
         info
    ids   blue circle;blue circle;red;green red;blue red;circle
      101    0           0                0        0          1
      102    0           0                1        0          0
      103    0           1                0        0          0
      122    0           0                0        0          0
      170    0           0                0        1          0
      213    1           0                0        0          0
         info
    ids   red;green
      101         0
      102         0
      103         0
      122         1
      170         0
      213         0

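What I want to do is (1) split the info column into two columns, one for shape and one for color, and (2) label any ID that has more than one color as "Multicolored". So I wrote the following: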

df$shape <- as.character(df$info)
for (i in 1:dim(df)[1]){
  if (grepl("circle",df$info[i])==TRUE) {
    df$shape[i] <- "circle" 
  } else if (grepl("circle",df$info[i])==FALSE) {
    df$shape[i]<-NA}
}
for (i in 1:dim(df)[1]){
  if (grepl(";",df$info[i])==TRUE) {
    df$info[i] <- "Multicolored" 
  } else {df$info[i]<-df$info[i]}
}

This code gives me the output:

df
  ids         info  shape
1 101 Multicolored circle
2 103 Multicolored circle
3 122 Multicolored   <NA>
4 102 Multicolored circle
5 213         blue   <NA>
6 170 Multicolored   <NA>

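As written, my code says that an instance like 101 red;circle is multicolored when in fact it is not; it is just red and a circle. What is the proper way to parse this data when "circle" can appear at the beginning, middle, or end of the info column? Any and all suggestions are welcome, thanks!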

3 answers:

Answer 0: (score: 1)

For this type of problem, it probably makes sense to split the strings on ; and then work with the resulting string vectors. For example,

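# Split each info string on ";" (as.character() guards against a factor column)
mystrings <- strsplit(as.character(df$info), ";")
# switch() picks by match count: 0 -> none, 1 -> the match itself, 2 or 3 -> multiple
getStrings <- function(x, s, none = NA_character_, multiple = "Multicolored") {
  switch(sum(x %in% s) + 1, none, x[x %in% s], multiple, multiple)
}
df$shape <- sapply(mystrings, FUN = getStrings, s = c("circle"))
df$color <- sapply(mystrings, FUN = getStrings, s = c("red", "green", "blue"))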

Personally, I find this approach easier than trying to use pure regular expressions and if statements.
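For anyone who wants to run the split-then-classify idea end to end, here is a minimal, self-contained sketch. It rebuilds the example data frame from the question, and it uses a helper named classify() and the label "Multishape", both of which are made up for this illustration and are not part of the answer above. Unlike the switch() version, it makes no assumption about the maximum number of matches.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ids  = c(101, 103, 122, 102, 213, 170),
  info = c("red;circle", "circle;blue", "red;green",
           "circle;red;green", "blue", "red;blue"),
  stringsAsFactors = FALSE
)

# classify() is a hypothetical helper: it returns the single element of x
# found in `known`, NA if none is found, or `multiple` if several are found.
classify <- function(x, known, multiple) {
  hits <- x[x %in% known]
  if (length(hits) == 0) {
    NA_character_
  } else if (length(hits) == 1) {
    hits
  } else {
    multiple
  }
}

# Split on a literal ";" and classify each piece list separately.
parts <- strsplit(df$info, ";", fixed = TRUE)
df$shape <- vapply(parts, classify, character(1),
                   known = "circle", multiple = "Multishape")
df$color <- vapply(parts, classify, character(1),
                   known = c("red", "green", "blue"),
                   multiple = "Multicolored")
print(df)
```

vapply() is used here instead of sapply() so that the result is guaranteed to be a character vector, even if some row produced an unexpected value.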

Answer 1: (score: 0)

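I like @farnsy's answer, but I wanted to post my solution, which is basically similar but doesn't require you to specify the colors (it assumes everything that isn't a shape is a color).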

# Load the data
df <- read.table(textConnection('ids             info
1 101       red;circle
2 103      circle;blue
3 122        red;green
4 102 circle;red;green
5 213             blue
6 170         red;blue'),stringsAsFactors=FALSE)

# Split your column.
split.col <- strsplit(df$info,';')
# Specify which words are considered shapes.
shapes <- c('circle') # Could include more
# Find which rows had shapes.
df$shape <- sapply(split.col, function(x) x[x %in% shapes][1]) # First shape found, or NA if none
# The rest must be colours, count them.
num.colours <- sapply(split.col, function(x) length(setdiff(x, shapes)))
df$multicoloured <- num.colours > 1

df
#   ids             info  shape multicoloured
# 1 101       red;circle circle         FALSE
# 2 103      circle;blue circle         FALSE
# 3 122        red;green   <NA>          TRUE
# 4 102 circle;red;green circle          TRUE
# 5 213             blue   <NA>         FALSE
# 6 170         red;blue   <NA>          TRUE
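
If a single colour column with a "Multicolored" label is wanted (as the question asks) rather than a TRUE/FALSE flag, the same setdiff() idea can be extended. This is only a sketch under the same assumption that every non-shape word is a colour; the variable names colours and colour are introduced here for illustration.

```r
# Example info strings from the question.
info <- c("red;circle", "circle;blue", "red;green",
          "circle;red;green", "blue", "red;blue")
shapes <- c("circle")

# Split on a literal ";" and drop the shape words; what is left are colours.
split.col <- strsplit(info, ";", fixed = TRUE)
colours <- lapply(split.col, setdiff, y = shapes)

# Collapse each colour set into one label.
colour <- vapply(colours, function(x) {
  if (length(x) == 0) {
    NA_character_        # no colour given
  } else if (length(x) == 1) {
    x                    # exactly one colour
  } else {
    "Multicolored"       # two or more distinct colours
  }
}, character(1))

print(colour)
```

Note that setdiff() also removes duplicates, so a row such as "red;red" would count as a single colour in this sketch.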

Answer 2: (score: 0)

You could also try:

pat1 <- paste0(c("red", "blue", "green"), collapse = "|")
shape1 <- gsub(paste(pat1, ";", sep = "|"), "", df$info)
shape1[shape1 == ""] <- NA
df[, c("info", "shape")] <- as.data.frame(
  do.call(rbind, Map(`c`,
    lapply(regmatches(df$info, gregexpr(pat1, df$info)),
           function(x) if (length(x) > 1) "Multicolored" else x),
    shape1)),
  stringsAsFactors = FALSE)

df
#   ids         info  shape
# 1 101          red circle
# 2 103         blue circle
# 3 122 Multicolored   <NA>
# 4 102 Multicolored circle
# 5 213         blue   <NA>
# 6 170 Multicolored   <NA>
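
To see what the regular-expression steps produce before they are bound back into the data frame, the intermediate values can be inspected on a few strings. This is a small sketch on a made-up three-element vector, not the full data; the names pat, found, n_colours and shape are chosen here for illustration.

```r
# Three sample strings taken from the question's data.
info <- c("red;circle", "red;green", "blue")
pat <- "red|blue|green"

# gregexpr() locates every colour match; regmatches() extracts them,
# giving one character vector of colours per input string.
found <- regmatches(info, gregexpr(pat, info))
n_colours <- vapply(found, length, integer(1))

# Removing colours and separators leaves only the shape (or an empty string).
shape <- gsub(paste(pat, ";", sep = "|"), "", info)
shape[shape == ""] <- NA

print(found)
print(n_colours)
print(shape)
```

Because the pattern has no word boundaries, it would also match a colour name embedded in a longer word, so this approach assumes the info values contain only the listed colours, shapes, and semicolons.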