R: split delimited strings and link them to another column

Asked: 2012-08-10 05:08:48

Tags: string r

My data looks like this:

DocID             Impact
CCRB-9-569  114;Adaptation - Strategic
CCRB-9-531  173;Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671  106;Adaptation Responses;98;Climate Change
CCRB-9-570  114;Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  98;Climate Change

Ideally, I would like to end up with:

DocID             Impact
CCRB-9-569  Adaptation - Strategic
CCRB-9-531  Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  Change in Temperature
CCRB-9-530  Extreme weather events
CCRB-9-530  Lower Rainfall
CCRB-9-671  Adaptation Responses
CCRB-9-671  Climate Change
CCRB-9-570  Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  Climate Change

I started by trying

test1=lapply(unlist(strsplit(test$Impact,"\\;")),as.character)

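but that gives me no way to link the pieces back to DocID, and rows with no entry do not get a blank. I have played around with unlisting the result and then trying to reassemble it with a cbind.fill function, merge, and so on, but I am missing something. It is fine if the numbers in the Impact column (114, 173, etc.) end up in the output file, as long as they are assigned to the correct DocID. Thanks for your help.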

3 Answers:

Answer 0: (score: 3)

A similar data.table solution

library(data.table)
# some dummy data
.data <- data.frame(id = letters[1:5], text = c('12;a-b;34','','a-c','a-c;12;12',''))
# make both columns character, not factor, and make it a data.table
.data <- as.data.table(lapply(.data, as.character))
# for each id, split on ';' and return the pieces (returning '' if there is nothing to split)
.data[, { value = unlist(strsplit(text, split = '\\;'))
          if (length(value) == 0) text else value },
        by = id]
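The `if (length(value) == 0)` branch is there because splitting an empty string yields a zero-length result, so that id would otherwise contribute no rows. A minimal base R sketch of that behaviour follows; the helper name `split_or_blank` is hypothetical and only mirrors the body of the expression above.

```r
# Hypothetical helper mirroring the body of the data.table expression above
split_or_blank <- function(text) {
  value <- unlist(strsplit(text, split = ";", fixed = TRUE))
  # Splitting "" gives a zero-length vector; fall back to the original (empty) text
  if (length(value) == 0) text else value
}

empty_pieces <- unlist(strsplit("", split = ";", fixed = TRUE))
length(empty_pieces)          # nothing comes back for an empty string
split_or_blank("")            # falls back to the empty string itself
split_or_blank("a-c;12;12")   # one element per ';'-separated piece
```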

Answer 1: (score: 2)

I could not get @csgillespie's function to do the strsplit correctly, so I wrote my own:

 library(plyr)

 foo <- function(x) {
   # unlist() converts the list returned by strsplit() into a vector.
   # The regex split pattern can be read as:
   #   "find any section optionally starting with a space and/or a ';',
   #    followed necessarily by one or more digits and a ';'".
   # strsplit() splits on these segments and removes them.
   ivec <- unlist(strsplit(as.character(x), split = "\\s?;?[[:digit:]]+;"))

   # Remove zero-length items, except for the DocIDs that have no Impact at all
   if (any(nchar(ivec) > 0)) ivec[nchar(ivec) > 0] else ""
 } # end of function

 out <- ddply(dta, .(DocID), summarise, Impact = foo(Impact))
 out
#--------------
         DocID                 Impact
1  CCRB-9-526          Climate Change
2  CCRB-9-530   Change in Temperature
3  CCRB-9-530  Extreme weather events
4  CCRB-9-530          Lower Rainfall
5  CCRB-9-531        Nutrient trading
6  CCRB-9-569  Adaptation - Strategic
7  CCRB-9-570  Adaptation - Strategic
8  CCRB-9-671    Adaptation Responses
9  CCRB-9-671          Climate Change
10 CCRB-9-886                        
11 CCRB-9-989                        
12 CCRB-9-990                        
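To see what the split pattern in `foo` does, it can be applied to a single Impact value by hand. This is only an illustrative sketch; the variable names `x`, `pieces` and `kept` are not part of the answer.

```r
# One Impact value from the question, with the leading space left by read.table(sep = "|")
x <- " 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall"

# Split on "optional space, optional ';', digits, ';'"
pieces <- unlist(strsplit(x, split = "\\s?;?[[:digit:]]+;"))
pieces   # the match at the very start of the string leaves an empty first element

# Dropping zero-length elements keeps only the descriptions
kept <- pieces[nchar(pieces) > 0]
kept
```

The empty first element is why `foo` filters on `nchar(ivec) > 0` before returning.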

Building the test case (a non-whitespace separator is needed):

dta <- read.table(text="DocID     |        Impact
 CCRB-9-569 | 114;Adaptation - Strategic
 CCRB-9-531 | 173;Nutrient trading
 CCRB-9-886 | 
 CCRB-9-989 | 
 CCRB-9-530 | 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
 CCRB-9-671 | 106;Adaptation Responses;98;Climate Change
 CCRB-9-570 | 114;Adaptation - Strategic
 CCRB-9-990 | 
 CCRB-9-526 | 98;Climate Change", header=TRUE, sep="|")

Answer 2: (score: 0)

You can do this fairly easily with the plyr package. First, create some dummy data and load the package:

dd = data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
                 Impact=c("114;Adaptation - Strategic", 
     "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
                          ""), stringsAsFactors=FALSE)
library(plyr)

Next, we create a function that works on the Impact column:

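f = function(i) {
    l = unlist(strsplit(as.character(i), ";"))
    ## An empty string splits to a zero-length vector; return "" so the DocID keeps its row
    if (length(l) == 0) return("")
    ## Otherwise keep the even-numbered elements (the text after each number)
    if (length(l) > 1) l = l[seq(2, length(l), by = 2)]
    return(l)
}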

Then we use ddply:

ddply(dd, "DocID", summarise, Impact = f(Impact))

Here we take dd as input, split it by DocID, and apply the function f to each Impact chunk.
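For comparison, the same split-by-DocID step can be sketched without plyr. This is a base R alternative rather than the approach of this answer, and the names `parts` and `res` are purely illustrative. Unlike ddply, it keeps the rows in their original order.

```r
dd <- data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
                 Impact = c("114;Adaptation - Strategic",
                            "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
                            ""),
                 stringsAsFactors = FALSE)

# Split each Impact on ";" and keep the even-numbered pieces (the descriptions);
# an Impact with no description becomes a single "" so its DocID still gets one row
parts <- lapply(strsplit(dd$Impact, ";", fixed = TRUE), function(p) {
  desc <- p[seq_along(p) %% 2 == 0]
  if (length(desc) == 0) "" else desc
})

# Repeat each DocID once per piece so every piece stays linked to its document
res <- data.frame(DocID = rep(dd$DocID, lengths(parts)),
                  Impact = unlist(parts),
                  stringsAsFactors = FALSE)
res
```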


Note that my function f assumes you want to split the string on ;

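Function logic

The plyr function "creates" smaller data frames, conditioned on the DocID value. I then assume that the Impact value for a particular DocID has the format: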

 Number;string;Number;string;Number;string

When we split on ;, we get the vector:

Number, string, Number, string, Number, string

So we just need to select the even-numbered elements, i.e.

l[seq(2, length(l), 2)]
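As a quick worked check of that indexing, here is a sketch on one Impact value from the question. It assumes the strictly alternating Number;string format described above; the name `descriptions` is illustrative.

```r
l <- unlist(strsplit("71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall", ";"))
l   # numbers sit in the odd positions, descriptions in the even positions

# Keep positions 2, 4, 6
descriptions <- l[seq(2, length(l), 2)]
descriptions
```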