My data look like this:
DocID Impact
CCRB-9-569 114;Adaptation - Strategic
CCRB-9-531 173;Nutrient trading
CCRB-9-886
CCRB-9-989
CCRB-9-530 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671 106;Adaptation Responses;98;Climate Change
CCRB-9-570 114;Adaptation - Strategic
CCRB-9-990
CCRB-9-526 98;Climate Change
Ideally, I would like to end up with:
DocID Impact
CCRB-9-569 Adaptation - Strategic
CCRB-9-531 Nutrient trading
CCRB-9-886
CCRB-9-989
CCRB-9-530 Change in Temperature
CCRB-9-530 Extreme weather events
CCRB-9-530 Lower Rainfall
CCRB-9-671 Adaptation Responses
CCRB-9-671 Climate Change
CCRB-9-570 Adaptation - Strategic
CCRB-9-990
CCRB-9-526 Climate Change
I started by trying:
test1=lapply(unlist(strsplit(test$Impact,"\\;")),as.character)
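but this loses the link back to DocID and does not give me a blank for the rows that have no input. I have played around with unlisting the result, tried reworking it, used the cbind.fill function, merge, etc., but I am missing something. It would be fine if the numbers in the Impact column (114, 173, etc.) ended up in the output file, as long as they are assigned to the correct DocID. Thanks for your help.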
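The missing link to DocID can be kept in base R by repeating each DocID once per split piece. The following is a minimal sketch under assumptions: the toy data frame reuses the question's name `test` but holds only three made-up rows copied from the sample, and the numeric codes are left in the output, which the question says is acceptable.

```r
# Toy data mirroring three rows of the question's sample (assumed layout)
test <- data.frame(
  DocID  = c("CCRB-9-569", "CCRB-9-886", "CCRB-9-671"),
  Impact = c("114;Adaptation - Strategic", "",
             "106;Adaptation Responses;98;Climate Change"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row
parts <- strsplit(test$Impact, ";")

# A row with nothing to split yields a zero-length vector;
# substitute "" so that the row is not lost
parts <- lapply(parts, function(p) if (length(p) == 0) "" else p)

# Repeat each DocID once per piece so both columns stay aligned
long <- data.frame(
  DocID  = rep(test$DocID, times = lengths(parts)),
  Impact = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

`lengths()` needs R 3.2.0 or later; on older versions `sapply(parts, length)` does the same job.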
Answer 0 (score: 3)
A similar data.table solution:
library(data.table)
# some dummy data
.data <- data.frame(id = letters[1:5], text = c('12;a-b;34', '', 'a-c', 'a-c;12;12', ''))
# make both columns character, not factor, and make it a data.table
.data <- as.data.table(lapply(.data, as.character))
# for each id, split and return (returning '' if nothing)
.data[, {
  value = unlist(strsplit(text, split = '\\;'))
  if (length(value) == 0) text else value
}, by = id]
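The `length(value) == 0` check matters because splitting an empty string gives a zero-length vector, so the group would contribute no rows without a fallback. A small base R sketch of that behaviour follows; the helper name `keep_or_blank` is made up for illustration and is not part of the answer.

```r
# Splitting an empty string yields no pieces at all
pieces_empty <- unlist(strsplit("", split = ";"))
pieces_full  <- unlist(strsplit("a-c;12;12", split = ";"))
length(pieces_empty)
length(pieces_full)

# Hypothetical helper with the same fallback as the data.table call:
# return the original (empty) text when there is nothing to split
keep_or_blank <- function(text) {
  value <- unlist(strsplit(text, split = ";"))
  if (length(value) == 0) text else value
}

keep_or_blank("")
keep_or_blank("a-c;12;12")
```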
Answer 1 (score: 2)
I could not get @csgillespie's function to do the strsplit properly, so I wrote my own:
library(plyr)  # provides ddply()
foo <- function(x) {
  # strsplit() returns a list; unlist() converts it to a character vector.
  # The regex split pattern can be read as:
  #   "an optional whitespace character and an optional ';',
  #    followed necessarily by one or more digits and a ';'"
  # strsplit() splits on these segments and removes them.
  ivec <- unlist(strsplit(as.character(x), split = "\\s?;?[[:digit:]]+;"))
  # Drop zero-length items, but return "" for DocIDs that have no Impact at all
  if (any(nchar(ivec) > 0)) ivec[nchar(ivec) > 0] else ""
}
out <- ddply(dta, .(DocID), summarise, Impact = foo(Impact))
out
#--------------
DocID Impact
1 CCRB-9-526 Climate Change
2 CCRB-9-530 Change in Temperature
3 CCRB-9-530 Extreme weather events
4 CCRB-9-530 Lower Rainfall
5 CCRB-9-531 Nutrient trading
6 CCRB-9-569 Adaptation - Strategic
7 CCRB-9-570 Adaptation - Strategic
8 CCRB-9-671 Adaptation Responses
9 CCRB-9-671 Climate Change
10 CCRB-9-886
11 CCRB-9-989
12 CCRB-9-990
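To see what the split pattern does before plyr gets involved, it can be run on a single Impact value. This is only a sketch in base R using one string copied from the sample data; the variable names are made up for illustration.

```r
x <- "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall"

# Every "<digits>;" run, with an optional leading space or ";", is a separator
raw <- strsplit(x, split = "\\s?;?[[:digit:]]+;")[[1]]
raw

# The separator at the very start leaves an empty first element; drop it
labels <- raw[nchar(raw) > 0]
labels
```

Because the pattern keys on digits followed by `;`, a label that itself ends in a digit immediately before a `;` would be clipped, so this approach assumes the labels do not end in digits.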
Building the test case (a non-whitespace separator is needed):
dta <- read.table(text="DocID | Impact
CCRB-9-569 | 114;Adaptation - Strategic
CCRB-9-531 | 173;Nutrient trading
CCRB-9-886 |
CCRB-9-989 |
CCRB-9-530 | 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671 | 106;Adaptation Responses;98;Climate Change
CCRB-9-570 | 114;Adaptation - Strategic
CCRB-9-990 |
CCRB-9-526 | 98;Climate Change", header=TRUE, sep="|")
Answer 2 (score: 0)
You can do this fairly easily with the plyr package. First, create some dummy data and load the package:
dd = data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
Impact=c("114;Adaptation - Strategic",
"71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
""), stringsAsFactors=FALSE)
library(plyr)
Next, we create a function that works on the Impact column:
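f = function(i) {
  l = unlist(strsplit(as.character(i), ";"))
  ## An empty string splits to a zero-length vector; return "" so the DocID is kept
  if (length(l) == 0) return("")
  ## Otherwise keep the even-numbered elements (the labels)
  if (length(l) > 1) l = l[seq(2, length(l), by = 2)]
  return(l)
}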
Then we use ddply:
ddply(dd, "DocID", summarise, Impact = f(Impact))
Here we take dd as input, split it by DocID, and apply the function f to each Impact chunk.
Note that my function f assumes you want to split the string on ;
Function logic
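The plyr function "creates" smaller data frames, conditional on the DocID value. I then assume that the Impact value for a particular DocID has the form: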
Number;string;Number;string;Number;string
When we split on ;, we get the vector:
Number, string, Number, string, Number, string
So we just need to select the even elements, i.e.
l[seq(2, length(l), 2)]
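The even-element selection can be checked on one value from the dummy data. The sketch below also shows an alternative that does not depend on position: dropping any piece made up only of digits. That alternative is an assumption added here for comparison, not something the answer itself proposes.

```r
l <- unlist(strsplit("71;Change in Temperature;65;Extreme weather events", ";"))

# Positional selection: relies on the strict Number;string;Number;string layout
by_position <- l[seq(2, length(l), by = 2)]
by_position

# Content-based selection: keep every piece that is not purely digits
by_content <- l[!grepl("^[[:digit:]]+$", l)]
by_content
```

Both give the same labels when the layout holds. If a number were ever missing, the positional version would pick the wrong pieces, while the content-based one would still keep only the labels.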