My input looks like this:
x
nct_id drug
1 NCT100 paracetomol+velacade
2 NCT123 bortezomib
3 NCT145 velacade
4 NCT645 velacade,dexamethaone
5 NCT768 bortezomib||velacde
6 NCT890 velacade\\bortezomib\\ethonisde
I use the following code to split column 2 on several delimiters at once:
y2<-strsplit(x[,2],split="[||,,,\\,+]")
> y2
[[1]]
[1] "paracetomol" "velacade"
[[2]]
[1] "bortezomib"
[[3]]
[1] "velacade"
[[4]]
[1] "velacade" "dexamethaone"
[[5]]
[1] "bortezomib" "" "velacde"
[[6]]
[1] "velacade" "bortezomib" "ethonisde"
I get an extra empty string in element 5. How can I avoid it?
Answer 0 (score: 3)
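You can also fix this by modifying the regular expression. I added a second backslash to escape the first one and, to address your problem directly, added a "+" that tells the regex engine to match as many adjacent characters from the character class "[\\|,+]" as possible, so a run of consecutive delimiters counts as a single split point.
Note that I wrapped the drug variable in as.character because it is a factor: read.table converts strings to factors by default (in R versions before 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE and the wrapper is harmless).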
strsplit(as.character(df$drug), split="[\\|,+]+")
[[1]]
[1] "paracetomol" "velacade"
[[2]]
[1] "bortezomib"
[[3]]
[1] "velacade"
[[4]]
[1] "velacade" "dexamethaone"
[[5]]
[1] "bortezomib" "velacde"
[[6]]
[1] "velacade" "bortezomib" "ethonisde"
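To see what the "+" changes, here is a minimal sketch on the one problematic string. The first two calls use only the pipe, comma and plus delimiters to keep the example simple. The last call is an alternative spelling, not the pattern used above: it assumes perl = TRUE so that the backslash inside the class follows PCRE escaping rules.

```r
s <- "bortezomib||velacde"

# Each "|" is its own split point, so the empty string between the two pipes is kept.
without_plus <- strsplit(s, split = "[|,+]")[[1]]

# "+" after the class matches the whole run "||" as a single delimiter.
with_plus <- strsplit(s, split = "[|,+]+")[[1]]

# Assumption: PCRE variant. "\\\\" in the R string is the regex "\\",
# i.e. one literal backslash inside the character class.
s6 <- "velacade\\bortezomib\\ethonisde"  # the string itself holds single backslashes
perl_split <- strsplit(s6, split = "[\\\\|,+]+", perl = TRUE)[[1]]

print(without_plus)
print(with_plus)
print(perl_split)
```

If changing the pattern is not an option, dropping the empty strings afterwards with something like `lapply(y2, function(v) v[nzchar(v)])` should give the same result.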
Data
df <- read.table(header=TRUE, text="nct_id drug
1 NCT100 paracetomol+velacade
2 NCT123 bortezomib
3 NCT145 velacade
4 NCT645 velacade,dexamethaone
5 NCT768 bortezomib||velacde
6 NCT890 velacade\\bortezomib\\ethonisde")
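The as.character wrapper matters because strsplit only accepts character vectors. The sketch below illustrates this on a small hypothetical stand-in for the drug column (the names `drug`, `res`, `parts` and `df2` are made up for the example) and shows one possible way to avoid factors when reading the data.

```r
# Hypothetical stand-in for the drug column, stored as a factor
drug <- factor(c("paracetomol+velacade", "bortezomib"))

# strsplit() rejects a factor, so the error is caught here instead of stopping the script
res <- tryCatch(strsplit(drug, split = "[|,+]+"),
                error = function(e) "error")

# Converting to character first works
parts <- strsplit(as.character(drug), split = "[|,+]+")

# Alternative: ask read.table() to keep strings as character vectors
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "nct_id drug
1 NCT100 paracetomol+velacade
2 NCT123 bortezomib")

print(res)
print(parts)
print(is.character(df2$drug))
```

With stringsAsFactors = FALSE the column can be passed to strsplit directly, without the conversion.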
Answer 1 (score: 2)
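We can use str_extract_all from the stringr package to extract the runs of letters instead of splitting on the delimiters: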
library(stringr)
str_extract_all(x$drug, "[A-Za-z]+")
#[[1]]
#[1] "paracetomol" "velacade"
#[[2]]
#[1] "bortezomib"
#[[3]]
#[1] "velacade"
#[[4]]
#[1] "velacade" "dexamethaone"
#[[5]]
#[1] "bortezomib" "velacde"
#[[6]]
#[1] "velacade" "bortezomib" "ethonisde"
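If adding a package is not desirable, the same extraction idea can be sketched in base R with gregexpr and regmatches. This is an illustrative equivalent rather than part of the answer above, and the vector `drug` is a hypothetical subset of the data. Like the stringr version, it assumes drug names consist only of ASCII letters, so a name containing digits or hyphens would be broken into pieces.

```r
# Hypothetical subset of the drug column as a plain character vector
drug <- c("paracetomol+velacade",
          "bortezomib||velacde",
          "velacade\\bortezomib\\ethonisde")

# gregexpr() locates every run of letters; regmatches() extracts the matched text
tokens <- regmatches(drug, gregexpr("[A-Za-z]+", drug))

print(tokens)
```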