I am working with strings like the following:
ID Col1
------------------------------------------------------------------------------------
11 GLIPIZIDE 10 MG TAB 1 TABLET PO QAM
23 GLIPIZIDE 5 MG TAB 2 TABLETS PO BID
32 GLIPIZIDE TAB PO
12 GLIPIZIDE TAB PO PRN
343 PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3
31 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
44 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3
34 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
38 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
I want to accomplish two things:
1) Store the first word in a new column (Col2)
2) Search for the term "mg", capture the number just before it along with the unit,
and store that in a new column (Col3)
Continuing with the example, the final output should look like this:
ID Col2 Col3
---------------------------------
11 GLIPIZIDE 10 MG
23 GLIPIZIDE 5 MG
32 GLIPIZIDE
12 GLIPIZIDE
343 PIOGLITAZONE 45 MG
31 METFORMIN 500 MG
44 METFORMIN 500 MG
34 METFORMIN 500 MG
38 METFORMIN 500 MG
Any help with this would be much appreciated.
Data
dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID Col1
11 'GLIPIZIDE 10 MG TAB 1 TABLET PO QAM'
23 'GLIPIZIDE 5 MG TAB 2 TABLETS PO BID'
32 'GLIPIZIDE TAB PO'
12 'GLIPIZIDE TAB PO PRN'
343 'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")
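Answer 0 (score: 3)
One approach is to use two regular expressions: 1) capture the first word at the start of the string (^\\w+) and 2) find a number followed by "mg" (\\d+ mg), matched case-insensitively.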
dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID Col1
11 'GLIPIZIDE 10 MG TAB 1 TABLET PO QAM'
23 'GLIPIZIDE 5 MG TAB 2 TABLETS PO BID'
32 'GLIPIZIDE TAB PO'
12 'GLIPIZIDE TAB PO PRN'
343 'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")
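within(dd, {
  # keep the first word: the capture group is written back, every other character is dropped
  col1 <- gsub('(^\\w+)|.', '\\1', Col1)
  # keep the number followed by "mg"; (?i) makes the match case-insensitive
  dose <- gsub('(?i)(\\d+ mg)|.', '\\1', Col1)
})[, c('col1','dose')]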
# col1 dose
# 1 GLIPIZIDE 10 MG
# 2 GLIPIZIDE 5 MG
# 3 GLIPIZIDE
# 4 GLIPIZIDE
# 5 PIOGLITAZONE 45 MG
# 6 METFORMIN 500 MG
# 7 METFORMIN 500 MG
# 8 METFORMIN 500 MG
# 9 METFORMIN 500 MG
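To see why the (pattern)|. idiom in this answer keeps only the wanted text, here is a minimal sketch on single strings copied from the question's data. The names x, first_word, dose, and no_dose are only illustrative and do not come from the answer. Each match is either the capture group, which is written back through \\1, or one other character, for which \\1 is empty, so that character is dropped.

```r
# One sample string taken from the question's data
x <- "PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3"

# The first alternative captures the leading word; '.' consumes everything else
first_word <- gsub('(^\\w+)|.', '\\1', x)

# Same idea for the dose: digits, a space, then "mg" in any letter case
dose <- gsub('(?i)(\\d+ mg)|.', '\\1', x)

# A string without a dose: every character falls through to '.'
no_dose <- gsub('(?i)(\\d+ mg)|.', '\\1', "GLIPIZIDE TAB PO")

print(c(first_word, dose, no_dose))
```

Note that a string with no dose comes back as an empty string rather than NA, which is what produces the blank cells in the output above.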
Answer 1 (score: 1)
Here is an approach using stringi.
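library(stringi)
# extract every match per row: the first word, then the dose if one is present;
# simplify = TRUE returns a character matrix, padded with "" where a row has fewer matches
ss <- stri_extract_all_regex(dd$Col1, "(?i)(^\\w+)|(\\d+ mg)", simplify = TRUE)
setNames(cbind(dd[1], ss), c("ID", "Col2", "Col3"))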
# ID Col2 Col3
# 1 11 GLIPIZIDE 10 MG
# 2 23 GLIPIZIDE 5 MG
# 3 32 GLIPIZIDE
# 4 12 GLIPIZIDE
# 5 343 PIOGLITAZONE 45 MG
# 6 31 METFORMIN 500 MG
# 7 44 METFORMIN 500 MG
# 8 34 METFORMIN 500 MG
# 9 38 METFORMIN 500 MG
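If rows without a dose should hold NA instead of an empty string, one possible base R variant uses regexpr() with regmatches(), which takes only the first match of each pattern per string. This is a sketch under assumptions, not part of either answer: the helper extract_first is hypothetical, and the two-row data frame simply reuses two rows from the question.

```r
# Two rows copied from the question's data, one with a dose and one without
dd <- data.frame(
  ID = c(11, 32),
  Col1 = c("GLIPIZIDE 10 MG TAB 1 TABLET PO QAM", "GLIPIZIDE TAB PO"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  hit <- !is.na(m) & m != -1          # which strings matched at all
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches() returns only the matched pieces
  out
}

dd$Col2 <- extract_first(dd$Col1, "^\\w+")    # first word
dd$Col3 <- extract_first(dd$Col1, "\\d+ mg")  # number followed by "mg"
print(dd[, c("ID", "Col2", "Col3")])
```

Because each pattern is applied separately, a string that happened to contain more than one dose would still yield exactly one value per column.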