My data frame looks like this:
df = data.frame(Entity=c('MM > OSS > EUROPE_lv3 > FRANCElv4 > FRANCElv5 > FRANCElv6 > FR_07_FRANCE > FR_08_FRANCE > FR_09_S50 > FR_10_DVPOL12 > FR_11_DRYPBA > FR_12_RYPOP9000 > FR_13_SX362707 > SO362707',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00556 > 60-59003 > 27747001lv13 > 27747003lv14',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00061 > 60-23617 > 76929001lv13 > 76929017lv14',
'MM > OSS > EUROPE_lv3 > UKIRELAND_13lv4 > UKIRELAND_13lv5 > UKIRELAND_13lv6 > UKIE160000 > UKIE160000_lv8 > UKIE160000_lv9 > UKIE262000 > UKIE362004 > UKIE462006 > UKIE562072 > GB344496',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00065 > 60-22505 > 94276001lv13 > 94276002lv14'))
My goal is:
My attempt
To extract everything after the last instance of >, I tried:
sub("^.+< ", "", df$Entity)
But it does not work as expected.
Any help with solving problems 1) and 2) would be greatly appreciated.
Answer 0 (score: 2)
For the last column, we can try using sub, as follows:
df$last <- sub("^.*>\\s*", "", df$Entity)
For the column between the second and third instances of >:
df$between <- sub("^(?:[^>]+>){2}\\s*([^> ]+).*$", "\\1", df$Entity)
df[, c("last", "between")]
          last     between
1     SO362707  EUROPE_lv3
2 27747003lv14 AMERICA_lv3
3 76929017lv14 AMERICA_lv3
4     GB344496  EUROPE_lv3
5 94276002lv14 AMERICA_lv3
Here is an explanation of the second regular expression:
^               from the start of the input
(?:[^>]+>){2}   match the first two components 'COMPONENT >'
\s*             match optional whitespace
([^> ]+)        then match AND capture the third component
.*              consume the rest of the input until reaching
$               the end of the input
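The pattern explained above can be generalized to any position. The following is only a sketch: `nth_component` is a hypothetical helper that does not appear in the answer, and `perl = TRUE` is an assumption added here so that the non-capturing group is handled by PCRE.

```r
# Hypothetical helper (not part of the original answer): extract the n-th
# component of strings whose components are separated by '>'.
nth_component <- function(x, n) {
  # Skip n - 1 'COMPONENT >' prefixes, then capture the next component.
  pattern <- sprintf("^(?:[^>]+>){%d}\\s*([^> ]+).*$", n - 1L)
  # perl = TRUE is an assumption made for this sketch.
  sub(pattern, "\\1", x, perl = TRUE)
}

x <- c("MM > OSS > EUROPE_lv3 > FRANCElv4", "A > B")
third <- nth_component(x, 3)
print(third)
```

Note that when a string has fewer than n components the pattern does not match, so sub() returns that string unchanged rather than NA. The second element above shows this.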
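Answer 1 (score: 1)
You can always split on '>' and then extract the elements you want to keep.
Limitations: this uses more memory and assumes every string contains the same number of '>'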
data.table:
library(data.table)
setDT(df)
df[, tstrsplit(Entity, ' > ')][, .(two2three = V3, last = V14)]
#      two2three         last
# 1:  EUROPE_lv3     SO362707
# 2: AMERICA_lv3 27747003lv14
# 3: AMERICA_lv3 76929017lv14
# 4:  EUROPE_lv3     GB344496
# 5: AMERICA_lv3 94276002lv14
Base R:
df$Entity <- as.character(df$Entity)
setNames(
as.data.frame(
do.call(rbind, lapply(strsplit(df$Entity, ' > '), '[', c(3, 14)))
), c('two2three', 'last'))
#     two2three         last
# 1  EUROPE_lv3     SO362707
# 2 AMERICA_lv3 27747003lv14
# 3 AMERICA_lv3 76929017lv14
# 4  EUROPE_lv3     GB344496
# 5 AMERICA_lv3 94276002lv14
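The equal-count limitation mentioned in this answer might be relaxed by indexing each split result separately instead of hard-coding position 14. This is a minimal base R sketch under that assumption. The vector `x` and the names `parts`, `third` and `last_part` are made up for illustration.

```r
# Illustrative input with a different number of '>' in each string.
x <- c("A > B > C > D", "A > B")

# fixed = TRUE treats ' > ' as a literal separator, not a regex.
parts <- strsplit(x, " > ", fixed = TRUE)

# Third component; indexing past the end of a vector yields NA.
third <- vapply(parts, function(p) p[3], character(1))

# Last component, whatever the length of each split.
last_part <- vapply(parts, function(p) p[length(p)], character(1))

print(data.frame(two2three = third, last = last_part))
```

Strings with fewer than three components get NA in the first column, which may be easier to spot than a silently recycled or misaligned value.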