Remove everything "n" characters after a given pattern in a column in R

Time: 2018-06-28 05:08:43

Tags: r dplyr substring

I want to remove everything in this column that comes more than 3 characters after '18'.

MGL18JUNFUT
NATIONALUM18JUNFUT
NTPC18JUNFUT
ONGC18JUNFUT
PCJEWELLER18JUNFUT
PEL18JUNFUT
PFC18JUNFUT
PIDILITIND18JUNFUT
POWERGRID18JULFUT
PTC18JULFUT
RAYMOND18JULFUT
RBLBANK18JULFUT
RECLTD18JULFUT
RPOWER18JULFUT
MGL18JUN800PE

I want my output to look like this:

MGL18JUN
NATIONALUM18JUN
NTPC18JUN
ONGC18JUN
PCJEWELLER18JUN
PEL18JUN
PFC18JUN
PIDILITIND18JUN
POWERGRID18JUL
PTC18JUL
RAYMOND18JUL
RBLBANK18JUL
RECLTD18JUL
RPOWER18JUL
MGL18JUN

I tried the following code.

output <- sub('(^.*?)18???.*', '' , df$column)

But the output I get is

8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUN800PE

The Excel equivalent of what I want is:

=LEFT(A1, FIND("18",A1,1) +4)

I have tried many other options, such as sub, gregexpr, and substr, but nothing seems to work.

3 answers:

Answer 0 (score: 5)

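We can change the sub call to capture, as a group ((...)), any characters (.*) followed by 18 and then either zero to three characters (.{0,3}) or exactly three characters (.{3}), and replace the whole match with the backreference to the captured group (\\1).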

sub("^(.*18.{0,3}).*", "\\1", df$column)

sub("^(.*18.{3}).*", "\\1", df$column)
#[1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
#[5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
#[9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
#[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       

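Based on the OP's comment, if there are multiple instances of 18, make the leading .* lazy (.*?) so the match stops at the first 18: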

v1 <- "PIDILITIND18JUN1180CE"
sub("^(.*?18.{3}).*", "\\1", v1)
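To make the difference concrete, here is a minimal sketch comparing the greedy and lazy patterns on the same string. The perl = TRUE argument is my addition (not part of the original answer), used so that the quantifiers follow PCRE semantics; the variable names greedy and lazy are only for illustration.

```r
v1 <- "PIDILITIND18JUN1180CE"

# Greedy: .* extends to the LAST "18" that still has three characters after it,
# so the whole string is captured and nothing is removed.
greedy <- sub("^(.*18.{3}).*", "\\1", v1, perl = TRUE)

# Lazy: .*? stops at the FIRST "18", so only "18" plus three characters is kept.
lazy <- sub("^(.*?18.{3}).*", "\\1", v1, perl = TRUE)

print(greedy)  # "PIDILITIND18JUN1180CE"
print(lazy)    # "PIDILITIND18JUN"
```

The greedy version only looks correct on the original data because each of those strings contains a single 18.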

The lazy pattern also works on the original data:

sub("^(.*?18.{3}).*", "\\1", df$column)
#[1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
#[5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
#[9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
#[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       

Data

df <- structure(list(column = c("MGL18JUNFUT", "NATIONALUM18JUNFUT", 
"NTPC18JUNFUT", "ONGC18JUNFUT", "PCJEWELLER18JUNFUT", "PEL18JUNFUT", 
"PFC18JUNFUT", "PIDILITIND18JUNFUT", "POWERGRID18JULFUT", "PTC18JULFUT", 
"RAYMOND18JULFUT", "RBLBANK18JULFUT", "RECLTD18JULFUT", "RPOWER18JULFUT", 
"MGL18JUN800PE")), .Names = "column", class = "data.frame",
row.names = c(NA, 
-15L))

Answer 1 (score: 3)

You can also use stringr::str_extract:

stringr::str_extract(string, "(.*)18\\w{3}")

Logic:

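str_extract extracts the part of each string that matches the regular expression. Here I match everything (.*, where . is any character and * means zero or more of the preceding item) up to 18, followed by three word characters (\w matches letters, digits, and the underscore; {3} means exactly three). Also note that if you want to extract between 1 and 3 characters instead, you can use {m,n}, where m is the minimum number of repetitions and n the maximum. For example, \w{2,3} matches a run of 2 or 3 word characters. You can read more about this with help(regex). I hope this helps.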
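As a small illustration of the {m,n} quantifier, the sketch below contrasts \w{3} with \w{1,3}. It uses base R (regexpr and regmatches with perl = TRUE) rather than stringr so that it runs without extra packages; the input vector and the names exact3 and upto3 are made up for this example.

```r
x <- c("MGL18JUNFUT", "MGL18JU", "MGL18J")

# Exactly three word characters after "18": the two short strings do not match.
# regmatches() silently drops elements that have no match.
exact3 <- regmatches(x, regexpr(".*18\\w{3}", x, perl = TRUE))

# Between one and three word characters after "18": all three strings match.
upto3 <- regmatches(x, regexpr(".*18\\w{1,3}", x, perl = TRUE))

print(exact3)  # "MGL18JUN"
print(upto3)   # "MGL18JUN" "MGL18JU" "MGL18J"
```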

Output:

#> stringr::str_extract(string, "(.*)18\\w{3}")
# [1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
# [5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
# [9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
# [13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN" 

Input:

string <- c("MGL18JUNFUT",
"NATIONALUM18JUNFUT",
"NTPC18JUNFUT",
"ONGC18JUNFUT",
"PCJEWELLER18JUNFUT",
"PEL18JUNFUT",
"PFC18JUNFUT",
"PIDILITIND18JUNFUT",
"POWERGRID18JULFUT",
"PTC18JULFUT",
"RAYMOND18JULFUT",
"RBLBANK18JULFUT",
"RECLTD18JULFUT",
"RPOWER18JULFUT",
"MGL18JUN800PE")

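Edit:

If your data contains more than one 18 and you want to match only up to the first 18, you can stop the greediness of * by following it with the regex character ?, as shown below: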

stringr::str_extract(string, "(.*?)18\\w{3}")

Answer 2 (score: 3)

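Edit: In the comments, the OP said they need the 3 characters after the first occurrence of 18, so I came up with the following regex, which handles that case as well.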

x <- c("MGL18JUNFUT","NATIONALUM18JUNFUT18SHDGUDDG","NTPC18JUNFUT","ONGC18JUNFUT","PCJEWELLER18JUNFUT","PEL18JUNFUT","PFC18JUNFUT","PIDILITIND18JUNFUT","POWERGRID18JULFUT","PTC18JULFUT","RAYMOND18JULFUT","RBLBANK18JULFUT","RECLTD18JULFUT","RPOWER18JULFUT","MGL18JUN800PE")
regmatches(x,regexpr("(.*?)18.{3}",x))

The output is as follows.

> regmatches(x,regexpr("(.*?)18.{3}",x))
 [1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
 [5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
 [9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       
> 

This example uses a vector; you can use a data frame column here as well.

x <- c("MGL18JUNFUT","NATIONALUM18JUNFUT","NTPC18JUNFUT","ONGC18JUNFUT","PCJEWELLER18JUNFUT","PEL18JUNFUT","PFC18JUNFUT","PIDILITIND18JUNFUT","POWERGRID18JULFUT","PTC18JULFUT","RAYMOND18JULFUT","RBLBANK18JULFUT","RECLTD18JULFUT","RPOWER18JULFUT","MGL18JUN800PE")

Here is the code.

regmatches(x,regexpr("^.*18.{3}",x))

The output is as follows.

> regmatches(x,regexpr("^.*18.{3}",x))
 [1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
 [5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
 [9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       
>
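One caveat when applying this to a data frame: regmatches() drops elements that have no match, so its result can be shorter than the column and cannot always be assigned back directly. Below is a minimal sketch that keeps NA for non-matching rows. The data frame, the column names symbol and short, and the "NOYEAR" value are hypothetical, and perl = TRUE is my addition for PCRE quantifier semantics.

```r
df <- data.frame(symbol = c("MGL18JUNFUT", "NOYEAR", "PTC18JULFUT"),
                 stringsAsFactors = FALSE)

# regexpr() returns -1 for rows where the pattern does not match
m <- regexpr("^.*?18.{3}", df$symbol, perl = TRUE)
matched <- m != -1

# Start with NA everywhere, then fill in only the rows that matched
df$short <- NA_character_
df$short[matched] <- regmatches(df$symbol, m)

print(df)
```

With this approach the second row keeps NA in short instead of causing a length mismatch.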