Extracting a pattern from text in R using a subset of patterns

Time: 2016-04-15 07:48:16

Tags: r pattern-matching text-mining stringr textpattern

I have the following list of codes:

ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')

Now I want to extract, from the sample below, the substring that starts with one of these codes:
consolidated_csv_v2 <- c("pt paid rs-8488/-  remaining amt","Credit Card Sales","ML 2926 VARSHA LAKHANI (AG)","IMRAN KHAN-PW-4798","Deepali Mishra Ah-5564 Tst", "MANJU S-11226 T","SNEHA S-16191","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")

The data spans 477,326 rows.

The expected output is the code followed by the number.

str_extract(consolidated_csv_v2, "AH.*$")

[1] NA           NA           NA           NA           NA           NA          
[7] NA           "AH-5747 AG" "AH-5361 AG" NA

This expression only works for the fixed code "AH". How can I match any of the codes in ccode?

2 Answers:

Answer 0: (score: 2)

We can try

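The pattern is built from ccode; v1 is the character vector of sample strings, whose definition is not shown in this answer.

library(stringr)
pat <- paste0("(?i)\\b(", paste(ccode, collapse="|"),")-.*")
str_extract(v1, pat)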
#[1] NA            NA            NA            NA            "Ah-5564 Tst" NA            "AH-2445 AG"  "AH-5747 AG"  "AH-5361 AG"  "Ah-5564 Tst"
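To see what the `paste0()` call actually builds, here is a minimal sketch with a shortened code list. The names `codes_demo`, `pat_demo`, `x` and `hits` are illustrative and not from the original post, and the sketch uses base R with `perl = TRUE` instead of `str_extract` so that it runs without extra packages.

```r
# Shortened, illustrative code list (the real ccode has 64 entries).
codes_demo <- c("AH", "PW", "ML")

# Join the codes with "|" and wrap them in a group, as the answer does.
pat_demo <- paste0("(?i)\\b(", paste(codes_demo, collapse = "|"), ")-.*")
cat(pat_demo, "\n")  # prints: (?i)\b(AH|PW|ML)-.*

x <- c("SUMIT SETHI AH-5747 AG",
       "Credit Card Sales",
       "Deepali Mishra Ah-5564 Tst")

# (?i) makes the match case-insensitive, so "Ah-" is found as well.
hits <- grepl(pat_demo, x, perl = TRUE)
print(hits)  # TRUE FALSE TRUE
```

Because the codes here are plain letters, they can be pasted into the pattern as they are; codes containing regex metacharacters would need escaping first.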

Answer 1: (score: 2)

I assume you need to extract substrings that start with one of the codes at a word boundary and are followed by a hyphen.

Then, use

 "\\b(?:S|PD|CH|ML|MD|VA|BVI|DB|KD|KE|PW|COL|AD|MET|VP|SI|VR|GAO|LK|RP|PAD|WAN|PWD|PMP|PBR|VN|PPC|NK|K|AH|I|JP|JU|UDZ|CHM|DDN|LN|CL|CLH|DKM|GK|WD|ED|DDK|DLN|DRN|DFD|GZB|DVV|GUR|GGN|ND|HHN|HAS|HYD|HKP|BWF|BBW|BKM|BSN|BL|BIN|ST|KN)-\\w*"

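Here \b is a word boundary, followed by a non-capturing group of code alternatives ((?:...)), then a hyphen (-), and then zero or more alphanumeric or underscore characters (\w*).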
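The role of the word boundary can be checked in isolation. The following is only a sketch under assumptions: `extract_first` is a hypothetical base R helper that mimics the first-match behaviour of `str_extract`, and `codes_demo` is a shortened code list chosen for illustration.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

codes_demo <- c("S", "I", "AH")  # shortened, illustrative code list
alt <- paste(codes_demo, collapse = "|")

with_boundary    <- paste0("\\b(?:", alt, ")-\\w*")
without_boundary <- paste0("(?:", alt, ")-\\w*")

x <- c("SETHI-12", "AH-5747 AG")

# With \b, the trailing "I" of "SETHI" cannot start a match.
res_with <- extract_first(x, with_boundary)
print(res_with)     # NA "AH-5747"

# Without \b, the one-letter code "I" matches inside "SETHI".
res_without <- extract_first(x, without_boundary)
print(res_without)  # "I-12" "AH-5747"
```

The one-letter codes in the list (S, K, I) are what make the boundary matter: without it they can match at the end of any longer word.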

Here is a demo:

> consolidated_csv_v2 <- c("Head Office","(cancelled)","(cancelled)","(cancelled)","Deepali Mishra Ah-5564 Tst", "(cancelled)","SHRUTI BHAGAT AH-2445 AG","SUMIT SETHI AH-5747 AG","SUJATA VORA AH-5361 AG","Deepali Mishra Ah-5564 Tst")
> ccode<-c('S','PD','CH','ML','MD','VA','BVI','DB','KD','KE','PW','COL','AD','MET','VP','SI','VR','GAO','LK','RP','PAD','WAN','PWD','PMP','PBR','VN','PPC','NK','K','AH','I','JP','JU','UDZ','CHM','DDN','LN','CL','CLH','DKM','GK','WD','ED','DDK','DLN','DRN','DFD','GZB','DVV','GUR','GGN','ND','HHN','HAS','HYD','HKP','BWF','BBW','BKM','BSN','BL','BIN','ST','KN')
> reg <- paste0("\\b(?:", paste(ccode, collapse="|"),")-\\w*")
> str_extract(consolidated_csv_v2, reg)
 [1] NA        NA        NA        NA        NA        NA        "AH-2445"
 [8] "AH-5747" "AH-5361" NA       
> 

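Update

"Not all codes are followed by '-'; some are followed by a space, and some have no character in between."

This requirement is rather loose, but we can handle it with lazy dot matching (.*?) after the alternation group, which matches any 0+ characters other than a newline up to the first run of digits (\d+) followed by a word boundary (\b). Use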

reg <- paste0("(?i)\\b(?:", paste(ccode, collapse="|"),").*?\\d+\\b")

See the regex demo.

To make this pattern case-insensitive, just add (?i) before the first \b.
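One caveat with the lazy pattern: nothing forces the code to end before the `.*?`, so a one-letter code such as S can match the first letter of an ordinary word and the match then runs on to the first number. The sketch below shows this and one possible stricter variant that allows only hyphens or whitespace between the code and the digits. The stricter pattern is an assumption about what the asker wants, not part of the original answer; `extract_first`, `codes_demo`, `lazy_pat` and `strict_pat` are illustrative names.

```r
# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

codes_demo <- c("S", "AH", "PW", "ML")  # shortened, illustrative code list
alt <- paste(codes_demo, collapse = "|")

# Pattern from the update: any characters may sit between code and digits.
lazy_pat <- paste0("(?i)\\b(?:", alt, ").*?\\d+\\b")

# Assumed stricter variant: only hyphens/whitespace (or nothing) in between.
strict_pat <- paste0("(?i)\\b(?:", alt, ")[-\\s]*\\d+\\b")

x <- c("SUMIT SETHI AH-5747 AG",
       "ML 2926 VARSHA LAKHANI (AG)",
       "IMRAN KHAN-PW-4798",
       "Credit Card Sales")

res_lazy <- extract_first(x[1], lazy_pat)
print(res_lazy)    # "SUMIT SETHI AH-5747"

res_strict <- extract_first(x, strict_pat)
print(res_strict)  # "AH-5747" "ML 2926" "PW-4798" NA
```

Whether the stricter form is appropriate depends on the real data: it would miss rows where other text sits between the code and the number.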