Grep for certain characters in any order, with any of them uppercase

Date: 2016-10-06 23:13:07

Tags: regex grep

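I need to grep for the characters "james". The catch is that they can appear in any order, and any of them may be uppercase. For example, the following all need to be found: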

  • Aemjs
  • emaJS
  • SeMaJ

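These are just a few of the possibilities; there are obviously many more combinations.

If possible, this needs to be done with a single grep statement. The grep search is run inside the EnCase software, not on a Unix machine, and the input accepts only one grep expression. Is it possible to do this in one statement?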

4 answers:

Answer 0 (score: 4)

Here is a fun regex:

/ (?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5} /i

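What this does is make sure that every character of the string "james" is matched somewhere within the next 5 characters; the i modifier makes the match case-insensitive.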

Demo on Regex101

Put into grep, you get something like:

grep -Pi " (?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5} " $file

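$file is the file you want grep to search. Note that the -P flag requires GNU grep and says that the pattern is a Perl-style regex (native grep regexes, even with -E, do not support lookahead). The -i flag makes the match case-insensitive.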
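For readers without GNU grep at hand, the same pattern can be tried in Python, whose `re` module also supports lookahead. This is only a sketch for checking the idea; the helper name `has_scramble` and the sample strings are made up for illustration.

```python
import re

# The answer's pattern, including the literal spaces that delimit the word.
PATTERN = re.compile(
    r" (?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5} ",
    re.IGNORECASE,
)

def has_scramble(line):
    """Return True if line contains a space-delimited scramble of 'james'."""
    return PATTERN.search(line) is not None

for sample in (" Aemjs ", " emaJS ", " SeMaJ ", " jamie ", " jjmes "):
    print(repr(sample), has_scramble(sample))
```

The five lookaheads each demand one distinct letter within the next five characters, and `.{5}` consumes exactly five, so the five characters must be a permutation of the five letters. Because the pattern starts and ends with a literal space, a scramble at the very start or end of a line is not matched.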

Answer 1 (score: 2)

A lookahead for each letter that must appear in the word, 5 letters, all wrapped in word boundaries:

/\b(?=\w*j)(?=\w*a)(?=\w*m)(?=\w*e)(?=\w*s)\w{5}\b/i
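One way to read the description above (a lookahead per letter, exactly five word characters, word boundaries on both sides) is sketched here in Python. The pattern is a reading of that description rather than necessarily the answerer's exact regex, and `find_word_scrambles` is a hypothetical helper name.

```python
import re

# One lookahead per required letter, then exactly five word characters,
# anchored on both sides by word boundaries.
WORD_SCRAMBLE = re.compile(
    r"\b(?=\w*j)(?=\w*a)(?=\w*m)(?=\w*e)(?=\w*s)\w{5}\b",
    re.IGNORECASE,
)

def find_word_scrambles(text):
    """Return every whole word in text that is a scramble of 'james'."""
    return WORD_SCRAMBLE.findall(text)

print(find_word_scrambles("Hey Aemjs, meet emaJS and SeMaJ; not jameson."))
```

Unlike the space-delimited version, this also finds scrambles at the start or end of a line or next to punctuation, while still rejecting longer words such as "jameson".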

Answer 2 (score: 1)

Let's hack up a small grep program in TXR Lisp that highlights matches with carets:

#!/usr/local/bin/txr --lisp
(let ((regex (regex-compile (first *args*))))
  (whilet ((line (get-line)))
    (whenlet ((mlist (rra regex line))) ;; rra: regex ranges all
      (put-line line)
      (let ((carets (mkstring (to (find-max mlist)) #\space)))
        (mapdo (op mapdo (do set [carets @1] #\^) (range* (from @1) (to @1)))
               mlist)
        (put-line carets)))))

Running it (uppercase only; add the lowercase letters to handle lowercase):

$ chmod a+x txgrep
$ ./txgrep '.....&.*J.*&.*A.*&.*M.*&.*E.*&.*S.*'
JAMES
JAMES
^^^^^
Hey there, JAMES; meet AMSEJAMS.
Hey there, JAMES; meet AMSEJAMS.
           ^^^^^       ^^^^^
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
            ^^^^^  ^^^^^
J
AJAMES
AJAMES
 ^^^^^
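A rough Python analogue of this caret-underlining run is sketched below. Python's `re` has no `&` (intersection) operator, so the sketch swaps in the lookahead technique from the first answer; `scramble_pattern` and `underline` are hypothetical names, and the construction assumes the word has no repeated letters.

```python
import re

def scramble_pattern(word, ignore_case=False):
    """Lookahead-based regex for any permutation of word (distinct letters assumed)."""
    lookaheads = "".join(
        "(?=.{0,%d}%s)" % (len(word) - 1, re.escape(ch)) for ch in word
    )
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(lookaheads + ".{%d}" % len(word), flags)

def underline(line, pattern):
    """Return a string of carets under each non-overlapping match in line."""
    carets = [" "] * len(line)
    for m in pattern.finditer(line):
        for i in range(m.start(), m.end()):
            carets[i] = "^"
    return "".join(carets).rstrip()

pattern = scramble_pattern("JAMES")
for line in ("JAMES", "Hey there, JAMES; meet AMSEJAMS.", "J", "AJAMES"):
    marks = underline(line, pattern)
    if marks:
        print(line)
        print(marks)
```

Like the TXR session, this reports leftmost non-overlapping matches, so the tail of AMSEJAMS gets a single underline even though several overlapping windows qualify.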

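The regex simply denotes the set of strings that are five characters long (.....) AND that contain J (.*J.*) AND that likewise contain A, and so on.

If a letter is repeated in the word and all occurrences must be present, as in, say, DOLLY, it goes like this: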

.....&.*D.*&.*O.*&.*L.*L.*&.*Y.*

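The two L's are covered by the single term .*L.*L.*, which matches the set of strings containing at least two L's. If we have at least two L's, at least one D, at least one O and at least one Y, and the length is five characters, we must have a scramble of DOLLY.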
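The intersection described above ("five characters long AND at least this many of each letter") amounts to comparing letter counts. A minimal Python sketch of that reading, using `collections.Counter` instead of a regex, follows; `scramble_starts` is a hypothetical name, and unlike the grep-style output it reports every qualifying window, overlapping ones included.

```python
from collections import Counter

def scramble_starts(text, word, ignore_case=True):
    """Start indexes of every window of len(word) whose letter counts equal word's."""
    if ignore_case:
        text, word = text.upper(), word.upper()
    target = Counter(word)
    n = len(word)
    return [
        i for i in range(len(text) - n + 1)
        if Counter(text[i:i + n]) == target
    ]

print(scramble_starts("Hey there, JAMES; meet AMSEJAMS.", "JAMES"))
print(scramble_starts("a LOYDL b", "DOLLY"))
print(scramble_starts("DOLYY", "DOLLY"))
```

Counting letters handles repeated letters such as the two L's in DOLLY without any special casing, which is the part the per-letter lookahead patterns do not cover on their own.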

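Starting from .....&.*J.*&.*A.*&.*M.*&.*E.*&.*S.*, can we do some algebra to eliminate the exotic & operator? If we could algebraically reduce it to a manageable plain regex using branches, concatenation and so on, we could use ordinary tools (I mean no silly Perl extensions or anything: just good old NFA regexes).

A big conjunction immediately suggests De Morgan's law (A & B <==> ~(~A | ~B)), which brings in negation. Can we then get rid of the negation?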

--> ~(~.....|~.*J.*|~.*A.*|~.*M.*|~.*E.*|~.*S.*)

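Aha! First, ~..... means "all strings that are not exactly five characters long". That is easy: it is just the set containing the empty string, all one-character strings, all two-character strings, and so on, skipping the five-character strings and resuming with the six-character strings and longer. We can easily express ~..... as: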

(|.|..|...|....|......+)

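Next, the set denoted by ~.*J.* is simply the set of strings that do not contain J. A simple character class! It is just [^J]*. OK, so we can substitute that into our regex, and all we are left with is one big negation: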

~(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)

By the way, let's check that this still works:

$ ./txgrep '~(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)'
JAMES
JAMES
^^^^^
Hey there, JAMES, meet AMSEJAMS.
Hey there, JAMES, meet AMSEJAMS.
           ^^^^^       ^^^^^
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
            ^^^^^  ^^^^^
J
AJAMES
AJAMES
 ^^^^^

Evidently, yes. Phew!
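The De Morgan rewrite can also be checked by brute force outside TXR. Python's `re` has no `~` operator, so this sketch treats "the union does not match the whole string" as the complement. The test alphabet "JAMESX" and the length limit of 6 are arbitrary assumptions chosen to keep the enumeration small (about 56,000 strings).

```python
import re
from itertools import permutations, product

# The union inside the big negation, as an ordinary regex.
UNION = re.compile(r"(?:|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)")

def complement_matches(alphabet="JAMESX", max_len=6):
    """All strings over alphabet, up to max_len long, that the union rejects."""
    accepted = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if UNION.fullmatch(s) is None:  # rejected by the union = in the complement
                accepted.add(s)
    return accepted

expected = {"".join(p) for p in permutations("JAMES")}
print(complement_matches() == expected, len(expected))
```

Within that search space, the strings rejected by the union are exactly the 120 permutations of JAMES, which agrees with the txgrep run.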

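Where can we go from here?

Here is a crazy idea: suppose we allow ourselves word anchoring. Could we then use grep -v to find the lines that contain candidate words?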

$ grep -v -E '\<(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)\>'

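Unfortunately, that is no better than line anchoring: it only works when there is one word per line. It finds lines that are exactly JAMES, MAJSE and so on, but nothing else.

Of course, we know that there is a plain classic regex matching the set we want, since the set is just the finite set of permutations of one word; it is only that the obvious regex for it is too large.

Or is it? Not really: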

grep -E -i "JAMES|JAMSE|JAEMS|JAESM|JASME|JASEM|JMAES|JMASE|JMEAS|JMESA|JMSAE|JMSEA|JEAMS|JEASM|JEMAS|JEMSA|JESAM|JESMA|JSAME|JSAEM|JSMAE|JSMEA|JSEAM|JSEMA|AJMES|AJMSE|AJEMS|AJESM|AJSME|AJSEM|AMJES|AMJSE|AMEJS|AMESJ|AMSJE|AMSEJ|AEJMS|AEJSM|AEMJS|AEMSJ|AESJM|AESMJ|ASJME|ASJEM|ASMJE|ASMEJ|ASEJM|ASEMJ|MJAES|MJASE|MJEAS|MJESA|MJSAE|MJSEA|MAJES|MAJSE|MAEJS|MAESJ|MASJE|MASEJ|MEJAS|MEJSA|MEAJS|MEASJ|MESJA|MESAJ|MSJAE|MSJEA|MSAJE|MSAEJ|MSEJA|MSEAJ|EJAMS|EJASM|EJMAS|EJMSA|EJSAM|EJSMA|EAJMS|EAJSM|EAMJS|EAMSJ|EASJM|EASMJ|EMJAS|EMJSA|EMAJS|EMASJ|EMSJA|EMSAJ|ESJAM|ESJMA|ESAJM|ESAMJ|ESMJA|ESMAJ|SJAME|SJAEM|SJMAE|SJMEA|SJEAM|SJEMA|SAJME|SAJEM|SAMJE|SAMEJ|SAEJM|SAEMJ|SMJAE|SMJEA|SMAJE|SMAEJ|SMEJA|SMEAJ|SEJAM|SEJMA|SEAJM|SEAMJ|SEMJA|SEMAJ"

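Case insensitivity in grep (a POSIX feature) takes care of most of the combinatorial explosion. In fact, without -i, all we would have to do is replace each letter, such as J, with a class like [Jj]. That makes our pattern several times larger, but still reasonably manageable.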
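The size claims are easy to check with a short script. This Python sketch generates the plain alternation and the variant with a [Xx] class per letter; the function names are made up for illustration, and the word is assumed to have no repeated letters.

```python
import re
from itertools import permutations

def plain_alternation(word):
    """All permutations of word joined with '|' (distinct letters assumed)."""
    return "|".join("".join(p) for p in permutations(word))

def with_case_classes(regex):
    """Replace every letter X with the class [Xx], leaving '|' untouched."""
    return "".join(
        "[%s%s]" % (c.upper(), c.lower()) if c.isalpha() else c for c in regex
    )

plain = plain_alternation("JAMES")
classed = with_case_classes(plain)
print(len(plain), len(classed))              # 719 2519
print(bool(re.fullmatch(classed, "sEmAj")))  # True
```

The class-based version is about three and a half times as long as the plain alternation, which is what "several times larger" amounts to here.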

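Starting from the above, we can fold common prefixes together to compress it. For example, several scrambles of JAMES, six in fact, are matched by: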

J(A(M(ES|SE)|E(MS|SM)|S(ME|EM)))

This is somewhat shorter than writing out all six in full:

JAMES|JAMSE|JAEMS|JAESM|JASME|JASEM

Now note that the part generating these six suffix permutations, M(ES|SE)|E(MS|SM)|S(ME|EM), can be factored out and applied to the prefix JA as well as to AJ:

(AJ|JA)(M(ES|SE)|E(MS|SM)|S(ME|EM))

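Look, we now have 12 matches. That is already 10% of the permutation space. There is a pattern here: we have taken a particular permutation of the string, namely JAMES, and split it (arbitrarily) into two pieces, namely JA and MES. We then permuted the pieces separately, giving (AJ|JA) (two ways) and M(ES|SE)|E(MS|SM)|S(ME|EM) (six ways), and concatenated them, which gives twelve.

Can't we just repeat this ten times and get all 120? Of course we can! There are ten ways to choose two letters from a set of 5: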

JA,JM,JE,JS,AM,AE,AS,ME,MS,ES

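Each of these can match in two ways, so there are twenty possibilities. Each of those combines with the six permutations of the remaining letters: 20 * 6 = 120. For each digraph, we write down a regex following the pattern above, which covers 12 of the 120: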

(AJ|JA)(M(ES|SE)|E(MS|SM)|S(ME|EM))
(JM|MJ)(A(ES|SE)|E(AS|SA)|S(EA|AE))
... eight more

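Join these with '|' and you're done. The length is 10 * 35 + 9 = 359 characters.

This is much smaller than the original uncompressed regex, which is 719 characters long.

In fact, 719 is 2 * 359 + 1. :)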
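The digraph construction is mechanical enough to script. The following Python sketch builds the ten branches and checks both the length and the coverage; `suffix_group` and `digraph_regex` are hypothetical names, the sketch assumes a five-letter word with distinct letters, and the order of letters within each group may differ from the hand-written examples above.

```python
import re
from itertools import combinations, permutations, product

def suffix_group(letters):
    """For 'MES' build (M(ES|SE)|E(MS|SM)|S(ME|EM))."""
    parts = []
    for first in letters:
        a, b = [c for c in letters if c != first]
        parts.append("%s(%s%s|%s%s)" % (first, a, b, b, a))
    return "(" + "|".join(parts) + ")"

def digraph_regex(word):
    """One branch per 2-letter combination: both orders, then the 3-letter suffixes."""
    branches = []
    for a, b in combinations(word, 2):
        rest = "".join(c for c in word if c not in (a, b))
        branches.append("(%s%s|%s%s)" % (a, b, b, a) + suffix_group(rest))
    return "|".join(branches)

rx = digraph_regex("JAMES")
matched = {
    "".join(s) for s in product("JAMES", repeat=5) if re.fullmatch(rx, "".join(s))
}
print(len(rx), len(matched))  # 359 120
```

Enumerating all 3125 five-letter strings over J, A, M, E, S shows that the 359-character regex accepts exactly the 120 permutations and nothing else.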

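A supplementary approach:

In TXR there is a trie module, used by some of the filtering functions. What we can do is add all the permutations of JAMES to a trie and then convert it to a trie-compressed regex. A function for converting a trie to a regex does not exist, though. No matter, we hack one up. The function converts the trie to regex abstract syntax: a tree made of Lisp S-expressions. We can then compile that with regex-compile. As a side effect, the resulting object has a printed representation rendered in regex character syntax (which is what we are ultimately after: seeing that form):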

;; This produces S-exp based regex abstract syntax (AST) not
;; regex character syntax.
(defun trie-to-regex (trie)
  (typecase trie
    (null nil)
    (cons ^(compound ,(car trie) ,(trie-to-regex (cdr trie))))
    (hash (iflet ((pairs (hash-pairs trie)))
            (reduce-left (ret ^(or ,@1 ,@2))
                         (mapcar (aret ^(compound ,@1 ,(trie-to-regex @2)))
                                 pairs))))))

OK, so now we can do this:

$ txr -i trie-to-regex.tl
1> (defvar tr (make-trie))
tr
2> (perm "JAMES")
("JAMES" "JAMSE" "JAEMS" "JAESM" "JASME" "JASEM" "JMAES" "JMASE"
 "JMEAS" "JMESA" "JMSAE" "JMSEA" "JEAMS" "JEASM" "JEMAS" "JEMSA"
 "JESAM" "JESMA" "JSAME" "JSAEM" "JSMAE" "JSMEA" "JSEAM" "JSEMA"
 "AJMES" "AJMSE" "AJEMS" "AJESM" "AJSME" "AJSEM" "AMJES" "AMJSE"
 "AMEJS" "AMESJ" "AMSJE" "AMSEJ" "AEJMS" "AEJSM" "AEMJS" "AEMSJ"
 "AESJM" "AESMJ" "ASJME" "ASJEM" "ASMJE" "ASMEJ" "ASEJM" "ASEMJ"
 "MJAES" "MJASE" "MJEAS" "MJESA" "MJSAE" "MJSEA" "MAJES" "MAJSE"
 "MAEJS" "MAESJ" "MASJE" "MASEJ" "MEJAS" "MEJSA" "MEAJS" "MEASJ"
 "MESJA" "MESAJ" "MSJAE" "MSJEA" "MSAJE" "MSAEJ" "MSEJA" "MSEAJ"
 "EJAMS" "EJASM" "EJMAS" "EJMSA" "EJSAM" "EJSMA" "EAJMS" "EAJSM"
 "EAMJS" "EAMSJ" "EASJM" "EASMJ" "EMJAS" "EMJSA" "EMAJS" "EMASJ"
 "EMSJA" "EMSAJ" "ESJAM" "ESJMA" "ESAJM" "ESAMJ" "ESMJA" "ESMAJ"
 "SJAME" "SJAEM" "SJMAE" "SJMEA" "SJEAM" "SJEMA" "SAJME" "SAJEM"
 "SAMJE" "SAMEJ" "SAEJM" "SAEMJ" "SMJAE" "SMJEA" "SMAJE" "SMAEJ"
 "SMEJA" "SMEAJ" "SEJAM" "SEJMA" "SEAJM" "SEAMJ" "SEMJA" "SEMAJ")
3> (mapdo (op trie-add tr @1 t) (perm "JAMES")) ;; add above to trie
nil
4> (regex-compile (trie-to-regex tr)) ;; compile, get printed rep as side effect
#/A(E(J(MS|SM)|M(JS|SJ)|S(JM|MJ))|J(E(MS|SM)|M(ES|SE)|S(EM|ME))|M(E(JS|SJ)|J(ES|SE)|S(EJ|JE))|S(E(JM|MJ)|J(EM|ME)|M(EJ|JE)))|
E(A(J(MS|SM)|M(JS|SJ)|S(JM|MJ))|J(A(MS|SM)|M(AS|SA)|S(AM|MA))|M(A(JS|SJ)|J(AS|SA)|S(AJ|JA))|S(A(JM|MJ)|J(AM|MA)|M(AJ|JA)))|
J(A(E(MS|SM)|M(ES|SE)|S(EM|ME))|E(A(MS|SM)|M(AS|SA)|S(AM|MA))|M(A(ES|SE)|E(AS|SA)|S(AE|EA))|S(A(EM|ME)|E(AM|MA)|M(AE|EA)))|
M(A(E(JS|SJ)|J(ES|SE)|S(EJ|JE))|E(A(JS|SJ)|J(AS|SA)|S(AJ|JA))|J(A(ES|SE)|E(AS|SA)|S(AE|EA))|S(A(EJ|JE)|E(AJ|JA)|J(AE|EA)))|
S(A(E(JM|MJ)|J(EM|ME)|M(EJ|JE))|E(A(JM|MJ)|J(AM|MA)|M(AJ|JA))|J(A(EM|ME)|E(AM|MA)|M(AE|EA))|M(A(EJ|JE)|E(AJ|JA)|J(AE|EA)))/

The trie-compressed regex is large in terms of raw character count: over 600 characters.
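For comparison, here is a rough Python rendition of the same idea using nested dicts as the trie. It is not a port of the TXR trie module; the function names are hypothetical, and the sketch assumes all words have the same length, so no end-of-word marker is stored.

```python
import re
from itertools import permutations

def build_trie(words):
    """Nested dicts keyed by character; equal-length words assumed."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root

def node_to_regex(node):
    """Regex for everything below a trie node; branches are parenthesized."""
    alts = [re.escape(ch) + node_to_regex(child) for ch, child in sorted(node.items())]
    if len(alts) <= 1:
        return "".join(alts)
    return "(" + "|".join(alts) + ")"

def trie_regex(words):
    """Top-level alternation over the first characters, without outer parentheses."""
    root = build_trie(words)
    return "|".join(
        re.escape(ch) + node_to_regex(child) for ch, child in sorted(root.items())
    )

rx = trie_regex("".join(p) for p in permutations("JAMES"))
print(len(rx))  # 614
print(rx[:40])
```

With the children sorted alphabetically, the result has the same nesting as the printed TXR regex and comes to 614 characters, which is consistent with "over 600".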

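Answer 3 (score: 1)

Supplementary answer: some code that generates the compressed regex by splitting the input string into a trigraph plus the remaining letters, with each trigraph expanding into its six permutations.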

(defun break-trigraphs (string)
  (if (<= (length string) 3)
    string
    (mapcar (ret (list @1 (break-trigraphs (set-diff string @1))))
            (comb string 3))))

(defun trigraph-tree-to-regex (dtree)
  (typecase dtree
    (str (caseql (length dtree)
           (1 dtree)
           ((2 3) (reduce-right (ret ^(or ,@1 ,@2)) (perm dtree)))
           (t (error "bad trigraph tree"))))
    (cons
      (whenlet ((exprs (collect-each ((elem dtree))
                         ^(compound ,(trigraph-tree-to-regex (first elem))
                                    ,(trigraph-tree-to-regex (second elem))))))
        (reduce-right (ret ^(or ,@1 ,@2)) exprs)))))

REPL interaction:

$ txr -i trigraphs.tl 
1> (break-trigraphs "JAMES")
(("JAM" "ES") ("JAE" "MS") ("JAS" "ME") ("JME" "AS") ("JMS" "AE")
 ("JES" "AM") ("AME" "JS") ("AMS" "JE") ("AES" "JM") ("MES" "JA"))
2> (trigraph-tree-to-regex *1)
(or (compound (or "JAM" (or "JMA" (or "AJM" (or "AMJ" (or "MJA" "MAJ")))))
     (or "ES" "SE"))
  (or (compound (or "JAE" (or "JEA" (or "AJE" (or "AEJ" (or "EJA" "EAJ")))))
       (or "MS" "SM"))
    (or (compound (or "JAS" (or "JSA" (or "AJS" (or "ASJ" (or "SJA" "SAJ")))))
         (or "ME" "EM"))
      (or (compound (or "JME" (or "JEM" (or "MJE" (or "MEJ" (or "EJM" "EMJ")))))
           (or "AS" "SA"))
        (or (compound (or "JMS" (or "JSM" (or "MJS" (or "MSJ" (or "SJM" "SMJ")))))
             (or "AE" "EA"))
          (or (compound (or "JES" (or "JSE" (or "EJS" (or "ESJ" (or "SJE" "SEJ")))))
               (or "AM" "MA"))
            (or (compound (or "AME" (or "AEM" (or "MAE" (or "MEA" (or "EAM" "EMA")))))
                 (or "JS" "SJ"))
              (or (compound (or "AMS" (or "ASM" (or "MAS" (or "MSA" (or "SAM" "SMA")))))
                   (or "JE" "EJ"))
                (or (compound (or "AES" (or "ASE" (or "EAS" (or "ESA" (or "SAE" "SEA")))))
                     (or "JM" "MJ"))
                  (compound (or "MES" (or "MSE" (or "EMS" (or "ESM" (or "SME" "SEM")))))
                   (or "JA" "AJ")))))))))))
3> (regex-compile *2)
#/(JAM|JMA|AJM|AMJ|MJA|MAJ)(ES|SE)|(JAE|JEA|AJE|AEJ|EJA|EAJ)(MS|SM)|(JAS|JSA|AJS|ASJ|SJA|SAJ)(ME|EM)|(JME|JEM|MJE|MEJ|EJM|EMJ)(AS|SA)|(JMS|JSM|MJS|MSJ|SJM|SMJ)(AE|EA)|(JES|JSE|EJS|ESJ|SJE|SEJ)(AM|MA)|(AME|AEM|MAE|MEA|EAM|EMA)(JS|SJ)|(AMS|ASM|MAS|MSA|SAM|SMA)(JE|EJ)|(AES|ASE|EAS|ESA|SAE|SEA)(JM|MJ)|(MES|MSE|EMS|ESM|SME|SEM)(JA|AJ)/

This compresses to 329 characters.
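The trigraph split can be reproduced with `itertools` as a cross-check. This Python sketch is not a port of the TXR code: the names are hypothetical, and it assumes a word of distinct letters that splits into one trigraph plus at most three remaining letters. There are ten 3-letter combinations of a 5-letter word, so the result has ten branches.

```python
import re
from itertools import combinations, permutations

def alternatives(letters):
    """All orderings of letters as a parenthesized alternation."""
    perms = ["".join(p) for p in permutations(letters)]
    return perms[0] if len(perms) == 1 else "(" + "|".join(perms) + ")"

def trigraph_regex(word):
    """One branch per 3-letter combination, followed by the leftover letters."""
    branches = []
    for tri in combinations(word, 3):
        rest = "".join(c for c in word if c not in tri)
        branches.append(alternatives(tri) + alternatives(rest))
    return "|".join(branches)

rx = trigraph_regex("JAMES")
covered = sum(1 for p in permutations("JAMES") if re.fullmatch(rx, "".join(p)))
print(len(rx), rx.count("|(") + 1, covered)  # 329 10 120
```

Each branch is 32 characters, so ten branches joined by nine '|' characters give 10 * 32 + 9 = 329, and every one of the 120 permutations is matched.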