Splitting a string that contains special characters, words, numbers, and URLs

Asked: 2014-10-24 16:02:41

Tags: regex python-3.x

I have a .txt file that contains:

"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!."

After parsing, the expected output should be:

['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']

How can I split this string with the re module to get that output?

I came up with this pattern:

pattern = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

But it also splits my URL into pieces. Can anyone help?
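To make the problem concrete, here is a minimal sketch that applies the pattern above to just the URL part of the input. The variable name `url_tokens` is only illustrative. None of the three alternatives can match `://` or `.` as part of a longer token, so the URL comes back in fragments.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

# Apply it to the URL alone to see how it gets broken up.
url_tokens = pattern.findall("https://www.google.com/")
print(url_tokens)
# ['https', ':', '/', '/', 'www', '.', 'google', '.', 'com', '/']
```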

1 answer:

Answer 0 (score: 0)

Pick a URL regex from somewhere and put it first in the alternation. Just as an example:

   #  (?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/[^\s]*)?|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]


   (?! mailto: )
   (?:
        (?: https? | ftp )
        ://
   )?
   (?:
        \S+ 
        (?: : \S* )?
        @
   )?
   (?:
        (?:
             (?:
                  [1-9] \d? 
               |  1 \d\d 
               |  2 [01] \d 
               |  22 [0-3] 
             )
             (?:
                  \.
                  (?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
             ){2}
             (?:
                  \.
                  (?:
                       [1-9] \d? 
                    |  1 \d\d 
                    |  2 [0-4] \d 
                    |  25 [0-4] 
                  )
             )
          |  (?:
                  (?: [a-z\u00a1-\uffff0-9]+ -? )*
                  [a-z\u00a1-\uffff0-9]+ 
             )
             (?:
                  \.
                  (?: [a-z\u00a1-\uffff0-9]+ -? )*
                  [a-z\u00a1-\uffff0-9]+ 
             )*
             (?:
                  \.
                  (?: [a-z\u00a1-\uffff]{2,} )
             )
        )
     |  localhost
   )
   (?: : \d{2,5} )?
   (?: / [^\s]* )?
|  \d+ 
|  [a-zA-Z]+ [a-zA-Z']* 
|  [^\w\s] 

Output:

 ['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
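The idea in this answer can be tried with a much shorter pattern. The sketch below is an assumption, not the regex from the answer: `URL` is a deliberately simple stand-in (a scheme, `://`, then any run of non-whitespace), and the names `URL`, `TOKEN`, `text`, and `tokens` are illustrative. What matters is that the URL alternative comes first, so it wins at any position where a URL starts, before the word and punctuation alternatives get a chance to match.

```python
import re

# Hypothetical, simplified URL alternative (not the long regex from the answer):
# a scheme, "://", then everything up to the next whitespace.
URL = r"(?:https?|ftp)://\S+"

# URL alternative first, followed by the three alternatives from the question.
TOKEN = re.compile(URL + r"|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]", re.IGNORECASE)

# The sample line from the question, including its surrounding double quotes.
text = "\"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!.\""

# No capturing groups are used, so findall returns the full matches in order.
tokens = TOKEN.findall(text)
print(tokens)
```

With this input the result matches the expected list from the question. Note that `\S+` will also swallow punctuation that directly follows a URL, such as a trailing period or closing parenthesis, so a stricter URL pattern like the one above is the safer choice for real text. The expanded layout in the answer appears to be free-spacing notation, which in Python would mean compiling it with `re.VERBOSE`. Because its character classes list only lowercase letters, adding `re.IGNORECASE` is probably wanted as well.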