Replace all URLs with the generic word URL using regular expression matching

Time: 2015-03-19 15:38:16

Tags: regex linux sed

I'm new to regular expression matching. Suppose I want to find all URLs in a comma-separated text file and replace them with the word "url".

user,user,' http://twitpic.com/2y1zl - awww, that\'s a bummer.    you shoulda got david carr of third day to do it. ;d',0   
user,user,'is upset that he can\'t update his facebook by texting it... and might cry as a result  school today also. blah!',0   
user,user,' i dived many times for the ball. http://twitpic.com/2y1zl managed to save 50\%  the rest go out of bounds',0  
user,user,'my whole body feels itchy and like its on fire ',0  
user,user,' no, it\'s not behaving at all. i\'m mad. why am i here? because i can\'t see you all over there. ',0  
user,user,' not the whole crew ',0   
user,user,'need a hug ',0   
user,user,' hey  long time no see! yes.. rains a bit ,only a bit  lol , i\'m fine thanks , how\'s you ?',0    
user,user,'_k nope they didn\'t have it ',0   
user,user,'que me muera ? ',0   
user,user,'spring break in plain city... it\'s snowing ',0  
user,user,'i just re-pierced my ears ',0   

I'd like the output to look like this:

user,user,' *url*- awww, that\'s a bummer.    you shoulda got david carr of third day to do it. ;d',0   
user,user,'is upset that he can\'t update his facebook by texting it... and might cry as a result  school today also. blah!',0   
user,user,' i dived many times for the ball. *url* managed to save 50\%  the rest go out of bounds',0  
user,user,'my whole body feels itchy and like its on fire ',0  
user,user,' no, it\'s not behaving at all. i\'m mad. why am i here? because i can\'t see you all over there. ',0  
user,user,' not the whole crew ',0   
user,user,'need a hug ',0   
user,user,' hey  long time no see! yes.. rains a bit ,only a bit  lol , i\'m fine thanks , how\'s you ?',0    
user,user,'nope they didn\'t have it ',0   
user,user,'que me muera ? ',0   
user,user,'spring break in plain city... it\'s snowing ',0  
user,user,'i just re-pierced my ears ',0   

I tried sed:

sed -e 's/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$//URL/' filename.txt  |less

The find-and-replace regex doesn't work.

2 answers:

Answer 0: (score: 0)

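GNU sed's default regular expression syntax needs a lot of backslashes (ref: https://www.gnu.org/software/gnulib/manual/html_node/Regular-expression-syntaxes.html#Regular-expression-syntaxes). Also, sed regular expressions do not understand Perl's \d, and \w is available only as a GNU extension.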
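As a small illustration of those two points, here is a sketch assuming GNU sed. The function names and the sample input are made up for the example. The same substitution is written once in the default basic syntax and once in extended syntax (-E), with [0-9] standing in for Perl's \d:

```shell
# Basic regular expressions (sed's default): "one or more" is written \+,
# which is a GNU extension to the basic syntax.
mask_digits_bre() { sed 's/[0-9]\+/N/g'; }

# Extended regular expressions (-E): + works without a backslash.
mask_digits_ere() { sed -E 's/[0-9]+/N/g'; }

# Both commands print: abcN defN
printf 'abc123 def45\n' | mask_digits_bre
printf 'abc123 def45\n' | mask_digits_ere
```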

Matching URLs is a very hard problem. Start with:

sed  's@https\?://[^[:blank:]]\+@*url*@g' file

This uses an alternate delimiter for the s/// command, so the slashes do not need to be escaped.
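To see the command at work, the sketch below wraps an extended-syntax (-E) equivalent in a helper and feeds it one shortened line modeled on the question's data. The helper name replace_urls and the shortened line are assumptions made for the example, and the -E form assumes a sed that accepts that option (GNU sed does):

```shell
# Hypothetical helper: same pattern as above, written with -E so that
# ? and + need no backslashes. @ is the delimiter, so / stays unescaped.
replace_urls() { sed -E 's@https?://[^[:blank:]]+@*url*@g'; }

printf '%s\n' "user,user,' i dived for the ball. http://twitpic.com/2y1zl managed to save',0" | replace_urls
# prints: user,user,' i dived for the ball. *url* managed to save',0
```

Because the pattern consumes everything up to the next blank, a URL that sits directly before the closing quote would take the trailing ',0 with it; that is one of the reasons this is only a starting point.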

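Answer 1: (score: 0)

If your URLs are separated from whatever follows by a space, or by any character that cannot occur in a URL, this should work.

I'm not handling non-http URLs or user/password combinations here; just http/https followed by a series of characters that are allowed in a URL.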

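sed -e 's|https\?://[][0-9a-zA-Z._~:/?#@!$&()*+,;=%'\''-]\+|URL|g' 
  • I use | as the delimiter to make the slashes easier to handle. @ is itself part of the character class, so it cannot safely double as the delimiter.
  • Letters are written as a-zA-Z; a range such as a-Z is invalid or locale-dependent.
  • Since square brackets and the dash are allowed, I put them at the very beginning and the very end of the character class, respectively.
  • To include the single quote, you have to close the shell's quoted string, insert an escaped quote, and reopen the string, so it ends up as: '\''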
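A quick check of a character-class pattern of this kind, as a sketch assuming GNU sed. It is written here with | as the delimiter and an explicit a-zA-Z range; the helper name replace_urls_strict and the example.com sample are hypothetical:

```shell
# Hypothetical helper: replace http/https URLs made only of characters
# listed in the class. Inside the brackets, ] comes first and - comes last
# so that both are taken literally. The '\'' sequence closes the shell
# quote, adds a literal single quote, and reopens the quote.
replace_urls_strict() {
  sed -e 's|https\?://[][0-9a-zA-Z._~:/?#@!$&()*+,;=%'\''-]\+|URL|g'
}

printf '%s\n' 'see <http://example.com/a?b=1> and "https://example.com/x"' | replace_urls_strict
# prints: see <URL> and "URL"
```

The match stops at > and at the double quote because neither character is in the class, whereas a pattern that only excludes blanks would have consumed them.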