Using a regular expression in a shell script to remove noisy fragments from text extracted from HTML

Asked: 2014-08-12 18:50:50

Tags: regex bash shell

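In my script I loop over a list of URLs, send each page to Apache Tika, and take a chunk of the text it extracts from the HTML for further processing.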

    while IFS= read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000; done < ~/file_with_urls.txt

Example URLs:

    http://dailycurrant.com/2014/01/02/marijuana-overdoses-kill-37-in-colorado-on-first-day-of-legalization/
    http://www.sott.net/article/271748-Father-sentenced-to-6-months-in-jail-for-paying-too-much-child-support
    http://www.sunnyskyz.com/blog/79/The-27-Naughtiest-Cats-In-The-World-And-I-Can-t-Stop-Laughing

In the shell script I want to skip or remove everything of the form [image: some text ] and [bookmark: some text ]. For example, given this extracted text:

    [image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here

From the text above, I need only "Plain text starts here".

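Can I do this with a regular expression, using GNU grep with the -P option (which enables PCRE, Perl-compatible regular expressions), similar to what is recommended here?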

    while IFS= read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000 | grep -Po '_regex that will do the trick_'; done < ~/file_with_urls.txt

1 answer:

Answer 0 (score: 1)

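You can use this awk command, which treats every bracketed marker (plus any spaces after it) as a field separator and prints only the non-empty fields: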

    str='[image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here'
    awk 'BEGIN{FS="\\[[^]]*\\] *"} {for (i=1; i<=NF; i++) if ($i) print $i}' <<< "$str"
    Plain text starts here

Here $str stands for the long string given above.
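
As an alternative that reads from a pipe rather than a variable, the same idea can be sketched with sed. This is not part of the original answer: the function name `strip_markers` is hypothetical, and the sketch assumes the markers never contain a `]` inside them and that only `image` and `bookmark` markers need removing.

```shell
#!/bin/sh
# Hypothetical helper: remove "[image: ...]" and "[bookmark: ...]" markers
# (and any whitespace directly after them) from text read on stdin.
strip_markers() {
  # -E selects extended regular expressions.
  # [^]]* matches everything up to the closing bracket of the marker.
  sed -E 's/\[(image|bookmark):[^]]*\][[:space:]]*//g'
}

# Shortened sample in the same shape as the Tika output shown in the question.
sample='[image: USA][bookmark: comment96580][image: Loading...] Plain text starts here'
printf '%s\n' "$sample" | strip_markers
```

In the loop from the question, this filter would take the place of the `grep -Po` stage, for example `... | head -1000 | strip_markers`. Unlike the awk command, which splits on any bracketed token and prints each remaining piece on its own line, this version leaves other bracketed text untouched and keeps each input line on one line.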