In my script I loop over URLs and take a chunk of the text that Apache Tika extracts from the fetched HTML for further processing:
while read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000; done < ~/file_with_urls.txt
Example URLs:
http://dailycurrant.com/2014/01/02/marijuana-overdoses-kill-37-in-colorado-on-first-day-of-legalization/
http://www.sott.net/article/271748-Father-sentenced-to-6-months-in-jail-for-paying-too-much-child-support
http://www.sunnyskyz.com/blog/79/The-27-Naughtiest-Cats-In-The-World-And-I-Can-t-Stop-Laughing
In the shell script I want to do the following: skip or remove everything of the form [image: some text ] or [bookmark: some text ], so that from output like this:
[image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here
I only need "Plain text starts here" from the above.
I can do this with a regular expression using GNU grep with the -P option (which enables PCRE, Perl-compatible regular expression, support), similar to what is recommended here:
while read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000 | grep -Po '_regex that will do the trick_'; done < ~/file_with_urls.txt
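For what it's worth, one candidate regex (a sketch only, assuming the [image: ...] and [bookmark: ...] tags never contain a ] themselves and sit in front of the plain text on the same line) uses PCRE's \K to discard the leading tags and keep the rest of the line:

grep -Po '^(\[[^]]*\]\s*)*\K.*'

An alternative that strips the bracketed tags wherever they appear, without needing PCRE, would be a sed substitution such as sed 's/\[[^]]*\] *//g'.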
Answer 0 (score: 1)
You can use this awk one-liner:
str='[image: USA][image: Map][image: Print][image: Hall and Son][image: Google+][image: FB Share][image: ][image: Email][image: Print this article][image: Discuss on Cassiopaea Forum][image: Pin it][bookmark: comment96580][bookmark: reply18433][bookmark: reply18457][bookmark: reply18484][bookmark: reply18487][bookmark: comment96583][image: Hugh Mann][bookmark: comment96595][image: Animanarchy][bookmark: reply18488][bookmark: comment96610][bookmark: reply18485][bookmark: comment96632][image: Close][image: Loading...] Plain text starts here'
awk 'BEGIN{FS="\\[[^]]*\\] *"} {for (i=1; i<=NF; i++) if ($i) print $i}' <<< "$str"
Plain text starts here
Here $str stands for the long string given above. The awk sets the field separator to any bracketed [...] tag (plus any trailing spaces), so the tags themselves vanish as separators and only the non-empty fields that remain, i.e. the plain text, are printed.
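Plugged into the pipeline from the question (a sketch, assuming Tika emits the bracketed tags and the plain text on the same line, as in the sample), it would look like:

while read -r p; do curl -s "$p" | curl -X PUT -T - http://10.0.2.208:9998/tika | head -1000 | awk 'BEGIN{FS="\\[[^]]*\\] *"} {for (i=1; i<=NF; i++) if ($i) print $i}'; done < ~/file_with_urls.txt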