Extracting text from a web page and trimming it

Time: 2013-04-09 07:28:55

Tags: shell trim

wget --output-document=- http://www.tip.it/runescape/grand-exchange-centre 2>/dev/null \
| grep "The Grand Exchange updated"

This outputs the following:

<h4 id="gec_update_time">The Grand Exchange updated <span><b>1</b> days, <b>12</b> hours, <b>45</b> minutes and <b>1</b> seconds ago</span></h4>

My goal is to trim it so that it outputs only:

1 days, 12 hours, 45 minutes, 1 seconds

I'm not very good at this. Any tips?

2 answers:

Answer 0 (score: 1)

You could write a short Ruby script:

gem install sanitize

Create a file named "cleaner.rb":

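#!/usr/bin/env ruby
# Reads one line of HTML from stdin and prints its text content.
require 'rubygems'
require 'sanitize'

# Sanitize.clean is the pre-3.0 API; newer releases of the gem call it Sanitize.fragment.
# String#strip (Ruby has no String#trim) removes the surrounding whitespace.
puts Sanitize.clean(gets).strip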

Then make it executable with chmod +x cleaner.rb and run:

wget --output-document=- http://www.tip.it/runescape/grand-exchange-centre 2>/dev/null | grep "The Grand Exchange updated" | ./cleaner.rb

This gives you: "The Grand Exchange updated 1 days, 13 hours, 0 minutes and 56 seconds ago"

Answer 1 (score: 1)

If using lynx is an option, you get this for free:

$ lynx -dump http://www.tip.it/runescape/grand-exchange-centre | grep "The Grand Exchange updated"
The Grand Exchange updated 1 days, 19 hours, 8 minutes and 48 seconds ago

If needed, you can strip the leading text from it:

$ foo="$(lynx -dump http://www.tip.it/runescape/grand-exchange-centre | grep "The Grand Exchange updated")"
$ echo "${foo#*updated }"
1 days, 19 hours, 9 minutes and 8 seconds ago
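The `${foo#*updated }` expansion removes the shortest prefix matching the pattern, and its counterpart `${var%pattern}` does the same at the end of the string. A minimal sketch, assuming a hard-coded sample line in place of the live lynx output (the variable name `rest` is only illustrative):

```shell
# Sample line copied from the lynx output above, so no network access is needed.
foo="The Grand Exchange updated 1 days, 19 hours, 9 minutes and 8 seconds ago"

# ${var#pattern}: drop the shortest match of pattern from the front.
rest="${foo#*updated }"

# ${var%pattern}: drop the shortest match of pattern from the end.
rest="${rest% ago}"

echo "$rest"
```

Both expansions are plain POSIX shell, so this should not need any external tool once the line has been captured.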

If you absolutely want to use wget and strip the tags, you can use something like this:

$ wget --output-document=- http://www.tip.it/runescape/grand-exchange-centre 2>/dev/null | grep "The Grand Exchange updated" | sed -e 's/<[^>]\+>//g' -e 's/The Grand Exchange updated //'
1 days, 19 hours, 17 minutes and 2 seconds ago

The first option is probably the better choice.
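None of the commands above produce the exact format asked for in the question (a comma instead of "and", and no trailing "ago"). The wget and sed variant could be extended along these lines. This is only a sketch: it runs on the sample line from the question rather than the live page, `trim_update_time` is a made-up helper name, and it assumes the page keeps the same wording.

```shell
# Sample line from the question, so this runs without network access.
html='<h4 id="gec_update_time">The Grand Exchange updated <span><b>1</b> days, <b>12</b> hours, <b>45</b> minutes and <b>1</b> seconds ago</span></h4>'

# Hypothetical helper: reads the matching HTML line on stdin, prints the bare duration.
trim_update_time() {
  # 1. strip every tag  2. drop the leading label
  # 3. turn " and " into ", "  4. drop the trailing " ago"
  sed -e 's/<[^>]*>//g' \
      -e 's/^.*The Grand Exchange updated //' \
      -e 's/ and /, /' \
      -e 's/ ago *$//'
}

result="$(printf '%s\n' "$html" | trim_update_time)"
echo "$result"
```

With the live page, the same function would go at the end of the wget and grep pipeline in place of the sed call. The tag pattern uses `[^>]*` rather than `[^>]\+` because `\+` is a GNU extension and may not work in other sed implementations.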