Extract data between tags in a shell script and store it in an array?

Time: 2017-03-24 03:37:01

Tags: bash shell

I am using a bash script to fetch values from a URL, and the values come back as HTML markup. Input:

<tr><td title='The name of the health check service.'>hc.name</td><td data-type='java.lang.String'>Replication Queue</td></tr>
<tr><td title='The persistence identifier of the service.'>service.pid</td><td data-type='java.lang.String'>com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck</td></tr>
<tr><td title='The health check result'>ok</td><td data-type='java.lang.Boolean'>true</td></tr>
<tr><td title='The health check status'>status</td><td data-type='java.lang.String'>OK</td></tr>
<tr><td title='The elapsed time in miliseconds'>elapsedTime</td><td data-type='java.lang.Long'>18</td></tr>
<tr><td title='The date when the execution finished'>finishedAt</td><td data-type='java.util.Date'>2017-03-24T00:23:36+0530</td></tr>
<tr><td title='Indicates of the execution timed out'>timedOut</td><td data-type='java.lang.Boolean'>false</td></tr>

The desired output should be stored in a variable, with the values taken from the <td> tags in the markup above:

values=( ["hc.name"]="Replication Queue" ["status"]="OK")

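I tried the following sed command, but it only works when the <td></td> elements are on separate lines. In this case, several <td></td> elements are on the same line.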

sed -n 's:.*<td>\(.*\)</td>.*:\1:p' sample.txt

The command only produces results for input like this:

<tr>
<td>ok</td>
<td>status</td>
</tr>

2 answers:

Answer 0: (score: 0)

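I think you will have better luck with Perl regular expressions, because they support non-greedy matching. Here is a Perl one-liner that prints the information from the file: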

perl -ne 'm:.*?<td [^>]*>(.*?)</td>.*?<td [^>]*>(.*?)</td>:; print "[\"$1\"] = \"$2\"\n";' sample.txt

Output:

["hc.name"] = "Replication Queue"
["service.pid"] = "com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"] = "true"
["status"] = "OK"
["elapsedTime"] = "18"
["finishedAt"] = "2017-03-24T00:23:36+0530"
["timedOut"] = "false"

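Here is a sed invocation that also works. It is less precise, because it approximates non-greedy matching, which sed does not support, by matching any characters other than > and <.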

sed -n 's:.*<td [^>]*>\([^<]*\)</td><td [^>]*>\([^<]*\)</td>.*:[\"\1\"] = \"\2\":p' sample.txt

Output:

["hc.name"] = "Replication Queue"
["service.pid"] = "com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"] = "true"
["status"] = "OK"
["elapsedTime"] = "18"
["finishedAt"] = "2017-03-24T00:23:36+0530"
["timedOut"] = "false"
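The same negated-character-class trick also works with bash's own `[[ ... =~ ... ]]` operator, which avoids the external tool and lets the script fill the array directly. This is a minimal sketch, assuming bash 4 or later and one key/value pair of `<td>` cells per line, as in the question's input. The variable names `re`, `line` and `sample` are chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: requires bash 4+ (associative arrays).

# Two rows from the question, written to a temporary file (stand-in for sample.txt).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr><td title='The name of the health check service.'>hc.name</td><td data-type='java.lang.String'>Replication Queue</td></tr>
<tr><td title='The health check status'>status</td><td data-type='java.lang.String'>OK</td></tr>
EOF

declare -A values

# [^>]* skips the attributes of each <td>; [^<]* captures the cell text.
re="<td [^>]*>([^<]*)</td><td [^>]*>([^<]*)</td>"

while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        # BASH_REMATCH[1] is the first cell (key), BASH_REMATCH[2] the second (value).
        values["${BASH_REMATCH[1]}"]=${BASH_REMATCH[2]}
    fi
done < "$sample"

rm -f "$sample"

# Print every stored pair; the iteration order of an associative array is unspecified.
for key in "${!values[@]}"; do
    printf '["%s"]="%s"\n' "$key" "${values[$key]}"
done
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of `=~` makes bash treat it as a regular expression rather than a literal string.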

Answer 1: (score: 0)

An sgrep + sed approach (more reliable than pure sed):

sgrep -o'%r"' '">" __ "<"' sample.txt | sed 's/^/["/;s/""/"/;s/"/"]="/2'

Output:

["hc.name"]="Replication Queue"
["service.pid"]="com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"]="true"
["status"]="OK"
["elapsedTime"]="18"
["finishedAt"]="2017-03-24T00:23:36+0530"
["timedOut"]="false"