如何在两个字符串之间grep?

时间:2018-11-11 23:34:57

标签: html json bash grep

我有一个文本文件:

<span class="html-tag">&lt;script&gt;</span></td></tr><tr><td class="line-number" value="1431"></td><td class="line-content">            var awbManifests = {"requestId":"16d1-4451-9b12-f61a87e9cd11","errorMessage":null,"errorCode":null,"success":true,"content":[{"id":"5ec8-444e-9d5b-f7487ce592c2","storeId":"10001","createdDate":1541923869937,"createdBy":"asdf","updatedDate":1541968417296,"updatedBy":"dsa","type":"airwaybill","value":"5468468464568466","logisticTrackingID":"5468468464568466","senderName":"dasdf","senderAddress":"Batuceper","receiverName":"ATIK","receiverAddress":"JL. SRIKATON BARAT\n","manifestList":[{"logisticProviderCode":"asd","blibliAirwayBillNumber":"5468468464568466","status":"DEPARTED FROM TRANSIT [GATEWAY JAKARTA]","timestamp":1541976677000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT ORIGIN GATEWAY [GATEWAY JAKARTA]","timestamp":1541976343000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"PROCESSED AT SORTING CENTER [JAKARTA]","timestamp":1541968348000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT SORTING CENTER [JAKARTA]","timestamp":1541960930000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"SHIPMENT RECEIVED BY asdf COUNTER OFFICER AT [JAKARTA]","timestamp":1541926728000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]}]}],"pageMetaData":null};</td></tr><tr><td class="line-number" value="1432"></td><td class="line-content">            var ordersTracking = [{"orderItemId":"53000116530","product":null,"shipment":"asdf","airwaybillNumber":"5468468464568466","receiver":null,"receivedDate":null,"relation":null,"status":"Valid","productType":"Regular","eligibleForFeedback":false,"feedback":null,"invalidAWBJiraNumber":"","mismatchAWBJiraNumber":"","isAirwayBillValid":true,"mismatchAirwayBill":false}];

我想从var awbManifests =直到第一个;符号获取结果,因此输出应仅是这样的JSON格式:

{"requestId":"16d1-4451-9b12-f61a87e9cd11","errorMessage":null,"errorCode":null,"success":true,"content":[{"id":"5ec8-444e-9d5b-f7487ce592c2","storeId":"10001","createdDate":1541923869937,"createdBy":"asdf","updatedDate":1541968417296,"updatedBy":"dsa","type":"airwaybill","value":"5468468464568466","logisticTrackingID":"5468468464568466","senderName":"dasdf","senderAddress":"Batuceper","receiverName":"ATIK","receiverAddress":"JL. SRIKATON BARAT\n","manifestList":[{"logisticProviderCode":"asd","blibliAirwayBillNumber":"5468468464568466","status":"DEPARTED FROM TRANSIT [GATEWAY JAKARTA]","timestamp":1541976677000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT ORIGIN GATEWAY [GATEWAY JAKARTA]","timestamp":1541976343000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"PROCESSED AT SORTING CENTER [JAKARTA]","timestamp":1541968348000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT SORTING CENTER [JAKARTA]","timestamp":1541960930000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"SHIPMENT RECEIVED BY asdf COUNTER OFFICER AT [JAKARTA]","timestamp":1541926728000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]}]}],"pageMetaData":null}

到目前为止,我只能执行此操作,但是此命令不能对所有json字符串进行grep:

grep -o -P '(?<=var awbManifests = ).*(?=pageMetaData)' test.html

我该如何解决?

3 个答案:

答案 0 :(得分:1)

Please don't use regular expressions to parse html。如果您问我,使用{@ 3}}这样的可解析HTML / XML JSON的工具会更好。

解析相关的<td>元素节点:

xidel -s input.htm -e '//td[@value="1431"]/following-sibling::td'
 var awbManifests = {"requestId":"16d1-4451-9b12-f61a87e9cd11",[...],"pageMetaData":null};

隔离JSON:

xidel -s input.htm -e '
  //td[@value="1431"]/substring-before(substring-after(following-sibling::td,"awbManifests = "),";")
'
#or
xidel -s input.htm -e '
  //td[@value="1431"]/extract(following-sibling::td,"awbManifests = (.+);",1)
'
{"requestId":"16d1-4451-9b12-f61a87e9cd11",[...],"pageMetaData":null}

解析JSON:

xidel -s input.htm -e '
  json(
    //td[@value="1431"]/substring-before(substring-after(following-sibling::td,"awbManifests = "),";")
  )
'
{
  "requestId": "16d1-4451-9b12-f61a87e9cd11",
  "errorMessage": null,
  "errorCode": null,
  "success": true,
  "content": [...],
  "pageMetaData": null
}

答案 1 :(得分:0)

我花了一段时间才了解正在发生的事情。之所以有趣,是因为您在正则表达式中使用了lookbehinds((?<=))和lookaheads,而这些是我几乎从未使用过的非常有用的构造。

先行/后向的原理是look *组内部的字符串是匹配的,但在匹配的字符串中不存在。这对于grep -o非常有用。 (?<=var awbManifests = )的后向标记已正确使用,但前行(?=pageMetaData)应该为(?=;)。但是现在,您遇到的问题是正则表达式匹配太多文本。

默认情况下,正则表达式量词+*{n,m}是贪婪的,这意味着它们会尝试匹配尽可能多的文本。在perl模式(-P中,非贪婪量词在grep中可用。它们的语法是+?的{​​{1}},+的{​​{1}}和*?的{​​{1}}。

在这里应用,要使用的正则表达式为:

*

但是仍然存在一个问题:如果其中一个JSON字符串包含{n,m}?,则JSON将不会完全匹配。要说明字符串,请改用:

{n,m}

在上述正则表达式中:

  • grep -o -P '(?<=var awbManifests = ).*?(?=;)' test.html 除双引号外都匹配。
  • ;反斜杠必须转义,因为它们在正则表达式中具有含义。
  • grep -o -P '(?<=var awbManifests = )[^"]*("([^"]|\\")*"[^"]*?)*?(?=;)' test.html 匹配一个字符串,并考虑了转义的引号。
  • [^"]匹配多个字符串,其前后带有文本。此模式与完整的JSON匹配,但很贪心。
  • \\以非贪婪的方式匹配JSON。

答案 2 :(得分:0)

借助Perl,您还可以使用非贪婪量词来提取它们

> cat regex_jsoon.dat
<span class="html-tag">&lt;script&gt;</span></td></tr><tr><td class="line-number" value="1431"></td><td class="line-content">            var awbManifests = {"requestId":"16d1-4451-9b12-f61a87e9cd11","errorMessage":null,"errorCode":null,"success":true,"content":[{"id":"5ec8-444e-9d5b-f7487ce592c2","storeId":"10001","createdDate":1541923869937,"createdBy":"asdf","updatedDate":1541968417296,"updatedBy":"dsa","type":"airwaybill","value":"5468468464568466","logisticTrackingID":"5468468464568466","senderName":"dasdf","senderAddress":"Batuceper","receiverName":"ATIK","receiverAddress":"JL. SRIKATON BARAT\n","manifestList":[{"logisticProviderCode":"asd","blibliAirwayBillNumber":"5468468464568466","status":"DEPARTED FROM TRANSIT [GATEWAY JAKARTA]","timestamp":1541976677000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT ORIGIN GATEWAY [GATEWAY JAKARTA]","timestamp":1541976343000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"PROCESSED AT SORTING CENTER [JAKARTA]","timestamp":1541968348000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT SORTING CENTER [JAKARTA]","timestamp":1541960930000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"SHIPMENT RECEIVED BY asdf COUNTER OFFICER AT [JAKARTA]","timestamp":1541926728000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]}]}],"pageMetaData":null};</td></tr><tr><td class="line-number" value="1432"></td><td class="line-content">            var ordersTracking = [{"orderItemId":"53000116530","product":null,"shipment":"asdf","airwaybillNumber":"5468468464568466","receiver":null,"receivedDate":null,"relation":null,"status":"Valid","productType":"Regular","eligibleForFeedback":false,"feedback":null,"invalidAWBJiraNumber":"","mismatchAWBJiraNumber":"","isAirwayBillValid":true,"mismatchAirwayBill":false}];

> perl -ne ' { s/.*var awbManifests = (.*?);.*/\1/g; print } ' regex_jsoon.dat
{"requestId":"16d1-4451-9b12-f61a87e9cd11","errorMessage":null,"errorCode":null,"success":true,"content":[{"id":"5ec8-444e-9d5b-f7487ce592c2","storeId":"10001","createdDate":1541923869937,"createdBy":"asdf","updatedDate":1541968417296,"updatedBy":"dsa","type":"airwaybill","value":"5468468464568466","logisticTrackingID":"5468468464568466","senderName":"dasdf","senderAddress":"Batuceper","receiverName":"ATIK","receiverAddress":"JL. SRIKATON BARAT\n","manifestList":[{"logisticProviderCode":"asd","blibliAirwayBillNumber":"5468468464568466","status":"DEPARTED FROM TRANSIT [GATEWAY JAKARTA]","timestamp":1541976677000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT ORIGIN GATEWAY [GATEWAY JAKARTA]","timestamp":1541976343000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"PROCESSED AT SORTING CENTER [JAKARTA]","timestamp":1541968348000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT SORTING CENTER [JAKARTA]","timestamp":1541960930000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"SHIPMENT RECEIVED BY asdf COUNTER OFFICER AT [JAKARTA]","timestamp":1541926728000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]}]}],"pageMetaData":null}

>

或另外一个没有替代的版本。

> perl -ne ' { print "$x\n"  if /var awbManifests = (.*?);/osg  and $x=$1 } ' regex_jsoon.dat
{"requestId":"16d1-4451-9b12-f61a87e9cd11","errorMessage":null,"errorCode":null,"success":true,"content":[{"id":"5ec8-444e-9d5b-f7487ce592c2","storeId":"10001","createdDate":1541923869937,"createdBy":"asdf","updatedDate":1541968417296,"updatedBy":"dsa","type":"airwaybill","value":"5468468464568466","logisticTrackingID":"5468468464568466","senderName":"dasdf","senderAddress":"Batuceper","receiverName":"ATIK","receiverAddress":"JL. SRIKATON BARAT\n","manifestList":[{"logisticProviderCode":"asd","blibliAirwayBillNumber":"5468468464568466","status":"DEPARTED FROM TRANSIT [GATEWAY JAKARTA]","timestamp":1541976677000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT ORIGIN GATEWAY [GATEWAY JAKARTA]","timestamp":1541976343000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"PROCESSED AT SORTING CENTER [JAKARTA]","timestamp":1541968348000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"RECEIVED AT SORTING CENTER [JAKARTA]","timestamp":1541960930000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]},{"logisticProviderCode":"asdf","blibliAirwayBillNumber":"5468468464568466","status":"SHIPMENT RECEIVED BY asdf COUNTER OFFICER AT [JAKARTA]","timestamp":1541926728000,"additionalInfo":[{"label":"Third Party Tracking ID","value":null,"type":"STRING","description":"Third Party Tracking ID"}]}]}],"pageMetaData":null}
>