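AWK - How can I modify my AWK script to ignore lines in the file that do not contain the matching pattern?

Asked: 2018-03-31 20:52:57

Tags: awk

I have a text file containing data in the following format. This is a sample of the data it contains. The file is correct and well-formed: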

 <node id="1647008557" lat="36.6536840" lon="-121.7938995" version="1" timestamp="2012-02-25T14:03:54Z" changeset="10787766" uid="294728" user="skew-t">
  <tag k="highway" v="turning_circle"/>
  </node>
  <way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <nd ref="89798120"/>
    <nd ref="89798122"/>
    <nd ref="89798124"/>
    <nd ref="89798126"/>
    <nd ref="89798128"/>
    <nd ref="89798130"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Livingston, CA"/>
    <tag k="tiger:name_base" v="Engineer"/>
    <tag k="tiger:name_type" v="Rd"/>
    <tag k="tiger:reviewed" v="no"/>
    <tag k="tiger:separated" v="no"/>
    <tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
    <tag k="tiger:tlid" v="196844016"/>
  </way>
  <way id="10461171" version="3" timestamp="2014-01-07T00:17:59Z" changeset="19855176" uid="1871178" user="RBoggs">
    <nd ref="89804458"/>
    <nd ref="89804460"/>
    <nd ref="89804463"/>
    <nd ref="89804464"/>
    <nd ref="89804466"/>
    <nd ref="89804468"/>
    <tag k="access" v="no"/>
    <tag k="highway" v="residential"/>
    <tag k="motor_vehicle" v="no"/>
    <tag k="name" v="5th Cutoff Street"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Marysville, CA"/>
    <tag k="tiger:name_base" v="5th Cutoff"/>
    <tag k="tiger:name_type" v="St"/>
    <tag k="tiger:reviewed" v="no"/>
    </way>
<way id="151860745" version="1" timestamp="2012-02-25T14:03:59Z" changeset="10787766" uid="294728" user="skew-t">
    <nd ref="1647008614"/>
    <nd ref="1647008545"/>
    <nd ref="1647008605"/>
    <nd ref="1647008555"/>
    <nd ref="1647008557"/>
    <tag k="highway" v="service"/>
  </way>

I am trying to print the road name found in the way id section together with the way id itself, the sequence number at which each nd ref occurs, and the nd ref id.

The correct output looks like this:

$ awk -f table.awk file.txt | head
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468

How can I print that output while ignoring the way blocks that do not contain a <tag k="name" line?

2 answers:

Answer 0 (score: 2)

Don't parse XML/HTML with awk. Use a proper XML/HTML parser and a powerful XPath query.

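Theory:

According to compiler theory, XML/HTML cannot be parsed with regular expressions, which are based on a finite state machine. Because XML/HTML is hierarchical, you need a pushdown automaton, and you manipulate an LALR grammar with a tool such as YACC.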
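To make the pushdown idea concrete, here is a minimal Python sketch, not a real XML parser. The regular expression only splits the text into tags; the nesting check needs a stack, which is exactly the memory a finite state machine lacks. The function name is_balanced and the simplified tag pattern are assumptions made for this illustration (attribute values containing ">" are not handled).

```python
import re

# Tokenizer only: captures "/" for closing tags, the tag name, and "/" for self-closing tags.
TAG = re.compile(r"<(/?)(\w+)[^>]*?(/?)>")

def is_balanced(markup: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []  # the "pushdown" memory a finite state machine does not have
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue            # <nd .../> opens and closes at once
        if not closing:
            stack.append(name)  # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False        # closing tag does not match the last opener
    return not stack            # anything left on the stack was never closed

print(is_balanced('<way id="1"><nd ref="2"/><tag k="name" v="x"/></way>'))
print(is_balanced('<way><nd></way></nd>'))
```

The first call reports a properly nested fragment, the second one a crossed pair of tags that no single regular expression could reject in general.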

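realLife©®™ everyday tools:

You can use one of the following:

xmllint: often installed by default with libxml2, xpath1 (check my wrapper to get newline-delimited output)

xmlstarlet: can edit, select, transform... Not installed by default, xpath1

xpath: installed via Perl's module XML::XPath, xpath1

xidel: xpath3

saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library, xpath3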

Or you can use a high-level language and a proper library. I think of:

Python: lxml (from lxml import etree)

Perl: XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

check this example

PHP: DOMXpath (check this example)

See also: Using regular expressions with HTML tags
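As a rough illustration of the high-level-language route, the sketch below produces the CSV the question asks for. It makes several assumptions: it uses Python's standard-library xml.etree.ElementTree instead of lxml so that nothing has to be installed, the sample is an abridged copy of the question's data wrapped in a single <osm> root element (the fragment in the question has no single root), and the name way_rows is made up for this example.

```python
import xml.etree.ElementTree as ET

# Abridged sample, wrapped in an assumed <osm> root so it is a single document.
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""

def way_rows(xml_text):
    """Yield 'road,way_id,seq_num,node_ref_id' rows for named ways only."""
    root = ET.fromstring(xml_text)
    for way in root.iter("way"):
        name_tag = way.find("tag[@k='name']")
        if name_tag is None:
            continue  # skip ways that have no <tag k="name">
        # enumerate() restarts at 1 for every way
        for seq, nd in enumerate(way.findall("nd"), start=1):
            yield f"{name_tag.get('v')},{way.get('id')},{seq},{nd.get('ref')}"

rows = list(way_rows(SAMPLE))
print("road,way_id,seq_num,node_ref_id")
print("\n".join(rows))
```

The way without a name tag (id 151860745) produces no rows, which is the filtering the question is after.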

Example using xmlstarlet in a bash script:

File:

(before the OP replaced the XML with a broken version)

  <way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <nd ref="89798120"/>
    <nd ref="89798122"/>
    <nd ref="89798124"/>
    <nd ref="89798126"/>
    <nd ref="89798128"/>
    <nd ref="89798130"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
    <tag k="tiger:cfcc" v="A41"/>
    <tag k="tiger:county" v="Livingston, CA"/>
    <tag k="tiger:name_base" v="Engineer"/>
    <tag k="tiger:name_type" v="Rd"/>
    <tag k="tiger:reviewed" v="no"/>
    <tag k="tiger:separated" v="no"/>
    <tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
    <tag k="tiger:tlid" v="196844016"/>
  </way>

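Code:

#!/bin/bash

# Read the road name and the way id in one xmlstarlet call, separated by "|".
# This works for a file holding a single <way>, as in the sample above.
IFS='|' read -r title id < <(
    xmlstarlet sel -t -v '//tag[@k="name"]/@v' -o "|" -v '//way/@id' file
)
# Print one CSV row per <nd> reference, numbering the rows from 1.
xmlstarlet sel -t -v '//nd/@ref' file | while read -r line; do
    echo "$title,$id,$((++c)),$line"
done

Output: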

Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128

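Answer 1 (score: 0)

"Gilles Quenot" has already told you to use a proper XML/HTML parser, and he mentioned Xidel as one of them.
I have saved your XML file as "so_49592301.xml".

The legend, as a plain string, is simple: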

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"'

Next, select the <way> element nodes, but only those that have a <tag> child node with the attribute k="name":

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]'

Next, select the <nd> child nodes and string-join the index and the ref attribute, using a comma as the separator:

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/nd/join((position(),@ref),",")'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
9,89804458
10,89804460
11,89804463
12,89804464
13,89804466
14,89804468

Notice that the index does not restart at the next <way> element node. This is easily fixed by putting nd/... between parentheses:

$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((position(),@ref),","))'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
1,89804458
2,89804460
3,89804463
4,89804464
5,89804466
6,89804468

Next, include the v attribute of the <tag k="name"> child node and the id attribute of the <way> element node. However, at this point you are inside the <nd> child node, so to reach something one level up you have to add ../:

$ ./xidel -s "so_49592301.xml" -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((../tag[@k="name"]/@v,../@id,position(),@ref),","))'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
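
The numbering difference between the two earlier queries (an index that runs on across ways versus one that restarts per way) comes down to which sequence the position is counted in. The following Python sketch is only an analogy with plain lists, not Xidel's implementation; the dictionary ways and its abridged contents are assumptions made for the example.

```python
from itertools import chain

# Abridged node references per way id (illustrative data only).
ways = {
    "10459706": ["89705976", "89798118", "89798120"],
    "10461171": ["89804458", "89804460"],
}

# One sequence over all nodes: the numbering runs on, like //way/nd/join(...).
flat = [
    f"{i},{ref}"
    for i, ref in enumerate(chain.from_iterable(ways.values()), start=1)
]

# One sequence per way: the numbering restarts, like //way/(nd/join(...)).
per_way = [
    f"{i},{ref}"
    for refs in ways.values()
    for i, ref in enumerate(refs, start=1)
]

print(flat)
print(per_way)
```

In the first list the second way continues at 4; in the second list it starts again at 1, which is the effect the parentheses have in the XPath expression.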