I have a text file containing data in the following format. Here is a sample of its contents. The file is valid and well-formed:
<node id="1647008557" lat="36.6536840" lon="-121.7938995" version="1" timestamp="2012-02-25T14:03:54Z" changeset="10787766" uid="294728" user="skew-t">
<tag k="highway" v="turning_circle"/>
</node>
<way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
<nd ref="89705976"/>
<nd ref="89798118"/>
<nd ref="89798120"/>
<nd ref="89798122"/>
<nd ref="89798124"/>
<nd ref="89798126"/>
<nd ref="89798128"/>
<nd ref="89798130"/>
<tag k="highway" v="residential"/>
<tag k="name" v="Engineer Road"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Livingston, CA"/>
<tag k="tiger:name_base" v="Engineer"/>
<tag k="tiger:name_type" v="Rd"/>
<tag k="tiger:reviewed" v="no"/>
<tag k="tiger:separated" v="no"/>
<tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
<tag k="tiger:tlid" v="196844016"/>
</way>
<way id="10461171" version="3" timestamp="2014-01-07T00:17:59Z" changeset="19855176" uid="1871178" user="RBoggs">
<nd ref="89804458"/>
<nd ref="89804460"/>
<nd ref="89804463"/>
<nd ref="89804464"/>
<nd ref="89804466"/>
<nd ref="89804468"/>
<tag k="access" v="no"/>
<tag k="highway" v="residential"/>
<tag k="motor_vehicle" v="no"/>
<tag k="name" v="5th Cutoff Street"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Marysville, CA"/>
<tag k="tiger:name_base" v="5th Cutoff"/>
<tag k="tiger:name_type" v="St"/>
<tag k="tiger:reviewed" v="no"/>
</way>
<way id="151860745" version="1" timestamp="2012-02-25T14:03:59Z" changeset="10787766" uid="294728" user="skew-t">
<nd ref="1647008614"/>
<nd ref="1647008545"/>
<nd ref="1647008605"/>
<nd ref="1647008555"/>
<nd ref="1647008557"/>
<tag k="highway" v="service"/>
</way>
I am trying to print the name in each way section along with the way id itself, the sequence number of each nd ref within that way, and the nd ref id.
The correct output looks like this:
$ awk -f table.awk file.txt | head
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
How can I print that output while ignoring the way elements that do not contain a <tag k="name" line?
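Answer 0 (score: 2)
Don't parse XML/HTML with awk; use a proper XML/HTML parser and a powerful XPath query.
According to compiler theory, XML/HTML cannot be parsed with regular expressions, which are based on finite state machines. Because XML/HTML is hierarchical, you need a pushdown automaton, for example an LALR grammar handled with a tool such as YACC.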
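To illustrate why nesting calls for a stack, here is a minimal Python sketch. The function name is_well_nested and the (kind, name) token format are assumptions made for this illustration and do not come from any library. Checking that tags close in the right order needs memory that grows with the nesting depth, which is what a pushdown automaton has and a finite state machine lacks.

```python
def is_well_nested(tokens):
    """Return True if the tokens form properly nested open/close pairs.

    tokens is an iterable of ("open", name) or ("close", name) tuples;
    this token format is a simplification chosen for illustration.
    """
    stack = []  # names of the elements that are currently open
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    return not stack  # every opened element must have been closed


good = [("open", "way"), ("open", "nd"), ("close", "nd"), ("close", "way")]
bad = [("open", "way"), ("open", "nd"), ("close", "way"), ("close", "nd")]
print(is_well_nested(good), is_well_nested(bad))
```

A real XML parser does far more (attributes, entities, namespaces), but the stack is the part that a plain regular expression cannot emulate for arbitrary depth.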
You can use one of the following:
xmllint: often installed by default with libxml2, XPath 1 (see my wrapper for newline-delimited output)
xmlstarlet: can edit, select, transform, and more; not installed by default, XPath 1
xpath: installed via Perl's module XML::XPath, XPath 1
xidel: XPath 3
saxon-lint: my own project, a wrapper around @Michael Kay's Saxon-HE Java library, XPath 3
Python's lxml (from lxml import etree)
Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Ruby's nokogiri (see this example)
PHP's DOMXpath (see this example)
See also: Using regular expressions with HTML tags
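As a rough idea of what the parser route looks like, here is a minimal Python sketch. It uses the standard-library xml.etree.ElementTree rather than lxml (a substitution made so the example needs no extra install), and it wraps a shortened copy of the sample in an <osm> root element, which is an assumption because the excerpt in the question shows several top-level elements.

```python
import xml.etree.ElementTree as ET

# Shortened sample wrapped in a single root element (assumed here; a parser
# rejects a document with several top-level elements).
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE)

# ElementTree supports a subset of XPath; this predicate matches only the
# <tag> child whose k attribute equals "name".
named = []
for way in root.iter("way"):
    name_tag = way.find("tag[@k='name']")
    if name_tag is not None:
        named.append((way.get("id"), name_tag.get("v")))

print(named)
```

The way without a name tag is skipped because find returns None for it, which is the filtering the question asks for.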
Given this input (from before the OP replaced the XML with a broken version):
<way id="10459706" version="2" timestamp="2010-03-27T18:21:32Z" changeset="4247030" uid="20587" user="balrog-kun">
<nd ref="89705976"/>
<nd ref="89798118"/>
<nd ref="89798120"/>
<nd ref="89798122"/>
<nd ref="89798124"/>
<nd ref="89798126"/>
<nd ref="89798128"/>
<nd ref="89798130"/>
<tag k="highway" v="residential"/>
<tag k="name" v="Engineer Road"/>
<tag k="tiger:cfcc" v="A41"/>
<tag k="tiger:county" v="Livingston, CA"/>
<tag k="tiger:name_base" v="Engineer"/>
<tag k="tiger:name_type" v="Rd"/>
<tag k="tiger:reviewed" v="no"/>
<tag k="tiger:separated" v="no"/>
<tag k="tiger:source" v="tiger_import_dch_v0.6_20070809"/>
<tag k="tiger:tlid" v="196844016"/>
</way>
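#!/bin/bash
# Read the road name and the way id, separated by "|".
# This assumes the file holds a single <way> element.
IFS='|' read -r title id < <(
    xmlstarlet sel -t -v '//tag[@k="name"]/@v' -o "|" -v '//way/@id' file
)
# Print one CSV row per <nd>, numbering the refs from 1.
# The "|| [[ -n $line ]]" part also handles a last line without a trailing newline.
xmlstarlet sel -t -v '//nd/@ref' file | while read -r line || [[ -n $line ]]; do
    echo "$title,$id,$((++c)),$line"
done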
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
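The script above reads one name and one id, so it fits a file with a single way. For a file with several ways, the same idea can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree, the helper name way_rows is made up for this example, and the shortened sample is wrapped in an assumed <osm> root element.

```python
import csv
import io
import xml.etree.ElementTree as ET


def way_rows(root):
    """Yield (road, way_id, seq_num, node_ref_id) for every named <way>."""
    for way in root.iter("way"):
        name_tag = way.find("tag[@k='name']")
        if name_tag is None:
            continue  # skip ways that have no name tag
        # enumerate starts again at 1 for each way
        for seq, nd in enumerate(way.findall("nd"), start=1):
            yield name_tag.get("v"), way.get("id"), seq, nd.get("ref")


# Shortened sample; the <osm> root element is an assumption.
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="10461171">
    <nd ref="89804458"/>
    <tag k="name" v="5th Cutoff Street"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["road", "way_id", "seq_num", "node_ref_id"])
writer.writerows(way_rows(root))
print(out.getvalue(), end="")
```

Using csv.writer instead of joining with commas means a road name that itself contains a comma would be quoted rather than silently splitting into two columns.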
Answer 1 (score: 0)
"Gilles Quenot" has already told you to use a proper XML/HTML parser, and he mentioned Xidel as one of them.
I have saved your XML file as "so_49592301.xml".
The header line is simply a string:
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"'
Next, select the <way> element nodes, but only those that have a <tag> child node with the attribute k="name":
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]'
Next, select the <nd> child nodes and string-join the index and the ref attribute, using a comma as the separator:
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/nd/join((position(),@ref),",")'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
9,89804458
10,89804460
11,89804463
12,89804464
13,89804466
14,89804468
Notice that the index does not restart at the next <way> element node? This is easily fixed by putting nd/... between parentheses:
$ ./xidel -s so_49592301.xml -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((position(),@ref),","))'
road,way_id,seq_num,node_ref_id
1,89705976
2,89798118
3,89798120
4,89798122
5,89798124
6,89798126
7,89798128
8,89798130
1,89804458
2,89804460
3,89804463
4,89804464
5,89804466
6,89804468
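The difference between a running index and one that restarts per parent can be sketched in Python as follows. This is an analogy built on the standard-library xml.etree.ElementTree and not a description of how xidel evaluates the query; the <osm> root element and the shortened sample are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <osm> root element is an assumption.
SAMPLE = """<osm>
  <way id="10459706"><nd ref="89705976"/><nd ref="89798118"/></way>
  <way id="10461171"><nd ref="89804458"/><nd ref="89804460"/></way>
</osm>"""
root = ET.fromstring(SAMPLE)

# One counter over all <nd> elements: the numbering keeps running across ways.
running = [(i, nd.get("ref")) for i, nd in enumerate(root.iter("nd"), start=1)]

# A fresh counter per <way>: the numbering restarts at 1 for each parent.
per_way = [
    (i, nd.get("ref"))
    for way in root.iter("way")
    for i, nd in enumerate(way.findall("nd"), start=1)
]

print(running)
print(per_way)
```

The second form corresponds to evaluating the nd step separately for each way, which is the effect the parentheses have in the query above.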
Next, include the v attribute of the <tag k="name"> child node and the id attribute of the <way> element node. However, you are inside the <nd> child node, so to reach something one level up you have to add ../:
$ ./xidel -s "so_49592301.xml" -e '"road,way_id,seq_num,node_ref_id"' -e '//way[tag[@k="name"]]/(nd/join((../tag[@k="name"]/@v,../@id,position(),@ref),","))'
road,way_id,seq_num,node_ref_id
Engineer Road,10459706,1,89705976
Engineer Road,10459706,2,89798118
Engineer Road,10459706,3,89798120
Engineer Road,10459706,4,89798122
Engineer Road,10459706,5,89798124
Engineer Road,10459706,6,89798126
Engineer Road,10459706,7,89798128
Engineer Road,10459706,8,89798130
5th Cutoff Street,10461171,1,89804458
5th Cutoff Street,10461171,2,89804460
5th Cutoff Street,10461171,3,89804463
5th Cutoff Street,10461171,4,89804464
5th Cutoff Street,10461171,5,89804466
5th Cutoff Street,10461171,6,89804468
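For comparison, stepping up to the parent can be sketched in Python as well. The standard-library xml.etree.ElementTree does not store parent links on elements, so this sketch builds a child-to-parent map first; the name parent_of, the <osm> root element, and the shortened sample are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <osm> root element is an assumption.
SAMPLE = """<osm>
  <way id="10459706">
    <nd ref="89705976"/>
    <nd ref="89798118"/>
    <tag k="name" v="Engineer Road"/>
  </way>
  <way id="151860745">
    <nd ref="1647008614"/>
    <tag k="highway" v="service"/>
  </way>
</osm>"""
root = ET.fromstring(SAMPLE)

# Map every element to its parent, since Element objects have no parent link.
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
for nd in root.iter("nd"):
    way = parent_of[nd]                     # plays the role of ../
    name_tag = way.find("tag[@k='name']")   # plays the role of ../tag[@k="name"]
    if name_tag is None:
        continue  # the parent way has no name tag
    seq = way.findall("nd").index(nd) + 1   # 1-based position among sibling <nd>
    rows.append(",".join((name_tag.get("v"), way.get("id"), str(seq), nd.get("ref"))))

print("\n".join(rows))
```

Looping over the ways and then over their nd children avoids the parent map altogether; the map is shown here only because it mirrors the ../ step most directly.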