Find a line in a file, parse the content between > and <, then add it three lines before or after

Time: 2017-07-10 01:32:39

Tags: bash awk sed gawk

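I need to edit a KML file containing thousands of sections like the one below. I can wrap my head around the logic, but the actual implementation is beyond me.

Programmatically, I need to:

  1. Find the line containing Sub_Name
  2. Parse the content of that line between > and <
  3. Add that content 4 lines before the line where it was found (or to a file)
  4. Rinse and repeat

I feel I should be able to do this with a bash script and some moderately thorough sed and awk commands, but I get lost once I start nesting everything.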

      <Placemark>
     <name>THIS LINE NEEDS TO BE ADDED FROM THE Sub_Name LINE</name>
        <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
        <ExtendedData><SchemaData schemaUrl="#gmaps">
                <SimpleData name="EntID">1274433</SimpleData>
                <SimpleData name="Sub_Name">HYDE PARK</SimpleData>
                <SimpleData name="ORIG_FID">39</SimpleData>
                <SimpleData name="Scode">S5435</SimpleData>
                <SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
                <SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
        </SchemaData></ExtendedData>
      <MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
    

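This is very similar to this question, but I have been picking it apart for an hour and cannot make it fit my case.

Thanks for any advice and guidance you can offer.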

2 answers:

Answer 0 (score: 3)

The simple approach is just two passes:

$ cat tst.awk
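# First pass (NR==FNR): for each Sub_Name line, record a <name> line keyed by the line 4 above it.
NR==FNR {
    if ( /Sub_Name/ ) {
        gsub(/[[:space:]]*<[^<>]+>/,"")    # strip tags and leading whitespace, leaving only the text
        names[NR-4] = ORS "<name>" $0 "</name>"
    }
    next
}
# Second pass: print each line, followed by its recorded <name> line if there is one.
{ print $0 names[FNR] }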

$ awk -f tst.awk file file
  <Placemark>
<name>HYDE PARK</name>
    <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
    <ExtendedData><SchemaData schemaUrl="#gmaps">
            <SimpleData name="EntID">1274433</SimpleData>
            <SimpleData name="Sub_Name">HYDE PARK</SimpleData>
            <SimpleData name="ORIG_FID">39</SimpleData>
            <SimpleData name="Scode">S5435</SimpleData>
            <SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
            <SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
    </SchemaData></ExtendedData>
  <MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>

The above was run against this input file:

$ cat file
  <Placemark>
    <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
    <ExtendedData><SchemaData schemaUrl="#gmaps">
            <SimpleData name="EntID">1274433</SimpleData>
            <SimpleData name="Sub_Name">HYDE PARK</SimpleData>
            <SimpleData name="ORIG_FID">39</SimpleData>
            <SimpleData name="Scode">S5435</SimpleData>
            <SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
            <SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
    </SchemaData></ExtendedData>
  <MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>

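The slightly harder approach is to keep a rolling buffer of 4 lines and always print the line read 4 lines earlier, but that is only necessary if your input is coming from a pipe, or if your file is so large that you can't afford the time to parse it twice or the memory to store all of the "name" lines in an array.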
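The rolling-buffer idea described above can be sketched roughly as follows. This is a minimal illustration written in Python rather than awk, not the answerer's own code; the names `add_names` and `HELD_BACK` are hypothetical, and it assumes each Sub_Name value sits on a single line between the first `>` and the next `<`.

```python
import re
from collections import deque

HELD_BACK = 3  # lines kept unprinted; with the current line this forms a 4-line window


def add_names(lines):
    """Yield the input lines, adding a <name> line 4 lines above each Sub_Name line."""
    pending = deque()  # most recent lines that have not been emitted yet
    for line in lines:
        if "Sub_Name" in line:
            # Text between the first '>' and the next '<' on the line.
            match = re.search(r">([^<>]*)<", line)
            if match:
                yield "<name>" + match.group(1) + "</name>\n"
            # The held-back lines follow the new <name> line.
            while pending:
                yield pending.popleft()
        elif len(pending) == HELD_BACK:
            # Window is full: the oldest line can no longer be affected.
            yield pending.popleft()
        pending.append(line)
    # End of input: emit whatever is still held back.
    yield from pending
```

Because it reads the input only once and holds at most three lines, it can sit in a pipeline, for example via `sys.stdout.writelines(add_names(sys.stdin))`. One difference from the two-pass script: if a Sub_Name line appears within the first four lines of the input, this sketch puts the `<name>` line at the very top instead of dropping it.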

The usual caveats about the dangers of trying to parse HTML without an HTML parser apply...

Answer 1 (score: 0)

Given:

$ cat xml_file
 <Placemark>
    <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
    <ExtendedData><SchemaData schemaUrl="#gmaps">
            <SimpleData name="EntID">1274433</SimpleData>
            <SimpleData name="Sub_Name">HYDE PARK</SimpleData>
            <SimpleData name="ORIG_FID">39</SimpleData>
            <SimpleData name="Scode">S5435</SimpleData>
            <SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
            <SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
    </SchemaData></ExtendedData>
  <MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
   </Placemark>

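If you want to parse that XML and use XPath to find the value of a nested child node and add another node, you can do something along these lines (in Ruby, for example):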

$ ruby -r nokogiri -e 'doc=Nokogiri::XML($<.read) # {|opt| opt.strict.noblanks }   
    t1=doc.at_css "Placemark"
    t2 = Nokogiri::XML::Node.new "name", doc
    t2.parent=t1
    t2.content=doc.xpath("//SimpleData[@name=\"Sub_Name\"]").text
    puts doc
' xml_file

Prints:

<?xml version="1.0"?>
<Placemark>
    <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
    <ExtendedData><SchemaData schemaUrl="#gmaps">
            <SimpleData name="EntID">1274433</SimpleData>
            <SimpleData name="Sub_Name">HYDE PARK</SimpleData>
            <SimpleData name="ORIG_FID">39</SimpleData>
            <SimpleData name="Scode">S5435</SimpleData>
            <SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
            <SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
    </SchemaData></ExtendedData>
  <MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
   <name>HYDE PARK</name></Placemark>

(Note that the inserted node <name>HYDE PARK</name> is placed at the end of the <Placemark> node, because the schema does not specify the XML order.)

Any other scripting language with an XML parser would work similarly (Ruby, Python, Perl, jq, etc.).
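As one illustration of the parser-based route in another language, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The function name `add_name_elements` is hypothetical. It assumes the elements carry no XML namespace, as in the sample above; a real KML file usually declares one, in which case the tag names would need the namespace prefix. Unlike the Ruby example, it inserts `<name>` as the first child of each `<Placemark>`, in case a consumer expects it there.

```python
import xml.etree.ElementTree as ET


def add_name_elements(xml_text):
    """Return the XML with a <name> child added at the top of each Placemark."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root itself, so a lone <Placemark> document works too.
    for placemark in root.iter("Placemark"):
        sub = placemark.find(".//SimpleData[@name='Sub_Name']")
        if sub is None:
            continue  # no Sub_Name in this Placemark: leave it untouched
        name = ET.Element("name")
        name.text = sub.text
        name.tail = placemark.text  # reuse the existing indentation, if any
        placemark.insert(0, name)   # first child, ahead of <Style> and the rest
    return ET.tostring(root, encoding="unicode")
```

Calling `add_name_elements(open("xml_file").read())` on the sample file would return the document as a string with `<name>HYDE PARK</name>` directly under `<Placemark>`.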