What is wrong with this awk regex substitution?

Asked: 2010-07-17 05:52:10

Tags: regex awk

I have a peculiar problem using an awk regex match to replace some text in an XML file.

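The XML files are simple. Each has a narrative node containing a passage of text, and the awk program replaces that text with another passage taken from the text file rtxt. But for some reason, replacing the text in 42.xml with the text from rtxt (the line labelled '42') does not produce the correct substitution.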

toxml.awk writes to stdout. It first prints the XML as it was read, and then the result of the final substitution.

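I actually have a collection of these XML files, whose text I replace with text from a longer rtxt. It just so happens that this particular substitution (for 42.xml) does not work. Instead of the text in the narrative element being replaced, another narrative tag ends up nested inside the existing one.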


toxml.awk

BEGIN{
    srcfile = "rtxt"
    FS = "|"

    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"

        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"

        getline <xmlfile
        #print $0
        close(xmlfile)

        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)

        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0

        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }

    close(srcfile)
}
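
The script above reads each XML file in one go by setting RS to a multi-character string, and it does not check the return value of getline. POSIX leaves multi-character RS unspecified (gawk and mawk accept it), and an unchecked getline loops forever when the file cannot be opened, because it returns -1. A minimal portable sketch of the same "read the whole file" step follows; the function name slurp is made up for illustration and is not part of the original script.

```shell
#!/bin/sh
# slurp FILE: read FILE into a single awk string, then print it unchanged.
# Hypothetical helper for illustration only.
slurp() {
    awk 'BEGIN {
        f = ARGV[1]
        text = ""
        # getline returns 1 for each line read, 0 at end of file and
        # -1 on error, so compare against 0 to stop in both cases.
        while ((getline line < f) > 0)
            text = text line "\n"
        close(f)
        printf "%s", text
    }' "$1"
}
```

With the whole file in one variable, a substitution can be applied to it once, and the closing tag does not need to be re-appended.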

rtxt

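42 |Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on Java and the history of Java. On different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts. Versions of it.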


42.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">

  <title>sun java</title>

  <castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>

  <phrasetitle>"sun java"</phrasetitle>

  <description>Find information about Sun Microsystem's Java language</description>

  <narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it.    To be relevant, a result should give information on history of Java &amp; on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts &amp; versions of it.  </narrative>

</topic>
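
One possible explanation for the nested tag, offered as a sketch rather than a confirmed diagnosis: in awk, an unescaped & in the replacement argument of sub() or gsub() stands for the matched text. If the replacement taken from rtxt contains a literal & (42.xml contains &amp;, so this seems plausible, but it is an assumption), the old narrative element gets pasted into the middle of the new one. The function names naive and escaped and the sample strings below are made up for illustration.

```shell
#!/bin/sh
# Both functions replace a made-up narrative element with the text
# given as the first argument. The text is passed through the
# environment so that awk does not interpret any escapes in it.

# naive REPLACEMENT: passes the replacement straight to gsub(),
# so every "&" in it expands to the matched text.
naive() {
    REP="$1" awk 'BEGIN {
        rep = ENVIRON["REP"]
        s = "<narrative>old</narrative>"
        gsub(/<narrative>.*<\/narrative>/, rep, s)
        print s
    }'
}

# escaped REPLACEMENT: turns each "&" into "\&" first, which gsub()
# then treats as a literal ampersand.
escaped() {
    REP="$1" awk 'BEGIN {
        rep = ENVIRON["REP"]
        gsub(/&/, "\\\\&", rep)
        s = "<narrative>old</narrative>"
        gsub(/<narrative>.*<\/narrative>/, rep, s)
        print s
    }'
}
```

Called with '<narrative>a &amp; b</narrative>', naive prints the new element with the old one embedded where the & was, while escaped prints the replacement unchanged. Backslashes in the replacement would need similar care, which this sketch does not handle.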

1 answer:

Answer 0: (score: 0)

Just a start:

#!/bin/bash

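awk 'BEGIN{FS="|"}
# First file (rtxt): map each id to its replacement text.
FNR==NR{  nar[$1]=$2; next }
END{
  # The remaining arguments are the xml files to rewrite.
  for(i=2;i<ARGC;i++){
     xmlfile=ARGV[i]
     split(xmlfile,fname,".")   # fname[1] is the id, e.g. "42"
     print "Doing file: "xmlfile
     print "---------------------------------"
     while( (getline line < xmlfile ) > 0)  {
         # Replace the whole line that holds the narrative element.
         if ( line ~ /<narrative>/ ){
            line="<narrative>"nar[fname[1]]"</narrative>"
         }
         print line
     }
     close(xmlfile)
  }
}' rtxt 42.xml 71.xml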
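
The script above builds the new line by plain string concatenation, so nothing in the replacement text is interpreted. The same idea can be applied to a whole record without going line by line: locate the element with match() and splice the new text in with substr(). This is a minimal sketch, not the answerer's method; the function name splice_narrative and the sample strings in the usage note are made up for illustration.

```shell
#!/bin/sh
# splice_narrative TEXT REPLACEMENT: replace the first
# <narrative>...</narrative> span in TEXT with REPLACEMENT, copied
# literally. Hypothetical helper for illustration only.
splice_narrative() {
    TEXT="$1" REP="$2" awk 'BEGIN {
        text = ENVIRON["TEXT"]
        rep  = ENVIRON["REP"]
        # match() sets RSTART and RLENGTH to the position and length
        # of the matched span; substr() keeps the text on either side.
        if (match(text, /<narrative>.*<\/narrative>/))
            text = substr(text, 1, RSTART - 1) rep substr(text, RSTART + RLENGTH)
        print text
    }'
}
```

For example, splice_narrative '<t><narrative>old</narrative></t>' '<narrative>a &amp; b</narrative>' keeps the &amp; intact, because the replacement never passes through sub() or gsub().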