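I have an odd problem using awk regular-expression matching to replace some text in an xml file.
The xml files are simple. Each one has a node containing a passage of text, and the awk program replaces that text with another passage taken from a text file, rtxt. For some reason, though, replacing the text in 42.xml with the text from rtxt (the line labelled '42') does not produce the correct result.
toxml.awk writes to stdout. It first prints the xml as it was read in, followed by the final result of the replacement.
I actually have a collection of these xml files, whose text I am replacing with passages from a much longer rtxt. It so happens that this particular replacement (for 42.xml) does not work. Instead of the text inside the element being replaced, another tag ends up nested inside the existing tag.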
toxml.awk
BEGIN{
    srcfile = "rtxt"
    FS = "|"
    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"
        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"
        getline <xmlfile
        #print $0
        close(xmlfile)
        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)
        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0
        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }
    close(srcfile)
}
rtxt
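42|Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it.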
42.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE topic SYSTEM "topic.dtd">
<topic id="2009042" ct_no="227">
<title>sun java</title>
<castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle>
<phrasetitle>"sun java"</phrasetitle>
<description>Find information about Sun Microsystem's Java language</description>
<narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts & versions of it. </narrative>
</topic>
Answer 0 (score: 0)
Just a start:
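#!/bin/bash
awk 'BEGIN{FS="|"}
# First file (rtxt): map each topic number to its replacement narrative.
FNR==NR{ nar[$1]=$2; next }
END{
    # The remaining arguments are the xml files; re-read each one line by line.
    for(i=2;i<ARGC;i++){
        xmlfile=ARGV[i]
        split(xmlfile,fname,".")
        print "Doing file: "xmlfile
        print "---------------------------------"
        while( (getline line < xmlfile ) > 0) {
            # Plain assignment rather than gsub(), so "&" in the new text is copied as-is.
            if ( line ~ /<narrative>/ ){
                line="<narrative>"nar[fname[1]]"</narrative>"
            }
            print line
        }
        close(xmlfile)
    }
}' rtxt 42.xml 71.xml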
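As for why the gsub() in toxml.awk nests a tag: one likely cause, assuming the replacement text for topic 42 contains a literal & (as the narrative in 42.xml does), is that awk expands & in the replacement string of sub() and gsub() to the matched text. Below is a minimal sketch of that effect and of one way to escape it. The sample strings and the variable names rep and safe are made up for illustration and are not part of the original script.

```shell
#!/bin/sh
# Unescaped: "&" in the replacement is expanded to the matched text,
# so the old element reappears inside the new one.
nested=$(printf '%s\n' '<narrative>old text</narrative>' | awk '{
    rep = "<narrative>history & versions</narrative>"
    gsub(/<narrative>.*<\/narrative>/, rep)
    print
}')
printf '%s\n' "$nested"

# Escaped: turn every "&" into "\&" before using the string as a replacement.
fixed=$(printf '%s\n' '<narrative>old text</narrative>' | awk '{
    rep = "<narrative>history & versions</narrative>"
    safe = rep
    gsub(/&/, "\\\\&", safe)
    gsub(/<narrative>.*<\/narrative>/, safe)
    print
}')
printf '%s\n' "$fixed"
```

The first command prints the new narrative with the old element embedded where the & was. The second prints the new narrative with a literal &. If that is what is happening with 42.xml, the same escaping step could be applied to rep in toxml.awk before the first gsub() call.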