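I have a list of citations in a csv file that I would like to use to fill out an XML-based query form at CrossRef.
CrossRef provides an XML template (below, with unused fields removed), and I would like to parse the columns of the csv file to fill in the repeated fields inside the query tag: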
<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
<email_address>test@crossref.org</email_address>
<doi_batch_id>test</doi_batch_id>
</head>
<body>
<query enable-multiple-hits="true"
list-components="false"
expanded-results="false" key="key">
<article_title match="fuzzy"></article_title>
<author search-all-authors="false"></author>
<volume></volume>
<year></year>
<first_page></first_page>
<journal_title></journal_title>
</query>
</body>
</query_batch>
How can this be done in a shell script?
Sample input:
author,year,article_title,journal_title,volume,first_page
Adler,2006,"Biomass yield and biofuel quality of switchgrass harvested in fall or spring","Agronomy Journal",98,1518
Alexopolou,2008,"Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region","Biomass and Bioenergy",32,926
Balasko,1984,"Yield and Quality of Switchgrass Grown without Soil Amendments.","Agronomy Journal",76,204
Desired output:
<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
<email_address>test@crossref.org</email_address>
<doi_batch_id>test</doi_batch_id>
</head>
<body>
<query>
<author>Adler</author >
<year>2006</year >
<article_title>Biomass yield and biofuel quality of switchgrass harvested in fall or spring</article_title >
<journal_title>Agronomy Journal</journal_title >
<volume>98</volume >
<first_page>1518</first_page >
</query>
<query>
<author>Alexopolou</author >
<year>2008</year >
<article_title>Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region</article_title >
<journal_title>Biomass and Bioenergy</journal_title >
<volume>32</volume >
<first_page>926</first_page >
</query>
<query>
<author>Balasko</author >
<year>1984</year >
<article_title>Yield and Quality of Switchgrass Grown without Soil Amendments.</article_title >
<journal_title>Agronomy Journal</journal_title >
<volume>76</volume >
<first_page>204</first_page >
</query>
</body>
</query_batch>
Answer 0: (score: 3)
#!/usr/bin/awk -f
# XML Attributes Must be Quoted. Attribute values must always be quoted. Either single or double quotes can be used.
BEGIN{
FS=","
print "<?xml version = '1.0' encoding='UTF-8'?>"
print "<query_batch xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='2.0' xmlns='http://www.crossref.org/qschema/2.0'"
print " xsi:schemaLocation='http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd'>"
print "<head>"
print " <email_address>test@crossref.org</email_address>"
print " <doi_batch_id>test</doi_batch_id>"
print "</head>"
print "<body>"
}
NR>1{
print " <query enable-multiple-hits='true'"
print " list-components='false'"
print " expanded-results='false' key='key'>"
print " <article_title match='fuzzy'>" $3 "</article_title>"
print " <author search-all-authors='false'>" $1 "</author>"
print " <volume>" $5 "</volume>"
print " <year>" $2 "</year>"
print " <first_page>" $6 "</first_page>"
print " <journal_title>" $4 "</journal_title>"
print " </query>"
}
END{
print "</body>"
print "</query_batch>"
}
$ awk -f script.awk input.csv
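One caveat with the script above: it pastes each CSV field into the output verbatim, so a value containing `&` or `<` would yield malformed XML. A minimal sketch of the missing escaping step follows, written in Python rather than awk; the `element` helper and the sample value are made up for illustration.

```python
from xml.sax.saxutils import escape


def element(tag: str, text: str) -> str:
    """Render <tag>text</tag> with &, < and > in the text escaped."""
    return f"<{tag}>{escape(text)}</{tag}>"


# Hypothetical field value containing characters that are special in XML.
print(element("journal_title", "Crops & Soils <draft>"))
```

The same idea could be ported to awk with a few `gsub` calls, taking care to escape `&` first so that already-escaped entities are not escaped twice.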
Answer 1: (score: 3)
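Unlike approaches that use text substitution (i.e. awk), this one guarantees that a well-formed XML document is always emitted, with content correctly escaped. It's ugly, but more correct. Note that this requires a third-party tool; nothing included with the shell can safely edit XML.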
First, put a document with an empty body element in template.xml:
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
<email_address>test@crossref.org</email_address>
<doi_batch_id>test</doi_batch_id>
</head>
<body/>
</query_batch>
Second, build an XMLStarlet command line describing the desired edits, then invoke it:
#!/bin/bash
# Accumulate XMLStarlet edit operations, one group per CSV row.
xmlstarlet_command=( )
read_header=0
# -r keeps backslashes in the input literal.
while IFS=, read -r author year article_title journal_title volume first_page; do
  # Skip the CSV header line.
  if (( read_header == 0 )); then read_header=1; continue; fi
  # Append an empty query element to body; *[last()] then refers to it.
  xmlstarlet_command+=( -s /qs:query_batch/qs:body -t elem -n query -v '' )
  # Attributes are attached with -i ... -t attr.
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n list-components -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n expanded-results -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n key -v key )
  # Child elements are added with -s (subnode); -i with -t elem would
  # insert a sibling before the query element instead.
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n article_title -v "$article_title" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/article_title' -t attr -n match -v fuzzy )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n author -v "$author" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/author' -t attr -n search-all-authors -v false )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n volume -v "$volume" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n year -v "$year" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n first_page -v "$first_page" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n journal_title -v "$journal_title" )
done <in.csv
# Apply all accumulated edits to the template in a single invocation.
xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" <template.xml
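Note that, like the other solutions given here, this does not strip the double quotes from the beginning and end of CSV fields. As with other aspects of advanced CSV parsing, that is best left to the Python csv module, which actually knows how to recognize escaped quotes, text fields containing newlines, and all the other strange things that can occur in a valid CSV file.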
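As a rough illustration of that suggestion, here is a minimal sketch that pairs the Python csv module with `xml.etree.ElementTree` from the standard library. It is not part of the XMLStarlet approach; the function name `build_query_batch`, the helper `qname`, and the in-memory sample are assumptions made for the example, while the element and attribute names are taken from the CrossRef template in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "http://www.crossref.org/qschema/2.0"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA = "http://www.crossref.org/qschema/crossref_query_input2.0.xsd"

# Serialize NS as the default namespace and XSI under the usual "xsi" prefix.
ET.register_namespace("", NS)
ET.register_namespace("xsi", XSI)


def qname(tag):
    """Qualify a tag name with the CrossRef query namespace."""
    return f"{{{NS}}}{tag}"


def build_query_batch(csv_file, email="test@crossref.org", batch_id="test"):
    """Return the query batch XML (as a string) for an open CSV file."""
    root = ET.Element(qname("query_batch"), {
        "version": "2.0",
        f"{{{XSI}}}schemaLocation": f"{NS} {SCHEMA}",
    })
    head = ET.SubElement(root, qname("head"))
    ET.SubElement(head, qname("email_address")).text = email
    ET.SubElement(head, qname("doi_batch_id")).text = batch_id
    body = ET.SubElement(root, qname("body"))

    # DictReader takes its keys from the header row and handles quoted
    # fields, embedded commas, and embedded newlines.
    for row in csv.DictReader(csv_file):
        query = ET.SubElement(body, qname("query"), {
            "enable-multiple-hits": "true",
            "list-components": "false",
            "expanded-results": "false",
            "key": "key",
        })
        title = ET.SubElement(query, qname("article_title"), {"match": "fuzzy"})
        title.text = row["article_title"]
        author = ET.SubElement(query, qname("author"), {"search-all-authors": "false"})
        author.text = row["author"]
        for field in ("volume", "year", "first_page", "journal_title"):
            ET.SubElement(query, qname(field)).text = row[field]

    # ElementTree escapes &, < and > in text when serializing.
    return ET.tostring(root, encoding="unicode")


# In-memory stand-in for in.csv, using one row from the question.
sample = io.StringIO(
    "author,year,article_title,journal_title,volume,first_page\n"
    'Balasko,1984,"Yield and Quality of Switchgrass Grown without Soil Amendments.",'
    '"Agronomy Journal",76,204\n'
)
print(build_query_batch(sample))
```

For a real file, something like `with open("in.csv", newline="") as f: print(build_query_batch(f))` should work. Note that `ET.tostring` with `encoding="unicode"` does not emit the `<?xml ...?>` declaration, so it would need to be written separately if CrossRef requires it.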
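By the way, note that older versions of XMLStarlet have a limit on the number of operations per invocation, which has been fixed in the latest version. I have a workaround (which also allows edit lists longer than the maximum command-line length of roughly 32K), but that should probably be its own question.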