I have an XML file:
<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt"> <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">
<Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/> <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/> <Variant WRITTENFORM="unknown" type="orthographic"/> <Variant WRITTENFORM="FMN-binding" type="orthographic"/> <Variant WRITTENFORM="Pden_2486" type="orthographic"/> <Variant WRITTENFORM="nirI" type="orthographic"/> <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/> <PosDC posName="POS" pos="N"/> <DC att="uniprot_ac" val="Q51699"/> <DC att="speciesNameNCBI" val="318586"/>
</Entry> </Cluster>
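I need to import this content into PostgreSQL. Please help me with either a program that loads it directly or a way to convert the XML to CSV for PostgreSQL.
I need a table with columns like clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val).
Thanks in advance.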
Answer 0 (score: 0)
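First, parse the XML file to produce a file that contains the information you need.
For example, if you want a single table with the attributes clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val), then you need just one file containing those attributes, separated by some delimiter character. Each line in the file corresponds to one row in the table.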
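As a minimal sketch of the "one line per row, fields separated by a delimiter" idea, the helper below joins fields with tabs and escapes the characters that PostgreSQL's COPY text format treats specially (backslash, tab, newline, carriage return), writing null as \N. The class and method names (CopyLine, escape, toLine) are hypothetical and only for illustration.

```java
public class CopyLine {

    /** Escapes one field for PostgreSQL's COPY text format; null becomes \N. */
    public static String escape(String field) {
        if (field == null) {
            return "\\N";
        }
        StringBuilder sb = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash is the escape character
                case '\t': sb.append("\\t"); break;  // tab would otherwise end the field
                case '\n': sb.append("\\n"); break;  // newline would otherwise end the row
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Joins the escaped fields with tabs; the result is one line of the COPY input file. */
    public static String toLine(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append('\t');
            }
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from the sample XML in the question.
        System.out.println(toLine("UNIPR_NIRI_PARDP", "UNIPR_NIRI_PARDP_1", "geneProt",
                "Protein nirI", "nirI", "orthographic", "uniprot_ac", "Q51699"));
    }
}
```

Escaping matters mainly when a field can itself contain a tab or a backslash; the sample data does not, but free-text names in a larger file might.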
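Next, create the table schema in PostgreSQL. Then use PostgreSQL's COPY command, which copies all the data from the file into the table.
Note that if your XML file is large, you should use an event-based parser, such as SAX or StAX in Java.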
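To illustrate what "event-based" means here, the following sketch uses the StAX API that ships with the JDK (javax.xml.stream), without any extra libraries. It pulls events one at a time and keeps only the WRITTENFORM attribute of each Variant element, so the whole document is never held in memory. The class name StaxSketch and the method writtenForms are hypothetical; the parser reads from a string only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {

    /** Returns the WRITTENFORM attribute of every Variant element, in document order. */
    public static List<String> writtenForms(String xml) {
        List<String> forms = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "Variant".equals(reader.getLocalName())) {
                        // Look the attribute up by name rather than by position.
                        forms.add(reader.getAttributeValue(null, "WRITTENFORM"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return forms;
    }

    public static void main(String[] args) {
        String xml = "<Cluster clsId=\"UNIPR_NIRI_PARDP\" semType=\"geneProt\">"
                + "<Entry entryId=\"UNIPR_NIRI_PARDP_1\" baseForm=\"Protein nirI\" type=\"PREFERRED\">"
                + "<Variant WRITTENFORM=\"Pden_2486\" type=\"orthographic\"/>"
                + "<Variant WRITTENFORM=\"nirI\" type=\"orthographic\"/>"
                + "</Entry></Cluster>";
        System.out.println(writtenForms(xml));
    }
}
```

For a real file, the StringReader would be replaced by a reader over the file; the event loop stays the same.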
Edit. Note: libraries used: stax2-api-3.1.1.jar and woodstox-core-asl-4.1.1.jar. Here is the code (I hope it meets your needs; if not, I am sure it will help you get started):
package test;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import org.codehaus.stax2.XMLInputFactory2;
import org.codehaus.stax2.XMLStreamReader2;

public class Main {

    /**
     * Streams the XML file and writes one tab-separated line per Cluster.
     *
     * @param args the command line arguments; args[0] is the input XML file
     */
    public static void main(String[] args) throws IOException, XMLStreamException {
        FileInputStream fstream = new FileInputStream(args[0]);
        Reader in = new InputStreamReader(fstream, "UTF-8");
        XMLInputFactory2 factory = (XMLInputFactory2) XMLInputFactory.newInstance();
        XMLStreamReader2 parser = (XMLStreamReader2) factory.createXMLStreamReader(in);
        FileOutputStream outStream = new FileOutputStream("/home/aseke/Desktop/out.txt");
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(outStream, "UTF-8"));

        boolean isCluster = false;
        // Attribute pairs collected for the current Cluster.
        List<String[]> variants = new ArrayList<String[]>(); // {WRITTENFORM, type}
        List<String[]> dc = new ArrayList<String[]>();       // {att, val}
        String clsID = null;
        String semType = null;
        String entryID = null;
        String baseForm = null;

        while (true) {
            int event = parser.next();
            if (event == XMLStreamConstants.END_DOCUMENT) {
                parser.close();
                break;
            }
            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = parser.getLocalName();
                // Read attributes by name so the result does not depend on attribute order.
                if (tag.equals("Cluster")) {
                    isCluster = true;
                    clsID = parser.getAttributeValue(null, "clsId");
                    semType = parser.getAttributeValue(null, "semType");
                } else if (tag.equals("Entry") && isCluster) {
                    entryID = parser.getAttributeValue(null, "entryId");
                    baseForm = parser.getAttributeValue(null, "baseForm");
                } else if (tag.equals("Variant") && isCluster) {
                    variants.add(new String[] {
                        parser.getAttributeValue(null, "WRITTENFORM"),
                        parser.getAttributeValue(null, "type")});
                } else if (tag.equals("DC") && isCluster) {
                    dc.add(new String[] {
                        parser.getAttributeValue(null, "att"),
                        parser.getAttributeValue(null, "val")});
                }
            }
            if (event == XMLStreamConstants.END_ELEMENT && isCluster
                    && parser.getLocalName().equals("Cluster")) {
                isCluster = false;
                // clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val)
                // Use tabs as the delimiter for PostgreSQL COPY.
                StringBuilder line = new StringBuilder();
                line.append(clsID).append('\t').append(entryID).append('\t')
                    .append(semType).append('\t').append(baseForm);
                /* Add all variants */
                for (String[] variant : variants) {
                    line.append('\t').append(variant[0]).append('\t').append(variant[1]);
                }
                /* Add all DCs */
                for (String[] pair : dc) {
                    line.append('\t').append(pair[0]).append('\t').append(pair[1]);
                }
                // One Cluster per line, with no trailing tab.
                out.write(line.toString());
                out.write('\n');
                variants.clear();
                dc.clear();
            }
        }
        // Close all streams; closing the writer also flushes and closes outStream.
        in.close();
        out.close();
    }
}
I formatted your input XML, so the input file looks like this:
<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt">
  <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">
    <Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/>
    <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/>
    <Variant WRITTENFORM="unknown" type="orthographic"/>
    <Variant WRITTENFORM="FMN-binding" type="orthographic"/>
    <Variant WRITTENFORM="Pden_2486" type="orthographic"/>
    <Variant WRITTENFORM="nirI" type="orthographic"/>
    <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/>
    <PosDC posName="POS" pos="N"/>
    <DC att="uniprot_ac" val="Q51699"/>
    <DC att="speciesNameNCBI" val="318586"/>
  </Entry>
</Cluster>
The output looks like this (each \t below stands for a single tab character). Note that the fields are tab-separated; the tab is used later as the delimiter in PostgreSQL's COPY command. You can change the delimiter to any other character.
UNIPR_NIRI_PARDP\tUNIPR_NIRI_PARDP_1\tgeneProt\tProtein nirI\tFMN-binding domain protein\torthographic\tFMN-binding domain-containing protein\torthographic\tunknown\torthographic\tFMN-binding\torthographic\tPden_2486\torthographic\tnirI\torthographic\tuniprot_ac\tQ51699\tspeciesNameNCBI\t318586
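COPY expects every line to carry the same number of fields as the target table has columns, whereas the line above grows with the number of Variant and DC elements. One possible way to fit the eight-column table from the question is to emit one row per (Variant, DC) pair. The sketch below shows that expansion; the layout is an assumption rather than something the question specifies, and the names ClusterRows and toRows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClusterRows {

    /**
     * Expands one cluster into eight-field rows, one per (variant, DC) pair.
     * Each variant is {writtenForm, type}; each DC is {att, val}.
     * A cluster with no variants or no DCs produces no rows.
     */
    public static List<String> toRows(String clsId, String entryId, String semType,
                                      String baseForm, List<String[]> variants,
                                      List<String[]> dcs) {
        List<String> rows = new ArrayList<>();
        for (String[] variant : variants) {
            for (String[] dc : dcs) {
                rows.add(String.join("\t", clsId, entryId, semType, baseForm,
                        variant[0], variant[1], dc[0], dc[1]));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // A subset of the sample data from the question.
        List<String[]> variants = Arrays.asList(
                new String[] {"Pden_2486", "orthographic"},
                new String[] {"nirI", "orthographic"});
        List<String[]> dcs = Arrays.asList(
                new String[] {"uniprot_ac", "Q51699"},
                new String[] {"speciesNameNCBI", "318586"});
        for (String row : toRows("UNIPR_NIRI_PARDP", "UNIPR_NIRI_PARDP_1", "geneProt",
                "Protein nirI", variants, dcs)) {
            System.out.println(row);
        }
    }
}
```

This repeats the cluster and entry fields on every row. Splitting variants and DCs into their own tables keyed by entryid would avoid the repetition, at the cost of more than one COPY file.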
Answer 1 (score: 0)
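I solved this in Ruby with the help of Nokogiri and open-uri. My input file was very large and many parsers failed on it; Nokogiri handled it.
My answer produces three columns: baseForm, variant(writtenform), and dc(val). That should be enough to show the approach for this question.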
require 'nokogiri'
require 'open-uri'

# Parse the input file; the block form closes the file handle afterwards.
doc = File.open("xai") { |file| Nokogiri::XML(file) }

doc.xpath("//Entry").each do |entry|
  # Reset per entry so an entry without uniprot_ac does not reuse the previous value.
  uniprot_ac = ""
  entry.xpath("DC").each do |dc|
    uniprot_ac = dc["val"].to_s if dc["att"].to_s =~ /uniprot_ac/
  end
  # One tab-separated line per variant: baseForm, WRITTENFORM, uniprot_ac.
  entry.xpath("Variant").each do |variant|
    puts "#{entry["baseForm"]}\t#{variant["WRITTENFORM"]}\t#{uniprot_ac}"
  end
end