I have a lot of XML files, and I'd like to generate a report from them. The report should provide information such as:
root 100%
  a*1 90%
  b*1 80%
    c*5 40%
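This means that all documents have a root element, 90% have one a element in the root, 80% have one b element in the root, and 40% have 5 c elements in b.
If, for example, some documents have 4 c elements, some 5, and some 6, it should say something like: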
c*4.3 4 6 40%
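meaning that 40% of the documents have between 4 and 6 c elements there, with an average of 4.3.
I am looking for free software; if it doesn't exist, I'll write it. I was about to do so, but I thought I'd check first. I am probably not the first person who has had to analyze thousands of XML files and get an overview of their structure.
Answer 0 (score: 11)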
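Here's an XSLT 2.0 approach.
Assuming that $docs contains a sequence of document nodes that you want to scan, you want to create one line for each element name that appears in the documents. You can use <xsl:for-each-group> to do that: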
<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <xsl:value-of select="$name" />
  ...
</xsl:for-each-group>
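As a rough cross-check of what the grouping does, here is a minimal Python sketch of the same idea using the standard library's xml.etree.ElementTree. The function name group_elements_by_name is made up for illustration, and elem.tag is only an approximation of name(): for namespaced elements it holds '{uri}local' rather than a prefixed name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_elements_by_name(xml_strings):
    """Map each element name to every element carrying it, across all documents."""
    groups = defaultdict(list)
    for text in xml_strings:
        root = ET.fromstring(text)
        # iter() yields the root and all of its descendants, much like $docs//*
        for elem in root.iter():
            groups[elem.tag].append(elem)
    # Sort by name, mirroring the xsl:sort on the grouping key
    return dict(sorted(groups.items()))
```

Each key of the returned dictionary corresponds to one iteration of the for-each-group loop.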
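Then you want to work out the statistics for that element across the documents. First, find the documents that contain an element with that name:
<xsl:variable name="docs-with" as="document-node()+"
  select="$docs[//*[name() = $name]]" />
Second, you need a sequence holding the number of elements with that name in each of those documents:
<xsl:variable name="elem-counts" as="xs:integer+"
  select="$docs-with/count(//*[name() = $name])" />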
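To make the two variables concrete, the following Python sketch computes the equivalent of $elem-counts for already parsed documents. It assumes plain, non-namespaced element names; counts_per_document is an illustrative name, not part of any library.

```python
import xml.etree.ElementTree as ET


def counts_per_document(roots, name):
    """Return one count per document that contains at least one `name` element."""
    counts = []
    for root in roots:
        # iter(name) walks the root and its descendants, keeping matching tags only
        n = sum(1 for _ in root.iter(name))
        if n > 0:
            # Documents without the element are skipped, as in $docs-with
            counts.append(n)
    return counts
```

The length of the returned list plays the role of count($docs-with).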
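Now you can do the calculations. The average, minimum and maximum can be computed with the avg(), min() and max() functions. The percentage is simply the number of documents that contain the element divided by the total number of documents, formatted.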
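The arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation, not a replacement for the XSLT; format_stats and its parameters are hypothetical names, and Python's number formatting is not guaranteed to round exactly like format-number().

```python
from statistics import mean


def format_stats(name, elem_counts, total_docs):
    """Build one report line: name, average, minimum, maximum, share of documents."""
    avg = mean(elem_counts)
    # Share of documents that contain the element at least once
    percent = 100 * len(elem_counts) / total_docs
    return f"{name}* {avg:.1f} {min(elem_counts)} {max(elem_counts)} {percent:.0f}%"


# Ten of 25 documents contain c: eight have 4, one has 5, one has 6
print(format_stats("c", [4] * 8 + [5, 6], 25))  # prints "c* 4.3 4 6 40%"
```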
Putting it all together:
<xsl:for-each-group select="$docs//*" group-by="name()">
  <xsl:sort select="current-grouping-key()" />
  <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
  <!-- documents that contain at least one element with this name -->
  <xsl:variable name="docs-with" as="document-node()+"
    select="$docs[//*[name() = $name]]" />
  <!-- number of such elements in each of those documents -->
  <xsl:variable name="elem-counts" as="xs:integer+"
    select="$docs-with/count(//*[name() = $name])" />
  <xsl:value-of select="$name" />
  <xsl:text>* </xsl:text>
  <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
  <xsl:text>%</xsl:text>
  <xsl:text>&#xA;</xsl:text>
</xsl:for-each-group>
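What I haven't done here is indent the lines according to the depth of the element. I've just ordered the elements alphabetically to give you the statistics. There are two reasons for that. First, it's significantly harder (too involved to write here) to display the element statistics in a structure that reflects how the elements appear in the documents, not least because different documents may have different structures. Second, in many markup languages the precise structure of a document can't be known in advance (for example, because sections can nest within sections to any depth).
I hope it's useful nonetheless.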
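For readers who do want the indented view, one possible approach is to key the statistics by the full path from the root instead of by the bare element name. The Python sketch below is an assumption-laden illustration, not part of the XSLT above: path_report and count_paths are invented names, the counts are totals per document for each path (not per parent element), and the percentage is taken over all documents.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from statistics import mean


def count_paths(elem, parent_path, counts):
    """Recursively count how often each root-to-element path occurs."""
    path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
    counts[path] += 1
    for child in elem:
        count_paths(child, path, counts)


def path_report(xml_strings):
    """Return indented report lines, one per distinct element path."""
    per_path = defaultdict(list)  # path -> one count per document containing it
    for text in xml_strings:
        counts = Counter()
        count_paths(ET.fromstring(text), "", counts)
        for path, n in counts.items():
            per_path[path].append(n)
    lines = []
    # Sorting on the path segments keeps every child right below its parent
    for path in sorted(per_path, key=lambda p: p.split("/")):
        doc_counts = per_path[path]
        indent = "  " * path.count("/")
        name = path.rsplit("/", 1)[-1]
        pct = 100 * len(doc_counts) / len(xml_strings)
        lines.append(f"{indent}{name}* {mean(doc_counts):.1f} "
                     f"{min(doc_counts)} {max(doc_counts)} {pct:.0f}%")
    return lines
```

This sidesteps the problem of differing document structures by treating every distinct path as its own row; recursive markup will simply produce one row per nesting depth that actually occurs.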
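Update:
Need the XSLT wrapper and some instructions for running it? OK. First, get hold of Saxon 9B.
You'll need to put all the files you want to analyze in a directory. Saxon lets you access all the files in that directory (or its subdirectories) as a collection, using a special URI syntax. It's worth looking at that syntax if you want to search recursively or filter the file names you're looking at.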
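Outside Saxon, the same "directory plus file name filter" idea can be approximated with ordinary file globbing. The Python sketch below is an analogy only and says nothing about how Saxon resolves collection URIs; load_documents and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_documents(directory, pattern="*.xml", recursive=False):
    """Parse every file in `directory` whose name matches `pattern`."""
    base = Path(directory)
    # rglob descends into subdirectories, glob stays in the top-level directory
    paths = sorted(base.rglob(pattern) if recursive else base.glob(pattern))
    return [ET.parse(str(p)).getroot() for p in paths]
```

The pattern argument plays roughly the role of the select=*.xml part of the URI.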
Now the full XSLT:
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">
  <xsl:param name="dir" as="xs:string"
    select="'file:///path/to/default/directory?select=*.xml'" />
  <xsl:output method="text" />
  <xsl:variable name="docs" as="document-node()*"
    select="collection($dir)" />
  <xsl:template name="main">
    <xsl:for-each-group select="$docs//*" group-by="name()">
      <xsl:sort select="current-grouping-key()" />
      <xsl:variable name="name" as="xs:string" select="current-grouping-key()" />
      <!-- documents that contain at least one element with this name -->
      <xsl:variable name="docs-with" as="document-node()+"
        select="$docs[//*[name() = $name]]" />
      <!-- number of such elements in each of those documents -->
      <xsl:variable name="elem-counts" as="xs:integer+"
        select="$docs-with/count(//*[name() = $name])" />
      <xsl:value-of select="$name" />
      <xsl:text>* </xsl:text>
      <xsl:value-of select="format-number(avg($elem-counts), '#,##0.0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(min($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number(max($elem-counts), '#,##0')" />
      <xsl:text> </xsl:text>
      <xsl:value-of select="format-number((count($docs-with) div count($docs)) * 100, '#0')" />
      <xsl:text>%</xsl:text>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
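To run it, you'd do something like:
> java -jar path/to/saxon.jar -xsl:path/to/stylesheet.xsl -it:main -o:report.txt "dir=file:///path/to/your/directory?select=*.xml"
This tells Saxon to load the stylesheet (path/to/stylesheet.xsl is a placeholder for wherever you saved the XSLT above), to start the process with the template named main, to set the dir parameter to file:///path/to/your/directory?select=*.xml, and to send the output to report.txt. The quotes keep the shell from interpreting the ? and * characters.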
Answer 1 (score: 3)
Answer 2 (score: 1)
Beautiful Soup makes parsing XML in Python trivial.
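As a small illustration of how little code the parsing side needs in Python, here is a sketch that counts element names in a single document. Note that it deliberately uses the standard library's xml.etree.ElementTree rather than Beautiful Soup, so that it runs without extra packages; tag_histogram is an invented name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_histogram(xml_text):
    """Count how often each element name occurs in one XML document."""
    return Counter(elem.tag for elem in ET.fromstring(xml_text).iter())
```

Running this over every file and combining the per-file counters would give the raw numbers the question asks for.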
Answer 3 (score: 0)
[Community post, so no karma involved ;)] I propose a code-challenge here:
Parse all the XML files found at xmlfiles.com/examples and try to produce the following output:
Analyzing plant_catalog.xml:
Analyzing note.xml:
Analyzing portfolio.xml:
Analyzing note_ex_dtd.xml:
Analyzing home.xml:
Analyzing simple.xml:
Analyzing cd_catalog.xml:
Analyzing portfolio_xsl.xml:
Analyzing note_in_dtd.xml:
Statistical Elements Analysis of 9 xml documents with 34 elements
CATALOG*2 22%
  CD*26 50%
    ARTIST*26 100%
    COMPANY*26 100%
    COUNTRY*26 100%
    PRICE*26 100%
    TITLE*26 100%
    YEAR*26 100%
  PLANT*36 50%
    AVAILABILITY*36 100%
    BOTANICAL*36 100%
    COMMON*36 100%
    LIGHT*36 100%
    PRICE*36 100%
    ZONE*36 100%
breakfast-menu*1 11%
  food*5 100%
    calories*5 100%
    description*5 100%
    name*5 100%
    price*5 100%
note*3 33%
  body*1 100%
  from*1 100%
  heading*1 100%
  to*1 100%
page*1 11%
  para*1 100%
  title*1 100%
portfolio*2 22%
  stock*2 100%
    name*2 100%
    price*2 100%
    symbol*2 100%
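Judging by the numbers, the percentages in this output appear to be relative to the parent element rather than to all files: CATALOG occurs in 2 of the 9 files (22%), and CD in half of those (50%). A minimal Python sketch of that calculation follows; percent_of_parent is a hypothetical helper, and the assignment of file names to paths in the example is an assumption made for illustration.

```python
def percent_of_parent(files_by_path, path, total_files):
    """Share of the files containing the parent path that also contain `path`."""
    parent = path.rpartition("/")[0]
    # Top-level elements are measured against the total number of files
    parent_files = len(files_by_path[parent]) if parent else total_files
    # int() truncates toward zero, so 22.2 becomes 22
    return int(100 * len(files_by_path[path]) / parent_files)


# Assumed mapping from element path to the files that contain it
files_by_path = {
    "CATALOG": {"cd_catalog.xml", "plant_catalog.xml"},
    "CATALOG/CD": {"cd_catalog.xml"},
}
print(percent_of_parent(files_by_path, "CATALOG", 9))     # prints 22
print(percent_of_parent(files_by_path, "CATALOG/CD", 9))  # prints 50
```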
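Answer 4 (score: 0)
Here is a possible solution in Ruby to this code-challenge. Since this is my very first Ruby program, I'm sure it is quite badly coded, but at least it may answer J. Pablo Fernandez's question.
Copy-paste it into a '.rb' file and run ruby on it. If you have an Internet connection, it will work ;)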
require "rexml/document"
require "net/http"
require "iconv" # in the standard library only up to Ruby 1.9; later versions need the iconv gem
include REXML
class NodeAnalyzer
  @@fullPathToFilesToSubNodesNamesToCardinalities = Hash.new()
  @@fullPathsToFiles = Hash.new() # list of files in which a fullPath node is detected
  @@fullPaths = Array.new # all fullpaths sorted alphabetically
  attr_reader :name, :father, :subNodesAnalyzers, :indent, :file, :subNodesNamesToCardinalities
  def initialize(aName="", aFather=nil, aFile="")
    @name = aName; @father = aFather; @subNodesAnalyzers = []; @file = aFile
    @subNodesNamesToCardinalities = Hash.new(0)
    if aFather && !aFather.name.empty? then @indent = " " else @indent = "" end
    if aFather
      @indent = @father.indent + self.indent
      @father.subNodesAnalyzers << self
      @father.updateSubNodesNamesToCardinalities(@name)
    end
  end
  @@nodesRootAnalyzer = NodeAnalyzer.new
  def NodeAnalyzer.nodesRootAnalyzer
    return @@nodesRootAnalyzer
  end
  def updateSubNodesNamesToCardinalities(aSubNodeName)
    aSubNodeCardinality = @subNodesNamesToCardinalities[aSubNodeName]
    @subNodesNamesToCardinalities[aSubNodeName] = aSubNodeCardinality + 1
  end
  def NodeAnalyzer.recordNode(aNodeAnalyzer)
    if aNodeAnalyzer.fullNodePath.empty? == false
      if @@fullPaths.include?(aNodeAnalyzer.fullNodePath) == false then @@fullPaths << aNodeAnalyzer.fullNodePath end
      # record a full path against its XML file (only once for a given XML file)
      someFiles = @@fullPathsToFiles[aNodeAnalyzer.fullNodePath]
      if someFiles == nil
        someFiles = Array.new(); @@fullPathsToFiles[aNodeAnalyzer.fullNodePath] = someFiles;
      end
      if !someFiles.include?(aNodeAnalyzer.file) then someFiles << aNodeAnalyzer.file end
    end
    # record cardinalities of sub nodes for a given XML file
    someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath]
    if someFilesToSubNodesNamesToCardinalities == nil
      someFilesToSubNodesNamesToCardinalities = Hash.new(); @@fullPathToFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.fullNodePath] = someFilesToSubNodesNamesToCardinalities ;
    end
    someSubNodesNamesToCardinalities = someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file]
    if someSubNodesNamesToCardinalities == nil
      someSubNodesNamesToCardinalities = Hash.new(0); someFilesToSubNodesNamesToCardinalities[aNodeAnalyzer.file] = someSubNodesNamesToCardinalities
      someSubNodesNamesToCardinalities.update(aNodeAnalyzer.subNodesNamesToCardinalities)
    else
      aNodeAnalyzer.subNodesNamesToCardinalities.each() do |aSubNodeName, aCardinality|
        someSubNodesNamesToCardinalities[aSubNodeName] = someSubNodesNamesToCardinalities[aSubNodeName] + aCardinality
      end
    end
    #puts "someSubNodesNamesToCardinalities for #{aNodeAnalyzer.fullNodePath}: #{someSubNodesNamesToCardinalities}"
  end
  def file
    #if @file.empty? then @father.file else return @file end
    if @file.empty? then if @father != nil then return @father.file else return '' end else return @file end
  end
  def fullNodePath
    if @father == nil then return '' elsif @father.name.empty? then return @name else return @father.fullNodePath+"/"+@name end
  end
  def to_s
    s = ""
    if @name.empty? == false
      s = "#{@indent}#{self.fullNodePath} [#{self.file}]\n"
    end
    @subNodesAnalyzers.each() do |aSubNodeAnalyzer|
      s = s + aSubNodeAnalyzer.to_s
    end
    return s
  end
  def NodeAnalyzer.displayStats(aFullPath="")
    s = "";
    if aFullPath.empty? then s = "Statistical Elements Analysis of #{@@nodesRootAnalyzer.subNodesAnalyzers.length} xml documents with #{@@fullPaths.length} elements\n" end
    someFullPaths = @@fullPaths.sort
    someFullPaths.each do |aFullPath|
      s = s + getIndentedNameFromFullPath(aFullPath) + "*"
      nbFilesWithThatFullPath = getNbFilesWithThatFullPath(aFullPath);
      aParentFullPath = getParentFullPath(aFullPath)
      nbFilesWithParentFullPath = getNbFilesWithThatFullPath(aParentFullPath);
      aNameFromFullPath = getNameFromFullPath(aFullPath)
      someFilesToSubNodesNamesToCardinalities = @@fullPathToFilesToSubNodesNamesToCardinalities[aParentFullPath]
      # collect the distinct per-file cardinalities of this element under its parent
      someCardinalities = Array.new()
      someFilesToSubNodesNamesToCardinalities.each() do |aFile, someSubNodesNamesToCardinalities|
        aCardinality = someSubNodesNamesToCardinalities[aNameFromFullPath]
        if aCardinality > 0 && someCardinalities.include?(aCardinality) == false then someCardinalities << aCardinality end
      end
      if someCardinalities.length == 1
        # first.to_s prints "26" on every Ruby version (Array#to_s prints "[26]" from 1.9 on)
        s = s + someCardinalities.first.to_s + " "
      else
        anAvg = someCardinalities.inject(0) {|sum,value| Float(sum) + Float(value) } / Float(someCardinalities.length)
        s = s + sprintf('%.1f', anAvg) + " " + someCardinalities.min.to_s + "..." + someCardinalities.max.to_s + " "
      end
      # percentage of the files containing the parent that also contain this element
      s = s + sprintf('%d', Float(nbFilesWithThatFullPath) / Float(nbFilesWithParentFullPath) * 100) + '%'
      s = s + "\n"
    end
    return s
  end
  def NodeAnalyzer.getNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    aNameFromFullPath = aFullPath.dup
    aNameFromFullPath[/^(?:[^\/]+\/)+/] = ""
    return aNameFromFullPath
  end
  def NodeAnalyzer.getIndentedNameFromFullPath(aFullPath)
    if aFullPath.include?("/") == false then return aFullPath end
    anIndentedNameFromFullPath = aFullPath.dup
    anIndentedNameFromFullPath = anIndentedNameFromFullPath.gsub(/[^\/]+\//, " ")
    return anIndentedNameFromFullPath
  end
  def NodeAnalyzer.getParentFullPath(aFullPath)
    if aFullPath.include?("/") == false then return "" end
    aParentFullPath = aFullPath.dup
    aParentFullPath[/\/[^\/]+$/] = ""
    return aParentFullPath
  end
  def NodeAnalyzer.getNbFilesWithThatFullPath(aFullPath)
    if aFullPath.empty?
      return @@nodesRootAnalyzer.subNodesAnalyzers.length
    else
      return @@fullPathsToFiles[aFullPath].length;
    end
  end
end
class REXML::Document
  def analyze(node, aFatherNodeAnalyzer, aFile="")
    anNodeAnalyzer = NodeAnalyzer.new(node.name, aFatherNodeAnalyzer, aFile)
    node.elements.each() do |aSubNode| analyze(aSubNode, anNodeAnalyzer) end
    NodeAnalyzer.recordNode(anNodeAnalyzer)
  end
end
begin
  anXmlFilesDirectory = "xmlfiles.com/examples/"
  anXmlFilesRegExp = Regexp.new("http:\/\/" + anXmlFilesDirectory + "([^\"]*)")
  a = Net::HTTP.get(URI("http://www.google.fr/search?q=site:"+anXmlFilesDirectory+"+filetype:xml&num=100&as_qdr=all&filter=0"))
  someXmlFiles = a.scan(anXmlFilesRegExp)
  someXmlFiles.each() do |anXmlFile|
    anXmlFileContent = Net::HTTP.get(URI("http://" + anXmlFilesDirectory + anXmlFile.to_s))
    anUTF8XmlFileContent = Iconv.conv("ISO-8859-1//ignore", 'UTF-8', anXmlFileContent).gsub(/\s+encoding\s*=\s*\"[^\"]+\"\s*\?/,"?")
    anXmlDocument = Document.new(anUTF8XmlFileContent)
    puts "Analyzing #{anXmlFile}: #{NodeAnalyzer.nodesRootAnalyzer.name}"
    anXmlDocument.analyze(anXmlDocument.root,NodeAnalyzer.nodesRootAnalyzer, anXmlFile.to_s)
  end
  NodeAnalyzer.recordNode(NodeAnalyzer.nodesRootAnalyzer)
  puts NodeAnalyzer.displayStats
end
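One detail of displayStats is worth spelling out: it only keeps distinct cardinalities before averaging, so the reported average is the mean of the different values seen, not the mean over all files. The Python sketch below contrasts the two calculations on made-up numbers; both function names are invented for this illustration.

```python
from statistics import mean


def average_distinct(counts):
    """Mean of the distinct non-zero counts, as displayStats computes it."""
    return mean({c for c in counts if c > 0})


def average_per_file(counts):
    """Mean over every file that contains the element at least once."""
    return mean([c for c in counts if c > 0])


counts = [4, 4, 4, 6]            # hypothetical per-file counts
print(average_distinct(counts))  # prints 5
print(average_per_file(counts))  # prints 4.5
```

Which of the two is wanted depends on the report; the question's example (an average of 4.3 between 4 and 6) reads more like the per-file mean.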
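Answer 5 (score: 0)
I'd go with JeniT's answer. She is one of the first XSLT gurus I started learning from, back in '02. To really appreciate the power of XML, you should work with XPath and XSLT and learn to manipulate the nodes.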