Question

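Every day I need to process XML documents in differing formats into records in a MySQL database. The data I need from each document is interspersed with a lot of data I don't need, and the node names differ from document to document. For example: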

Source #1:

<object id="1">
    <title>URL 1</title>
    <url>http://www.one.com</url>
    <frequency interval="60" />
    <uselessdata>blah</uselessdata>
</object>
<object id="2">
    <title>URL 2</title>
    <url>http://www.two.com</url>
    <frequency interval="60" />
    <uselessdata>blah</uselessdata>
</object>

Source #2:

<object>
    <objectid>1</objectid>
    <thetitle>URL 1</thetitle>
    <link>http://www.one.com</link>
    <frequency interval="60" />
    <moreuselessdata>blah</moreuselessdata>
</object>
<object>
    <objectid>2</objectid>
    <thetitle>URL 2</thetitle>
    <link>http://www.two.com</link>
    <frequency interval="60" />
    <moreuselessdata>blah</moreuselessdata>
</object>

...I need each object's ID, interval, and URL.

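The approaches I have in mind are:

1.) Have a separate function parse each XML document and iteratively build the SQL queries from within that function

2.) Have a separate function parse each document, iteratively add each object to my own object class, and have the SQL work done by a class method

3.) Use XSLT to convert all the documents into a common XML format, then write a single parser for that format.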

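The XML documents themselves aren't all that large; most are under 1 MB. I don't expect their structure to change often (if ever), but there's a strong possibility that I'll need to add and remove sources over time. I'm open to all ideas.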

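Also, sorry if the XML samples above are mangled... they aren't terribly important; they're just meant to give a rough idea that the node names differ in each document.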

Answer 1

Using XSLT is overkill. I like approach (2); it makes a lot of sense.

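Using Python, I'd try making a class for each document type. The class would inherit from dict and, in its __init__, parse the given document and populate itself with 'id', 'interval', and 'url'.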

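Then the code in main becomes really trivial: just instantiate those classes (which are also dicts) with the appropriate documents and pass them on as ordinary dicts.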
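A minimal sketch of that idea, under a few assumptions of mine rather than anything stated in the answer: each dict subclass wraps one <object> element instead of a whole document, the sample fragments are wrapped in a hypothetical <objects> root so they parse, and sqlite3 stands in for MySQL so the example runs without a server. The names Source1Object, Source2Object, load, feeds, and freq_interval are all made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


class Source1Object(dict):
    """One <object> from source #1, reduced to id, interval and url."""

    def __init__(self, elem):
        super().__init__(
            id=int(elem.get("id")),
            interval=int(elem.find("frequency").get("interval")),
            url=elem.findtext("url"),
        )


class Source2Object(dict):
    """One <object> from source #2, reduced to the same three keys."""

    def __init__(self, elem):
        super().__init__(
            id=int(elem.findtext("objectid")),
            interval=int(elem.find("frequency").get("interval")),
            url=elem.findtext("link"),
        )


def load(xml_text, record_class):
    """Parse one document and wrap every <object> in the given class."""
    root = ET.fromstring(xml_text)
    return [record_class(elem) for elem in root.iter("object")]


# Sample documents; the <objects> root is an assumption added for parsing.
SOURCE_1 = """<objects>
  <object id="1">
    <title>URL 1</title>
    <url>http://www.one.com</url>
    <frequency interval="60" />
    <uselessdata>blah</uselessdata>
  </object>
</objects>"""

SOURCE_2 = """<objects>
  <object>
    <objectid>2</objectid>
    <thetitle>URL 2</thetitle>
    <link>http://www.two.com</link>
    <frequency interval="60" />
    <moreuselessdata>blah</moreuselessdata>
  </object>
</objects>"""

# "main": instantiate the right class per source, then treat them as dicts.
records = load(SOURCE_1, Source1Object) + load(SOURCE_2, Source2Object)

# sqlite3 is a stand-in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds (id INTEGER PRIMARY KEY, url TEXT, freq_interval INTEGER)"
)
# Named placeholders are filled from each record's dict keys.
conn.executemany(
    "INSERT INTO feeds (id, url, freq_interval) VALUES (:id, :url, :interval)",
    records,
)
conn.commit()
```

Adding a new source then means adding one small class; the insert code stays untouched. With a real MySQL driver the placeholder syntax would differ (typically %(name)s instead of :name).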

Answer 2

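I've used a variant of the third approach successfully, but the documents I was working with were a lot bigger. Whether it's overkill really depends on how fluent you are with XSLT.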

Answer 3

If your various input formats are unambiguous, you can do this:

<xsl:template match="object">
  <object>
    <id><xsl:value-of select="@id | objectid" /></id>
    <title><xsl:value-of select="title | thetitle" /></title>
    <url><xsl:value-of select="url | link" /></url>
    <interval><xsl:value-of select="frequency/@interval" /></interval>
  </object>
</xsl:template>

For your sample input, this produces:

<object>
  <id>1</id>
  <title>URL 1</title>
  <url>http://www.one.com</url>
  <interval>60</interval>
</object>
<object>
  <id>2</id>
  <title>URL 2</title>
  <url>http://www.two.com</url>
  <interval>60</interval>
</object>
<object>
  <id>1</id>
  <title>URL 1</title>
  <url>http://www.one.com</url>
  <interval>60</interval>
</object>
<object>
  <id>2</id>
  <title>URL 2</title>
  <url>http://www.two.com</url>
  <interval>60</interval>
</object>

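However, there may be faster ways of getting a usable result than XSLT. Just measure how fast each approach is, and consider how "ugly" each one feels. I'd tend to say that XSLT is the more elegant and maintainable solution for processing XML. YMMV.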

If your input formats are ambiguous and the solution above produces wrong results, a more explicit approach is needed, along these lines:

<xsl:template match="object">
  <object>
    <xsl:choose>
      <xsl:when test="@id and title and url and frequency/@interval">
        <xsl:apply-templates select="." mode="format1" />
      </xsl:when>
      <xsl:when test="objectid and thetitle and link and frequency/@interval">
        <xsl:apply-templates select="." mode="format2" />
      </xsl:when>
    </xsl:choose>
  </object>
</xsl:template>

<xsl:template match="object" mode="format1">
  <id><xsl:value-of select="@id" /></id>
  <title><xsl:value-of select="title" /></title>
  <url><xsl:value-of select="url" /></url>
  <interval><xsl:value-of select="frequency/@interval" /></interval>
</xsl:template>

<xsl:template match="object" mode="format2">
  <id><xsl:value-of select="objectid" /></id>
  <title><xsl:value-of select="thetitle" /></title>
  <url><xsl:value-of select="link" /></url>
  <interval><xsl:value-of select="frequency/@interval" /></interval>
</xsl:template>
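
For anyone who wants to measure a non-XSLT alternative against the templates above, here is a rough Python sketch of the same explicit format check. It is an illustration under assumptions, not part of the stylesheet: the function names and the <objects> wrapper root are hypothetical, values are kept as strings, and an unrecognized object yields None where the stylesheet would emit an empty <object>.

```python
import xml.etree.ElementTree as ET


def has_children(obj, *tags):
    """True if every named child element is present under obj."""
    return all(obj.find(tag) is not None for tag in tags)


def normalize(obj):
    """Map one <object> element to a common dict, or None if unrecognized."""
    freq = obj.find("frequency")
    if freq is None or freq.get("interval") is None:
        return None  # both formats require frequency/@interval
    interval = freq.get("interval")

    # Format 1: id attribute plus <title> and <url> children.
    if obj.get("id") is not None and has_children(obj, "title", "url"):
        return {
            "id": obj.get("id"),
            "title": obj.findtext("title"),
            "url": obj.findtext("url"),
            "interval": interval,
        }

    # Format 2: <objectid>, <thetitle> and <link> children.
    if has_children(obj, "objectid", "thetitle", "link"):
        return {
            "id": obj.findtext("objectid"),
            "title": obj.findtext("thetitle"),
            "url": obj.findtext("link"),
            "interval": interval,
        }

    return None


# Sample input mixing both formats and one unrecognized object.
SAMPLE = """<objects>
  <object id="1">
    <title>URL 1</title>
    <url>http://www.one.com</url>
    <frequency interval="60" />
  </object>
  <object>
    <objectid>2</objectid>
    <thetitle>URL 2</thetitle>
    <link>http://www.two.com</link>
    <frequency interval="60" />
  </object>
  <object><unknown>blah</unknown></object>
</objects>"""

rows = [normalize(obj) for obj in ET.fromstring(SAMPLE).iter("object")]
```

Timing this against the XSLT route on real documents is the only way to tell which is faster in practice.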

Processing XML into MySQL in good form
