XML to CSV: a native way to create a flat file?

Time: 2012-10-02 10:12:45

Tags: php xml parsing csv

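There are plenty of questions about XML-to-CSV conversion, but mine is quite specific, so I'm asking it here.

I process inventory data and receive XML files such as: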

<item>
   <sku>abc</sku>
   <title>un livre</title>
   <price>42</price>
   <attributes>
      <attribute>
          <name>Nombre de pages</name>
          <value>123</value>
      </attribute>
      <attribute>
          <name>Auteur</name>
          <value>bob la mouche</value>
      </attribute>
   </attributes>
   <contributors>
      <contributor>toto</contributor>
      <contributor>titi</contributor>
      <contributor>tata</contributor>
   </contributors>
</item>

I need to convert this to a 2D format. I'd like to use a well-known notation, such as array/object notation, so that the columns look like:

sku
title
price
attributes.attribute[0].name
attributes.attribute[0].value
attributes.attribute[1].name
attributes.attribute[1].value
contributors.contributor[0]
contributors.contributor[1]
contributors.contributor[2]

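If there is no direct answer, please don't write the code for me; I know how to do it and will happily share it with you. I only want to know whether a direct way exists (for example, some rather exotic method in an XML parsing library).

Thanks

2 answers:

Answer 0 (score: 1)

OK, this is just a rough skeleton: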

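from lxml import etree
import csv

# xml_string and outfilepath are assumed to be defined by the caller
tree = etree.fromstring(xml_string, parser=etree.XMLParser())
with open(outfilepath, 'w') as f:
    writer = csv.writer(f)
    writer.writerow(<fieldnames>)  # placeholder: the list of column names
    for item_node in tree.xpath('//item'):
        var1 = item_node.xpath('.../text()')[0]  # placeholder XPath for one field
        ...
        writer.writerow([var1, var2, ...])  # writerow() takes a single sequence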
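To make the skeleton concrete, here is a minimal sketch for the simple case where the columns are known in advance. It swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra dependencies; the sample XML, the FIELDNAMES list and the items_to_csv name are illustrative assumptions, not part of the original answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input: simplified <item> nodes wrapped in an <items> root.
XML_STRING = """<items>
  <item><sku>abc</sku><title>un livre</title><price>42</price></item>
  <item><sku>def</sku><title>autre livre</title></item>
</items>"""

FIELDNAMES = ["sku", "title", "price"]  # fixed columns, known in advance


def items_to_csv(xml_string, out):
    """Write one CSV row per <item>, one column per name in FIELDNAMES."""
    root = ET.fromstring(xml_string)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDNAMES)
    for item in root.iter("item"):
        # findtext() returns the default ("") when the child element is missing
        writer.writerow([item.findtext(name, default="") for name in FIELDNAMES])


buffer = io.StringIO()
items_to_csv(XML_STRING, buffer)
print(buffer.getvalue())
```

This only covers flat, fixed columns; nested or repeated nodes such as attributes and contributors need a path-based approach.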

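If your XML is very large and won't fit in memory, there are other options for reading it sequentially from a file. That said, I have parsed several MB of HTML this way.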
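One such sequential option, sketched here with the standard library rather than lxml, is ElementTree's iterparse(), which hands over each element as soon as its closing tag is read. The iter_items name and the in-memory sample are assumptions for illustration; in practice the source would be a file path or an open file.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag="item"):
    """Yield each <item> element once its closing tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs when the consumer asks for the next item: drop the
            # children and text of the element that was just processed.
            elem.clear()


# Hypothetical in-memory stand-in for a large file on disk.
source = io.BytesIO(
    b"<items><item><sku>a</sku></item><item><sku>b</sku></item></items>"
)
skus = [item.findtext("sku") for item in iter_items(source)]
print(skus)
```

The root element still keeps the emptied item shells, so memory use is reduced rather than constant.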

Answer 1 (score: 0)

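I didn't find any existing way to do this. RecursiveIteratorIterator was a good candidate, but unfortunately it only gives a single node name as the key, not the full path to the leaf node.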
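The missing piece is the full path to each leaf, which can be carried along during a recursive walk. The following is a small Python sketch of that idea, not the approach of any particular library; the leaf_paths name is hypothetical, and the column naming follows the question's layout (contributors.contributor[0]).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def leaf_paths(node, prefix=""):
    """Yield (dotted_path, text) for every leaf element below node.

    Sibling names that repeat get a [n] index, e.g. contributor[0].
    """
    totals = Counter(child.tag for child in node)  # how often each name occurs
    seen = Counter()                               # next index per repeated name
    for child in node:
        name = child.tag
        if totals[name] > 1:
            name = f"{child.tag}[{seen[child.tag]}]"
            seen[child.tag] += 1
        path = f"{prefix}.{name}" if prefix else name
        if len(child):  # the element has children: keep descending
            yield from leaf_paths(child, path)
        else:
            yield path, (child.text or "").strip()


item = ET.fromstring(
    "<item><sku>abc</sku><contributors>"
    "<contributor>toto</contributor><contributor>titi</contributor>"
    "</contributors></item>"
)
row = dict(leaf_paths(item))
print(row)
```

Each item then becomes a dictionary keyed by column name, which maps directly onto one CSV row.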

I ended up writing a class that does the job: it takes any XML document and creates a CSV with "array/object" style column names.

/**
 * This class converts a multidimensional XML file into a CSV file.
 * Each first-level XML node becomes a distinct CSV line.
 * Each last-level XML node becomes a value.
 * The path from the first to the last XML level becomes the CSV column name.
 *
 * @todo Possible conflicts with XML nodes whose names end with [n].
 *              => Case <x>test</x><x>test</x><x[0]>test</x[0]>
 *                 will generate x[0] twice in the CSV file.
 *
 * @author ninsuo <ninsuo at gmail dot com>
 */
class Xml2Csv
{

    /**
     * An array that looks like :
     *
     * array(
     *   'simple.name' => stdClass(
     *      ->names => array('real.name[0]', 'real.name[1]', ...)
     *      ->positions => array('position-in-rows-1', 'position-in-rows-2', ...)
     *   ),
     *   ...
     * )
     *
     * Allow us to create dynamic column names according to
     * content disposition.
     *
     * @access private
     * @var array
     */
    private $columnNames;

    /**
     * Rows of CSV file
     *
     * @access private
     * @var array
     */
    private $rows;

    /**
     * Current row number
     *
     * @access private
     * @var int
     */
    private $rowNumber;

    public function convert($xmlSource, $csvTarget)
    {
        $this->_checkSourceAndTarget($xmlSource, $csvTarget);

        libxml_use_internal_errors(true); // collect libxml errors instead of emitting warnings
        try
        {
            $tree = new SimpleXMLIterator($xmlSource, 0, true);
        }
        catch (Exception $e)
        {
            throw new Exception("Can't load XML : " . $e->getMessage());
        }
        libxml_clear_errors();

        $this->_reset();

        $this->_browseXMLTree($tree);
        unset($tree);

        $this->_writeCSV($csvTarget);

        $this->_reset();
    }

    /**
     * Checks if $source file exists and is readable.
     * Checks if $target file is writable
     *
     * @access private
     * @param string $source
     * @param string $target
     * @throws Exception
     */
    private function _checkSourceAndTarget($source, $target)
    {
        if ((!is_file($source)) || (!is_readable($source)))
        {
            throw new Exception("Source file does not exist or is not readable.");
        }
        if (((is_file($target)) && (!is_writable($target))) || (!is_writable(dirname($target))))
        {
            throw new Exception("Target file is not writable.");
        }
    }

    /**
     * Reset attributes (avoid taking huge amount of memory when converting big files)
     *
     * @access private
     */
    private function _reset()
    {
        $this->columnNames = array ();
        $this->rows = array ();
        $this->rowNumber = 0;
    }

    /**
     * First XML-level are CSV rows
     *
     * @access private
     * @param SimpleXMLIterator $tree
     */
    private function _browseXMLTree($tree)
    {
        foreach ($tree as $node)
        {
            if (count($node) > 0)
            {
                $this->rows[$this->rowNumber] = array ();
                $this->_browseXMLNode($node);
                $this->rowNumber++;
            }
        }
    }

    /**
     * Browsing next XML levels until a node has no child (CSV value)
     *
     * @access private
     * @param SimpleXMLIterator $node
     * @param array $path
     */
    private function _browseXMLNode($node, array &$path = array ())
    {
        array_push($path, $node->getName());
        foreach ($node as $key => $child)
        {
            if (count($child) > 0)
            {
                $this->_browseXMLNode($child, $path);
            }
            else
            {
                $this->_addValue(implode('.', $path) . '.' . $key, strval($child));
            }
        }
        array_pop($path);
    }

    /**
     * Create a CSV column if it does not exist.
     * Add a value to the given CSV column.
     *
     * @access private
     * @param string $column
     * @param string $value
     */
    private function _addValue($column, $value)
    {
        if (array_key_exists($column, $this->columnNames))
        {
            $columnInfo = $this->columnNames[$column];
            foreach ($columnInfo->positions as $position)
            {
                if (array_key_exists($position, $this->rows[$this->rowNumber]) == false)
                {
                    $this->rows[$this->rowNumber][$position] = $value;
                    return;
                }
            }
            if (count($columnInfo->positions) == 1)
            {
                $columnInfo->names[0] .= '[0]';
            }
            $columnPosition = $this->_countCSVColumns();
            array_push($columnInfo->names, $column . '[' . count($columnInfo->positions) . ']');
            array_push($columnInfo->positions, $columnPosition);
            $this->columnNames[$column] = $columnInfo;
            $this->rows[$this->rowNumber][$columnPosition] = $value;
        }
        else
        {
            $columnPosition = $this->_countCSVColumns();
            $columnInfo = new stdClass();
            $columnInfo->names[0] = $column;
            $columnInfo->positions[0] = $columnPosition;
            $this->columnNames[$column] = $columnInfo;
            $this->rows[$this->rowNumber][$columnPosition] = $value;
        }
    }

    /**
     * Return current number of columns in the CSV file.
     * Used to get index of a new column.
     *
     * @access private
     * @return int
     */
    private function _countCSVColumns()
    {
        $count = 0;
        foreach ($this->columnNames as $columnInfo)
        {
            $count += count($columnInfo->positions);
        }
        return $count;
    }

    /**
     * Write CSV file
     *
     * @access private
     * @param string $csvTarget
     */
    private function _writeCSV($csvTarget)
    {
        $columns = $this->_getCSVColumns();
        if (($handle = fopen($csvTarget, 'w')) === false)
        {
            throw new Exception("Cannot open target file : fopen() failed.");
        }
        $this->_writeCsvRow($handle, $columns);

        $columnPositions = array_keys($columns);
        $columnNumber = count($columnPositions);
        for ($currentRow = 0; ($currentRow < $this->rowNumber); $currentRow++)
        {
            $csvRow = array ();
            for ($currentColumn = 0; ($currentColumn < $columnNumber); $currentColumn++)
            {
                $position = $columnPositions[$currentColumn];
                if (array_key_exists($position, $this->rows[$currentRow]) == false)
                {
                    $csvRow[$position] = '';
                }
                else
                {
                    $csvRow[$position] = $this->rows[$currentRow][$position];
                }
            }
            $this->_writeCsvRow($handle, $csvRow);
        }

        fclose($handle);
    }

    /**
     * Return CSV columns as a single array
     *
     * @access private
     * @return array
     */
    private function _getCSVColumns()
    {
        $columns = array ();
        foreach ($this->columnNames as $columnInfo)
        {
            foreach ($columnInfo->names as $key => $name)
            {
                $columns[$columnInfo->positions[$key]] = $name;
            }
        }
        ksort($columns);
        return $columns;
    }

    /**
     * Write a row into CSV file
     *
     * @access private
     * @param resource $handle
     * @param array $csvRow
     * @throws Exception
     */
    private function _writeCsvRow($handle, $csvRow)
    {
        if (fputcsv($handle, $csvRow, "\t", '"') === false)
        {
            fclose($handle);
            throw new Exception("Cannot write target file, fputcsv() failed.");
        }
    }

}

Demo:

1/ Create a demo.xml file:

<items>
<item>
   <sku>abc 1</sku>
   <title>a book 1</title>
   <price>42 1</price>
   <attributes>
      <attribute>
          <name>Number of pages 1</name>
          <value>123 1</value>
      </attribute>
      <attribute>
          <name>Author 1</name>
          <value>Rob dude 1</value>
      </attribute>
   </attributes>
   <contributors>
      <contributor>John 1</contributor>
      <contributor>Ryan 1</contributor>
   </contributors>
</item>
<item>
   <sku>abc 2</sku>
   <title>a book 2</title>
   <price>42 2</price>
   <attributes>
      <attribute>
          <name>Number of pages 2</name>
          <value>123 2</value>
      </attribute>
      <attribute>
          <name>Author 2</name>
          <value>Rob dude 2</value>
      </attribute>
   </attributes>
   <contributors>
      <contributor>John 2</contributor>
      <contributor>Ryan 2</contributor>
   </contributors>
</item>
</items>

2/ Put this somewhere in your code:

$service = new Xml2Csv();
$service->convert('demo.xml', 'demo.csv');

3/ Check the "demo.csv" output:

item.sku    item.title  item.price  item.attributes.attribute.name[0]   item.attributes.attribute.value[0]  item.attributes.attribute.name[1]   item.attributes.attribute.value[1]  item.contributors.contributor[0]    item.contributors.contributor[1]
"abc 1" "a book 1"  "42 1"  "Number of pages 1" "123 1" "Author 1"  "Rob dude 1"    "John 1"    "Ryan 1"
"abc 2" "a book 2"  "42 2"  "Number of pages 2" "123 2" "Author 2"  "Rob dude 2"    "John 2"    "Ryan 2"

Note: this was written quickly, but you can change the CSV delimiter, or anything else you like, by adding parameters to the convert() method.
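As an illustration of what such parameters could look like, here is a short sketch in Python (the language of the other answer) rather than a change to the PHP class. The write_rows name and the sample rows are hypothetical. It takes the delimiter and quote character as arguments and, like the class's writing step, leaves a cell empty when a row has no value for a column.

```python
import csv
import io


def write_rows(rows, out, delimiter="\t", quotechar='"'):
    """Write dict rows as CSV; columns are the union of keys, first seen first."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        restval="",  # value used for columns a row does not have
        delimiter=delimiter,
        quotechar=quotechar,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical rows, shaped like the output of a flattening step.
rows = [
    {"item.sku": "abc 1", "item.title": "a book 1"},
    {"item.sku": "abc 2", "item.price": "42 2"},
]
buffer = io.StringIO()
write_rows(rows, buffer, delimiter=";")
print(buffer.getvalue())
```

Python's csv module quotes a field only when it has to, so values containing spaces are not wrapped in quotes as they are in the demo output above.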

Enjoy.