Extracting table data from Wikipedia and converting it to an XML document

Time: 2012-01-06 21:36:01

Tags: php xml

Page:http://en.wikipedia.org/wiki/ISO_4217#Active_codes

Is it possible to extract each:

  • currency code
  • currency title
  • currency location

and, if possible, save them to an XML document like this:

<currency>
    <AED>
        <curr>United Arab Emirates dirham</curr>
        <loc>United Arab Emirates</loc>
    </AED>
</currency>
<currency>
    <AFN>
        <curr>Afghan afghani</curr>
        <loc>Afghanistan</loc>
    </AFN>
</currency>

I'm not sure whether this helps, but I found that you can convert a wiki page into some kind of XML structure:

http://en.wikipedia.org/wiki/Special:Export/ISO_4217#Active_codes

Thanks.

1 answer:

Answer 0: (score: 2)

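The table is written in wiki markup, so its source is available here: http://en.wikipedia.org/w/index.php?title=ISO_4217&action=edit&section=4

You can write a script that parses the wiki markup into an array and builds the XML from that. Try splitting the string on newlines (for example, with explode), then splitting each line on ||, which separates the table columns.

Something like this: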

$currencyList = array();
$source = "<insert wikipedia table code here>";

$rows = explode("\n", $source); // split the table in rows

foreach($rows as $row) {

    if(trim($row) === '') { continue; } // ignore empty rows
    if(trim($row) == "|-") { continue; } // ignore table line separators

    $row = substr($row, 2); // remove the "| " from the beginning of each row

    $cols = explode("||", $row); // split the row in columns
    if(count($cols) < 4) { continue; } // skip header or malformed rows that lack the expected columns

    $currency = array( // clean data and store in associative array
         'code' => trim($cols[0]),
         'number' => trim($cols[1]),
         'digits_after_decimal' => trim($cols[2]),
         'name' => trim($cols[3])
    );

    array_push($currencyList, $currency); // add the read currency to the list

}

var_dump($currencyList); // $currencyList now has a list of associative arrays with your data.

To build the XML, you can try PHP's SimpleXML.
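
As a rough sketch of that last step, the snippet below turns a list shaped like $currencyList into XML with SimpleXML. Several things here are assumptions rather than part of the answer: the helper name buildCurrencyXml, the <currencies> root element (an XML document needs a single root, which the layout in the question lacks), the sample rows, and the optional 'location' key, which the parsing loop above does not collect.

```php
<?php
// Hypothetical sample data in the same shape the parsing loop produces.
$currencyList = array(
    array('code' => 'AED', 'number' => '784', 'digits_after_decimal' => '2', 'name' => 'United Arab Emirates dirham'),
    array('code' => 'AFN', 'number' => '971', 'digits_after_decimal' => '2', 'name' => 'Afghan afghani'),
);

// buildCurrencyXml() is an illustrative helper, not a SimpleXML function.
function buildCurrencyXml(array $currencyList)
{
    // Assumed root element so that the result is a well-formed document.
    $root = new SimpleXMLElement('<currencies/>');

    foreach ($currencyList as $currency) {
        // The code becomes an element name, so accept only three capital letters.
        if (!preg_match('/^[A-Z]{3}$/', $currency['code'])) {
            continue;
        }

        $node = $root->addChild('currency')->addChild($currency['code']);

        // Property assignment escapes special characters such as "&".
        $node->curr = $currency['name'];

        // 'location' is a hypothetical key; it is written only when present.
        if (isset($currency['location'])) {
            $node->loc = $currency['location'];
        }
    }

    return $root->asXML();
}

$xml = buildCurrencyXml($currencyList);
echo $xml;
```

Note that names taken straight from the wiki source may still carry link markup such as [[...]], which would need to be stripped before being written out.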