早上好 -
我有兴趣看到一种解析层次文本文件值的有效方法(即,具有Title => Multiple Headings => Multiple Subheadings => Multiple的文件Keys => Multiple Values)成一个简单的XML文档。为简单起见,答案将使用:
编写这是我正在使用的库存文件的示例。请注意,Header = FOODS ,Sub-Header = Type(A,B ...),Keys = PRODUCT(或CODE等)和值可能还有一行。
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
这是所需的输出(请原谅任何XML语法错误):
<foods>
<food type = "A" >
<product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
<product>La Fe String Cheese</product>
<code>Sell by date going back to February 1, 2009</code>
<manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
<volume>11,000 boxes</volume>
<distibution>NJ, NY, DE, MD, CT, VA</distribution>
</food>
<food type = "A" >
<product>Peanut Brittle No Sugar Added</product>
<product>Peanut Brittle Small Grind</product>
<product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
<code>Lots 7109 - 8350 inclusive</code>
<code>Lots 8198 - 8330 inclusive</code>
<code>Lots 7075 - 9012 inclusive</code>
<code>Lots 7100 - 8057 inclusive</code>
<code>Lots 7152 - 8364 inclusive</code>
<manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
<volume>5,749 units</volume>
<distibution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
</food>
<food type = "B" >
<product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
<code>990-10/2 10/5</code>
<manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
<volume>384</volume>
<distibution>PR</distribution>
</food>
</FOODS>
<!-- and so forth -->
到目前为止,我的方法(使用庞大的文本文件可能效率很低)将是以下之一:
循环和多个Select / Case语句,其中文件被加载到字符串缓冲区中,并在循环遍历每一行时,查看它是否与标题/子标题/键之一匹配行,将适当的xml标记附加到xml字符串变量,然后根据关于哪个键名最新的IF语句将子节点添加到xml(这看起来很耗时且容易出错,特别是如果文本更改甚至稍微) - 或者
使用REGEX(正则表达式)查找和替换具有相应xml标记的关键字段,使用xml库清理它,然后导出xml文件。问题是,我几乎不使用正则表达式,所以我需要一些基于示例的帮助。
任何帮助或建议都将不胜感激。
感谢。
答案 0 :(得分:2)
您可以用作起点的示例。至少我希望它能给你一个想法...
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);
$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');
// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;
while ( false!==($t=getstruct($fp)) ) {
switch( $t[0] ) {
case TYPE_HEADER:
if ( is_null($container['element']) ) {
// this is the first time we hit **header - subheader**
$container['name'] = $t[1][0];
// ugly hack, < . name . />
$container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
// each subsequent new item gets the new subheader as type attribute
$item['type'] = $t[1][1];
// dummy implementation: "deducting" the item names from header/container[name]
$item['name'] = substr($t[1][0], 0, -1);
}
else {
// hitting **header - subheader** the (second, third, nth) time
/*
header must be the same as the first time (stored in container['name']).
Otherwise you need another container element since
xml documents can only have one root element
*/
if ( $container['name'] !== $t[1][0] ) {
echo $container['name'], "!==", $t[1][0], "\n";
die('format error');
}
else {
// subheader may have changed, store it for future item elements
$item['type'] = $t[1][1];
}
}
break;
case TYPE_DELIMETER:
assert( !is_null($container['element']) );
assert( !is_null($item['name']) );
assert( !is_null($item['type']) );
/* that's maybe not a wise choice.
You might want to check the complete item before appending it to the document.
But the example is a hack anyway ...so create a new item element and append it to the container right away
*/
$item['current_element'] = $container['element']->addChild($item['name']);
// set the type-attribute according to the last **header - subheader** encountered
$item['current_element']['type'] = $item['type'];
break;
case TYPE_KEY:
$key = $t[1][0];
break;
case TYPE_VALUE:
assert( !is_null($item['current_element']) );
assert( !is_null($key) );
// this is a value belonging to the "last" key encountered
// create a new "key" element with the value as content
// and addit to the current item element
$tmp = $item['current_element']->addChild($key, $t[1][0]);
break;
default:
die('unknown token');
}
}
if ( !is_null($container['element']) ) {
$doc = dom_import_simplexml($container['element']);
$doc = $doc->ownerDocument;
$doc->formatOutput = true;
echo $doc->saveXML();
}
die;
/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter)
*/
function getstruct($fp) {
if ( feof($fp) ) {
return false;
}
// shortcut: all we care about "happens" on one line
// so let php read one line in a single step and then do the pattern matching
$line = trim(fgets($fp));
// this matches **key** and **header - subheader**
if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
// only for **header - subheader** $m[2] is set.
if ( isset($m[2]) ) {
return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
}
else {
return array(TYPE_KEY, array($m[1]));
}
}
// this matches _____________ and means "new item"
else if ( preg_match('#^_+$#', $line, $m) ) {
return array(TYPE_DELIMETER, array());
}
// any other non-empty line is a single value
else if ( preg_match('#\S#', $line) ) {
// you might want to filter the 1),2),3) part out here
// could also be two diffrent token types
return array(TYPE_VALUE, array($line));
}
else {
// skip empty lines, would be nicer with tail-recursion...
return getstruct($fp);
}
}
打印
<?xml version="1.0"?>
<FOODS>
<FOOD type="TYPE A">
<PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
<PRODUCT>2) La Fe String Cheese</PRODUCT>
<CODE>Sell by date going back to February 1, 2009</CODE>
<MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
<VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
<DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE A">
<PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
<PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
<PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
<CODE>1) Lots 7109 - 8350 inclusive;</CODE>
<CODE>2) Lots 8198 - 8330 inclusive;</CODE>
<CODE>3) Lots 7075 - 9012 inclusive;</CODE>
<CODE>4) Lots 7100 - 8057 inclusive;</CODE>
<CODE>5) Lots 7152 - 8364 inclusive</CODE>
<MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
<VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
<DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE B">
<PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
<CODE>990-10/2 10/5</CODE>
<MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
<VOLUME OF UNITS>384</VOLUME OF UNITS>
<DISTRIBUTION>PR</DISTRIBUTION>
</FOOD>
</FOODS>
不幸的是,ANTLR的php模块的状态目前是“Runtime is in alpha status.”,但无论如何它都值得一试......
答案 1 :(得分:1)
请参阅:http://www.tuxradar.com/practicalphp/21/5/6
这告诉您如何使用PHP将文本文件解析为标记。解析后,您可以将其放入任何您想要的内容中。
您需要根据您的条件在文件中搜索特定的令牌:
例如: 的产品强>
这为您提供了XML标记
然后1)可以有特殊含义
1)花生脆...
这告诉您要在XML标记中放入什么。
我不知道这是否是完成任务的最有效方法,但它是编译器解析文件的方式,并且有可能非常准确。
答案 2 :(得分:0)
而不是Regex或PHP使用XSLT 2.0 unparsed-text()函数来读取文件(参见http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html)
答案 3 :(得分:0)
XSLT 1.0解决方案的另一个提示是:http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a