如何将非结构化文本文件转换为PHP多维数组

时间:2013-08-19 16:21:54

标签: php regex

我有100个文件,每个文件包含x个新闻文章。这些文章的章节结构如下:

HD BY WC PD SN SC PG LA CY LP TD CO IN NS RE IPC PUB AN

其中[LP][TD]可以包含任意数量的段落。

典型的消息如下:

HD Corporate News: Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
WC 421 words
PD 12 July 2011
SN The Wall Street Journal
SC J
PG B7
LA English
CY (Copyright (c) 2011, Dow Jones & Company, Inc.) 

LP 

Alcoa Inc.'s profit more than doubled in the second quarter, but the giant 
aluminum producer managed only to meet analysts' recently lowered forecasts.

Alcoa serves as a bellwether for U.S. corporate earnings because it is the 
first major company to report and draws demand from a wide range of 
industries.

TD 

The results marked an early test of how corporate optimism is holding up 
in the face of bleak economic news.

License this article from Dow Jones Reprint 
Service[http://www.djreprints.com/link/link.html?FACTIVA=wjco20110712000115]

CO 
almam : ALCOA Inc

IN 
i2245 : Aluminum | i22 : Primary Metals | i224 : Non-ferrous Metals | imet 
  : Metals/Mining

NS 
c15 : Performance | c151 : Earnings | c1521 : Analyst 
Comment/Recommendation | ccat : Corporate/Industrial News | c152 : 
Earnings Projections | ncat : Content Types | nfact : Factiva Filters | 
nfce : FC&E Exclusion Filter | nfcpin : FC&E Industry News Filter

RE 
usa : United States | use : Northeast U.S. | uspa : Pennsylvania | namz : 
North America

IPC 
DJCS | EWR | BSC | NND | CNS | LMJ | TPT

PUB 
Dow Jones & Company, Inc.

AN 
Document J000000020110712e77c00035

在每篇文章之后,在新文章开始之前有4个换行符。我需要把这些文章放到一个数组中,如下所示:

$articles = array(
  [0] = array (
    [HD] => Corporate News: Alcoa earnings Soar; Outlook...
    [BY] => By James R. Hagerty...
    ...
    [AN] => Document J000000020110712e77c00035
  )
)

[edit] 感谢@Casimir et Hippolyte,我现在有:

$path = "C:/path/to/textfiles/";

if ($handle = opendir($path)) {
  while (false !== ($file = readdir($handle))) {
    if ('.' === $file) continue;
    if ('..' === $file) continue;

    $text = file_get_contents($path . $file);
    $subjects = explode("\r\n\r\n\r\n\r\n", $text);

    $pattern = <<<'LOD'
        ~
        # definition
        (?(DEFINE)(?<fieldname>(?<=^|\n)(?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)))
        # pattern
        \G(?<key>\g<fieldname>)\s++(?<value>[^\n]++(?>\n{1,2}+(?!\g<fieldname>) [^\n]++ )*+)(?>\n{1,3}|$)
        ~x 
LOD;

    $result = array();
    foreach($subjects as $i => $subject) {
      if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
          $result[$i][$match['key']] = $match['value'];
        }
      }
    }
  }
  closedir($handle);
  echo '<pre>';
  print_r($result);
}

但是,没有找到匹配项,也没有产生任何错误。有人可以问我这里出了什么问题吗?

1 个答案:

答案 0 :(得分:3)

使用 explode 分隔每个块的方法和使用正则表达式提取字段的方法:

$pattern = <<<'LOD'
~
# definition
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
                  (?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)
    )
)

# pattern
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\n]++ 
    (?> \n{1,2}+ (?!\g<fieldname>) [^\n]++ )*+
)
(?>\n{1,3}|$)
~x
LOD;
$subjects = explode("\n\n\n\n", $text);
$result = array();

foreach($subjects as $i=>$subject) {
    if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $result[$i][$match['key']]=$match['value'];
        }
    }
}
echo '<pre>';
print_r($result);

模式细节:

该模式分为两部分:

  • 定义:您可以在哪里编写子模式以供以后使用
  • 模式本身

在定义部分,我编写了一个名为 fieldname 的子模式,其中我将所有字段名称和条件放在开头。条件检查fiedname是否以字符串的开头(^)或换行符(\n)开头,以避免在段落中捕获相同的字母。

模式部分的描述:

\G                        # this forces the match to be contiguous to the
                          # precedent match or the start of the string (no gap)
(?<key> \g<fieldname> )   # a capturing group named "key" for the fieldname
\s++                      # one or more white characters
(?<value>                 # open a capturing group named "value" for the
                          # field content
    [^\n]++               # all characters except newlines 1 or more times
    (?>                   # open an atomic group
        \n{1,2}+          # one or two newlines to allow paragraphs (LP & TD) 
        (?!\g<fieldname>) # but not followed by a fieldname (only a check)
        [^\n]++           # all characters except newlines 1 or more times
    )*+                   # close the atomic group and repeat 0 or more times
)                         # close the capture group "value"
(?>\n{1,3}|$)             # between 1 or 3 newlines max. or the end of the
                          # string (necessary if i want contiguous matches)

$ pattern 末尾的x允许使用正则表达式中的详细模式(您可以使用#放置注释,并且可以根据需要使用空格格式化代码)

注意:此模式不关心字段顺序以及它们是否存在。 为了更具可读性,我使用nowdoc syntax<<<'ABC')小心使用它。

如果您的文本文件具有换行符的窗口格式(即\r\n),则必须将模式更改为:

$pattern = <<<'LOD'
~
# definition
(?(DEFINE)
    (?<fieldname> (?<=^|\n)
                  (?>HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)
    )
)

# pattern
\G(?<key>\g<fieldname>) \s++
(?<value>
    [^\r\n]++ 
    (?> (?>\r?\n){1,2}+ (?!\g<fieldname>) [^\r\n]++ )*+
)
(?>(?>\r?\n){1,3}|$)
~x
LOD;
$subjects = explode("\r\n\r\n\r\n\r\n", $text);