Regex to extract only contact information from an MBOX file to CSV

Time: 2015-03-25 10:37:37

Tags: php arrays regex csv

As the title suggests, I'm trying to parse a large MBOX file (16,000 emails in a single file), and I'm working with a small file for testing.

My PHP so far is:

$string = file_get_contents("test.mbox");
$matches = array(); //create array
$patt = '/name:\s([^\r]+)|email:\s([^\r]+)/';
preg_match_all($patt, $string, $matches); //find matching pattern
print_r($matches);

$fp = fopen('test.csv', 'w');
foreach ($matches as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);

But I need the output in a format that is easy to import. Currently my regex returns:

Array (
    [0] => Array (
        [0] => name: Andrew
        [1] => email: andrew@gmail.com
        [2] => name: Second Dude
        [3] => email: second@gmail.com.au
        [4] => name: Stuart Richards
        [5] => email: stuart@gmail.com
        [6] => name: Stuart Richards2
        [7] => email: stuart2@gmail.com
        [8] => name: Stuart Richards3
        [9] => email: stuart3@gmail.com )
    [1] => Array (
        [0] => Andrew
        [1] =>
        [2] => Second Dude
        [3] =>
        [4] => Stuart Richards
        [5] =>
        [6] => Stuart Richards
        [7] =>
        [8] => Stuart Richards
        [9] => )
    [2] => Array (
        [0] => 
        [1] => andrew@gmail.com
        [2] =>
        [3] => second@gmail.com.au
        [4] =>
        [5] => stuart@gmail.com
        [6] =>
        [7] => stuart2@gmail.com
        [8] =>
        [9] => stuart3@gmail.com ) )

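This is the data I want, but it ends up in the CSV laid out like a crosstab query, with one row per capture group. The first row contains the full matches, e.g. "name: Andrew,email: andrew@gmail.com", etc. The second row of the CSV contains only the names: "Andrew,,Second Dude,,", etc., each in the column where it matched. The third row contains only the emails: ",andrew@gmail.com,second@gmail.com,,".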

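I have 16,000 emails that contain name:, email: and a couple of other headers, and I want to be able to import them easily into my database, so I need a CSV with one record per line:

NAME1,EMAIL1,PHONE1
NAME2,EMAIL2,PHONE2
NAME3,EMAIL3,PHONE3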

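Can anyone help me? I've tried many things, including post-processing the file after it was written in the crosstab format, with no luck. I also tried adding a newline after each regex, again with no luck.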

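I only started using PHP over the weekend and have already used this site a lot! So if you can point me in the right direction to learn the syntax for what I'm trying to do, I'd appreciate it. I've reached the point of clicking resource links I've already read ten times, so I thought I'd ask for some help.

Cheers,
Andrew

A sample of my test mbox file: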

---------- Forwarded message ----------
From: 
Date: Sat, Jan 3, 2015 at 9:38 AM
Subject: campaign Campaign (.INFO)
To: 


Visitor's IP: 58.165.117.
name: Andrew Cowley
suburb: Victoria point
email: andrew@gmail.com
phone: 04035752
powerbill: $500
System_Required: 
Date:Sat-Jan-2015 10:38:00
Key:

from:  - landing page



---------- Forwarded message ----------
From: 
Date: Sat, Jan 3, 2015 at 9:38 AM
Subject: campaign Campaign (.INFO)
To:


Visitor's IP: 58.165.117.
name: Second Dude
suburb: Victoria point
email: second@gmail.com.au
phone: 04035752
powerbill: $500
System_Required: 3kW
Date:Sat-Jan-2015 10:38:00
Key:

from: Adwords  - landing page



---------- Forwarded message ----------
From: 
Date: Sat, Jan 3, 2015 at 9:38 AM
Subject: campaign Campaign (.INFO)
To: 


Visitor's IP: 58.165.117.
name: Stuart Richards
suburb: Victoria point
email: mottu@gmail.com
phone: 04035752
powerbill: $500
System_Required: 3kW
Date:Sat-Jan-2015 10:38:00
Key:

from: Adwords  - landing page

1 Answer:

Answer 0 (score: 0)

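One idea is to build a pattern that extracts both pieces of information (name and email) in a single match, and to use the PREG_SET_ORDER option so that everything belonging to one match (the whole match and the capture groups) ends up in the same item of the result array: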

$mbox = file_get_contents("test.mbox"); // file contents, as in the question

$pattern = '~
^name: \h* (?<name> [^\r\n]+ ) \R
.* \R                              # skip the suburb line
email: \h* (?<mail> [^\r\n]+ )
~mx';

if (preg_match_all($pattern, $mbox, $m, PREG_SET_ORDER)) {
    foreach($m as $item) {
        echo $item['name'] . ',' . $item['mail'] . PHP_EOL;
    }
}
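Since the question asks for a CSV file rather than echoed text, here is a minimal sketch of the same loop writing its rows with fputcsv(). The inline sample text and the php://memory stream are assumptions made so the snippet runs on its own; in practice you would read the real file and open 'test.csv' instead. The delimiter, enclosure and escape arguments are just the long-standing defaults, spelled out explicitly.

```php
<?php
// Hypothetical sample input; in practice: $mbox = file_get_contents("test.mbox");
$mbox = <<<'MBOX'
name: Andrew Cowley
suburb: Victoria point
email: andrew@gmail.com
phone: 04035752

name: Second Dude
suburb: Victoria point
email: second@gmail.com.au
phone: 04035752
MBOX;

$pattern = '~
^name: \h* (?<name> [^\r\n]+ ) \R
.* \R                              # skip the suburb line
email: \h* (?<mail> [^\r\n]+ )
~mx';

// One row per contact, because PREG_SET_ORDER groups the results by match.
$rows = array();
if (preg_match_all($pattern, $mbox, $m, PREG_SET_ORDER)) {
    foreach ($m as $item) {
        $rows[] = array($item['name'], $item['mail']);
    }
}

// In-memory stream for the demo; swap for fopen('test.csv', 'w') to write a file.
$fp = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    fputcsv($fp, $row, ',', '"', '\\'); // fputcsv adds quoting where needed
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv;
```

Using fputcsv() instead of joining with ',' by hand means a name that contains a comma or a quote still ends up in a single CSV field.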

Glossary:

\R             # any kind of newline sequence (\r\n, \n or \r)
\h             # a horizontal whitespace character
(?<blah>...)   # a named capture group
^              # is by default an anchor for the start of the string,
               # but when the m modifier is used, it becomes an anchor 
               # for the start of the line.
m modifier     # makes ^ and $ match at the start and end of each line
x modifier     # switches on free-spacing mode (also called comment mode or verbose mode)
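To see what PREG_SET_ORDER changes, the sketch below runs the same kind of pattern with and without the flag on a tiny made-up string (the two contacts are placeholders, not real data). Without the flag, preg_match_all() uses PREG_PATTERN_ORDER and groups the results by capture group, which is the "crosstab" layout from the question; with the flag, each item of the result holds one complete contact.

```php
<?php
// Tiny hypothetical input with two contacts.
$text = "name: Andrew\nemail: andrew@gmail.com\n"
      . "name: Second Dude\nemail: second@gmail.com.au\n";

$pattern = '~
^name:  \h* (?<name> [^\r\n]+ ) \R
email:  \h* (?<mail> [^\r\n]+ )
~mx';

// Default (PREG_PATTERN_ORDER): one array per capture group.
preg_match_all($pattern, $text, $byGroup);

// PREG_SET_ORDER: one array per match, holding all of its groups.
preg_match_all($pattern, $text, $byMatch, PREG_SET_ORDER);

// All names together, regardless of which contact they belong to.
echo implode(' | ', $byGroup['name']), PHP_EOL;

// Name and mail of the second contact, side by side.
echo $byMatch[1]['name'], ',', $byMatch[1]['mail'], PHP_EOL;
```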

You can easily change this pattern to add other fields.
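For instance, here is a sketch of the pattern extended with the phone field, assuming the phone: line always comes directly after the email: line, as it does in the sample from the question. The group name phone and the shortened sample text are my own choices for illustration.

```php
<?php
// Shortened sample in the same layout as the test file from the question.
$mbox = <<<'MBOX'
Visitor's IP: 58.165.117.
name: Andrew Cowley
suburb: Victoria point
email: andrew@gmail.com
phone: 04035752
powerbill: $500

Visitor's IP: 58.165.117.
name: Second Dude
suburb: Victoria point
email: second@gmail.com.au
phone: 04035752
powerbill: $500
MBOX;

$pattern = '~
^name:  \h* (?<name>  [^\r\n]+ ) \R
.* \R                               # skip the suburb line
email:  \h* (?<mail>  [^\r\n]+ ) \R
phone:  \h* (?<phone> [^\r\n]+ )    # assumed to follow the email line directly
~mx';

$contacts = array();
if (preg_match_all($pattern, $mbox, $m, PREG_SET_ORDER)) {
    foreach ($m as $item) {
        $contacts[] = array($item['name'], $item['mail'], $item['phone']);
    }
}

foreach ($contacts as $contact) {
    echo implode(',', $contact), PHP_EOL;
}
```

If the order of the lines can vary between messages, a fixed sequence like this will miss records, and matching each message block separately would probably be more robust.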

Note: if the mail or the name can sometimes be empty, you can change [^\r\n]+ to [^\r\n]*
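A small sketch of the difference, using a made-up record whose name: line is empty (the variable names $strict and $lenient are only for illustration). With + the incomplete record is skipped entirely; with * it is kept and the name comes back as an empty string.

```php
<?php
// Hypothetical input: the second record has an empty name.
$mbox = <<<'MBOX'
name: Andrew Cowley
suburb: Victoria point
email: andrew@gmail.com

name:
suburb: Victoria point
email: noname@example.com
MBOX;

// "+" requires at least one character after "name:" and "email:".
$strict = '~
^name: \h* (?<name> [^\r\n]+ ) \R
.* \R
email: \h* (?<mail> [^\r\n]+ )
~mx';

// "*" also accepts an empty value.
$lenient = '~
^name: \h* (?<name> [^\r\n]* ) \R
.* \R
email: \h* (?<mail> [^\r\n]* )
~mx';

$strictCount  = preg_match_all($strict, $mbox, $a, PREG_SET_ORDER);
$lenientCount = preg_match_all($lenient, $mbox, $b, PREG_SET_ORDER);

echo "strict: $strictCount, lenient: $lenientCount", PHP_EOL;
```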

Instead of building a CSV, you can use the foreach loop to insert the values into the database. I suggest starting a transaction before the loop:

if (preg_match_all($pattern, $mbox, $m, PREG_SET_ORDER)) {
    try {  
        $dbh = new PDO("mysql:host=$hostname;dbname=$dbname", $usr, $pwd);
        $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);      
        $dbh->beginTransaction();
        $query = 'INSERT INTO myTable (name, mail) VALUES (?, ?)';
        $sth = $dbh->prepare($query);

        foreach($m as $item) {
            $sth->execute(array($item['name'], $item['mail']));
        }

        $dbh->commit();
    } catch(PDOException $e) {
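        // $dbh is unset if the connection itself failed, so only roll back an open transaction
        if (isset($dbh) && $dbh->inTransaction()) {
            $dbh->rollBack();
        }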
        echo "Error: " . $e->getMessage();
    }
    $dbh = null;
}