Question

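Example 1:

I have a PDF document and used PDF Parser online (www.pdfparser.org) to display all of its content as text. I saved that content to a TXT file (manually) and tried to filter some data with a regular expression, and everything worked.

Example 2:

To automate this process, I downloaded the PDF Parser API and wrote a script that follows these steps:

1) Convert the PDF to text with the ParseFile() API method 2) Save the content to a TXT file 3) Try to filter that TXT with the regular expression.

Example 1 -> it works and returns: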

array (size=2)
  'mora_dia' =>
    array (size=1)
      0 => string 'R$ 3.44' (length=7)
  'multa' =>
    array (size=1)
      0 => string 'R$ 17,21' (length=8)

Example 2 -> it does not work:

array (size=2)
  'mora_dia' =>
    array (size=0)
      empty
  'multa' =>
    array (size=0)
      empty

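The data in the two TXT files looks identical, so why does the second example not work? (I also tried doing this without saving the TXT file, but that did not work either.)

Here is the code for both examples:

Example 1: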

$data = file_get_contents('exemplo_01.txt');

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $data, $matches[$title]);
}

var_dump($matches);

Example 2:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($PDFFile);
$pages = $pdf->getPages();

foreach ($pages as $page) {
    $PDFParse = $page->getText();
}

$txtName = __DIR__ . '/files/Txt/' . md5(uniqid(rand(), true)) . '.txt';
$file  = fopen($txtName, 'w+');
fwrite($file, $PDFParse);
fclose($file);

$dataTxt = file_get_contents($txtName);

$regex = [
    'mora_dia' => '/R\$ [0-9]{1,}\.[0-9]{1,}/i',
    'multa'    => '/R\$ [0-9]{1,}\,[0-9]{1,}/i'
];

foreach($regex as $title => $ex)
{
    preg_match($ex, $dataTxt, $matches[$title]);
}

Answer 1

 $PDFParse ='';
 foreach ($pages as $page) {
     $PDFParse = $PDFParse.$page->getText();
 }

Initialize $PDFParse as a string and append each page to it, as above. Also try calling fflush($file) after fwrite().
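A minimal sketch of the suggestion above, with a plain array standing in for the per-page strings returned by getText() (the array contents and the temporary file are hypothetical, not taken from the question's PDF). It shows that assigning inside the loop keeps only the last page, while appending keeps every page. It also uses file_put_contents(), which opens, writes and closes the file in one call, as one way to avoid reading the file back before the data has been written out.

```php
<?php
// Hypothetical stand-in for the strings returned by $page->getText().
$pageTexts = ["Page one: R$ 3.44\n", "Page two: R$ 17,21\n"];

// Assigning inside the loop overwrites the previous page each time.
$lastOnly = '';
foreach ($pageTexts as $text) {
    $lastOnly = $text;
}

// Appending keeps the text of every page.
$allPages = '';
foreach ($pageTexts as $text) {
    $allPages .= $text;
}

// Write to a temporary file and read it back, then clean up.
$txtName = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($txtName, $allPages);
$roundTrip = file_get_contents($txtName);
unlink($txtName);

var_dump($lastOnly === $pageTexts[1]); // true when only the last page survived
var_dump($roundTrip === $allPages);    // true when all pages were written and read back
```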

Answer 2

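Manually copying and pasting the output text appears to have changed its content. According to the pastebin output, the version written directly to the file contains non-breaking space characters instead of regular spaces. A non-breaking space has hex code 0xA0 (decimal 160), whereas a regular space is hex 0x20 (decimal 32).

In fact, it looks as though all of the space characters in the direct-to-file sample are non-breaking 0xA0 spaces.

To make the regular expression accept either kind of space, put the hex code in a [] character class alongside a regular space ' ', as in [ \xA0], which matches either one. You also need the /u flag so that the pattern and the text are treated as Unicode.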

$regex = [
    'mora_dia' => '/R\$[ \xA0][0-9]{1,}\.[0-9]{1,}/iu',
    'multa'    => '/R\$[ \xA0][0-9]{1,},[0-9]{1,}/iu'
];

(Note that the comma , does not need a backslash escape.)

This works correctly with your original pastebin as input:

$str = file_get_contents('http://pastebin.com/raw.php?i=H7D5xJBH');
preg_match('/R\$[ \xa0][0-9]{1,}\.[0-9]{1,}/ui', $str, $matches);
var_dump($matches);

// Prints:
array(1) {
  [0] =>
  string(8) "R$ 3.44"
}

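Another solution would be to replace the non-breaking spaces with regular spaces throughout the text before applying your original regular expression: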

// Replace all non-breaking spaces with regular spaces in the
// text string read from the file...
// In UTF-8 the non-breaking space (U+00A0) is encoded as the two
// bytes 0xC2 0xA0, and both are needed to replace it successfully.
$dataTxt = str_replace("\xC2\xA0", " ", $dataTxt);
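The replacement approach can be sketched end to end as follows, assuming the text is UTF-8, where the non-breaking space is the byte pair 0xC2 0xA0. The helper name normalizeSpaces and the sample string are hypothetical and only serve the illustration; the pattern is the original one from the question.

```php
<?php
// Hypothetical helper: replace UTF-8 non-breaking spaces with regular spaces.
function normalizeSpaces(string $text): string
{
    // "\xC2\xA0" is the UTF-8 byte sequence for U+00A0.
    return str_replace("\xC2\xA0", ' ', $text);
}

// Sample input built by hand; the space after "R$" is a non-breaking space.
$dataTxt = "Mora/dia: R$\xC2\xA03.44";

// The original pattern, which expects a regular space after "R$".
$pattern = '/R\$ [0-9]{1,}\.[0-9]{1,}/i';

// Match against the raw text, then against the normalized text.
$before = preg_match($pattern, $dataTxt);
$after  = preg_match($pattern, normalizeSpaces($dataTxt), $m);

var_dump($before, $after, $m[0]);
```

Normalizing once, right after reading the file, keeps every later pattern free of special cases for the non-breaking space.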

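Whenever two inputs that you expect to be identical only look the same on screen, be sure to check them with a tool that can display the hex code of each character. In this case, I copied your sample from the pastebin into a file and inspected it with Vim, where I have the hex and decimal values of the character under the cursor displayed.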
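If an editor with a hex display is not at hand, a similar check can be done from PHP itself. The sketch below uses two hand-built sample strings (hypothetical, not the pastebin data) and bin2hex() to print their raw bytes, assuming UTF-8 text.

```php
<?php
// Two strings that look the same on screen; the second one uses a
// UTF-8 non-breaking space (bytes C2 A0) instead of a regular space (20).
$regular     = 'R$ 3.44';
$nonBreaking = "R$\xC2\xA03.44";

// bin2hex() shows the raw bytes, which makes the difference visible.
$regularHex     = bin2hex($regular);
$nonBreakingHex = bin2hex($nonBreaking);
echo $regularHex, "\n";
echo $nonBreakingHex, "\n";

// The byte lengths differ as well, which is another quick hint.
var_dump(strlen($regular), strlen($nonBreaking));
```

A byte length that is larger than the number of visible characters, as in the string(8) output shown earlier for a seven-character match, is often the first sign of a multi-byte character hiding in the text.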

preg_match() + regex does not work on a TXT file
