Question

I have some large CSV files with about 285 columns each. Across the files there are more than a million rows.

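To read each line I use fgets, which runs quickly. From there I tried calling str_getcsv on the line, which averages 0.001421 seconds per line. That doesn't sound like much, but over 1,000,000 lines it adds up to 1421 seconds, or about 24 minutes. To speed things up, I do as many plain string comparisons as I can before trying to parse the CSV. If my checks decide a line is irrelevant, it gets skipped.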
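A minimal sketch of the skip-before-parsing idea described above. The function name filteredRows and the single substring check are assumptions made for illustration; the real checks in the question are not shown.

```php
<?php
// Hypothetical helper: yields parsed rows only for lines that contain $needle.
function filteredRows(string $path, string $needle): Generator {
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($fh)) !== false) {
            // Cheap substring test first: irrelevant lines are skipped unparsed.
            if (strpos($line, $needle) === false) {
                continue;
            }
            // Only lines that pass the check pay the full parsing cost.
            yield str_getcsv(rtrim($line, "\r\n"), ',', '"', '\\');
        }
    } finally {
        fclose($fh);
    }
}
```

Iterating with foreach over filteredRows($path, $needle) then hands back one parsed row at a time, so memory use stays flat however large the file is.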

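My problem comes up when I need the indexed values to run more advanced comparisons on the data. Is str_getcsv the fastest option, or is there a faster way to get a line into an array? My first thought was to use explode, but the data has quoted values, and some of those values contain commas. I only ever need to work with one line at a time, if that helps with any parsing rules.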
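To illustrate why explode is not enough here, a small example with a made-up sample line: explode splits on the comma inside the quoted value, while str_getcsv keeps the value intact.

```php
<?php
// Made-up sample line with a comma inside a quoted value.
$line = '1,"Smith, John",42';

// explode() splits on every comma, including the one inside the quotes.
$naive = explode(',', $line);

// str_getcsv() honours the quoting rules.
$parsed = str_getcsv($line, ',', '"', '\\');

var_dump(count($naive));  // int(4)
var_dump(count($parsed)); // int(3)
var_dump($parsed[1]);     // string(11) "Smith, John"
```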

Answer 1

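I ended up writing a solution for myself, but I'm curious whether anyone else can test it against a larger data set. Parsing a line with str_getcsv averaged 0.0014 seconds. Parsing with this code averages 0.0002 seconds. It could definitely use extra work to make it more flexible, but for simple CSV with quoted values it works fine for my purposes.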

function _csv2array($line) {
  $ret = [0 => '']; // Start with a single empty field
  $idx = 0; // Index of the field being built
  $lastpos = 0; // Offset to resume searching from; no commas found yet
  while (($pos = strpos($line, ',', $lastpos)) !== false) { // While we find another comma
    $ret[$idx] .= substr($line, $lastpos, $pos - $lastpos); // Append the text up to this comma
    if (substr($ret[$idx], 0, 1) === '"') { // The field started with a quote
      // It is complete only if it also ends with a quote and holds an even number of quotes
      if (substr($ret[$idx], -1) === '"' && substr_count($ret[$idx], '"') % 2 === 0) {
        // Drop only the outer pair of quotes (trim() would also eat escaped quotes at the ends), then unescape "" to "
        $ret[$idx] = str_replace('""', '"', substr($ret[$idx], 1, -1));
        $ret[++$idx] = ''; // Start the next field
      } else {
        $ret[$idx] .= ','; // Still inside a quoted field: this comma is data, keep it
      }
    } else { // Unquoted field
      $ret[++$idx] = ''; // Advance to the next field
    }
    $lastpos = $pos + 1; // Resume the search AFTER this comma
  }
  $ret[$idx] = rtrim($ret[$idx] . substr($line, $lastpos), "\r\n"); // Add whatever follows the last comma, minus the line ending
  if (strlen($ret[$idx]) > 1 && $ret[$idx][0] === '"' && substr($ret[$idx], -1) === '"') { // Last field is quoted
    $ret[$idx] = str_replace('""', '"', substr($ret[$idx], 1, -1)); // Same unquoting as above
  }
  return $ret; // Return the array of fields
}
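
For anyone who wants to repeat the timing comparison, here is a rough benchmarking sketch. The helper name averageParseTime and the synthetic sample line are assumptions, not part of the original measurements, and results will vary with PHP version and data shape.

```php
<?php
// Hypothetical helper: average seconds per call of $parser across $lines.
function averageParseTime(callable $parser, array $lines): float {
    $start = hrtime(true); // Monotonic clock in nanoseconds (PHP 7.3+)
    foreach ($lines as $line) {
        $parser($line);
    }
    return (hrtime(true) - $start) / 1e9 / max(count($lines), 1);
}

// Synthetic sample data; real files with ~285 columns will behave differently.
$lines = array_fill(0, 10000, '1,"Smith, John",42,"x"');

$builtin = averageParseTime(function ($l) {
    return str_getcsv($l, ',', '"', '\\');
}, $lines);

printf("str_getcsv: %.7f s per line\n", $builtin);
```

Any other parser can be timed the same way by passing it as the callable, for example the string '_csv2array' once that function is defined in the same script.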

What is the fastest way to parse a large CSV file into an array in PHP?
