Question

I need to pull together a large amount of information that is spread across several files, e.g.

file1:

 10948|Book|Type1

file2:

SHA512||0||10948

file3:

0|10948|SHA512|c3884fbd7fc122b5273262b7a0398e63

I want to turn it into something like this:

 c3884fbd7fc122b5273262b7a0398e63|SHA512|Type1|Book

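I don't have access to the actual database. Is there any way to do this? Basically I'm looking for something more efficient than $id = $file1[0]; if($file3[1] == $id) or the like.

Each CSV file has 100k-300k rows. I don't care if it takes a while; I can just let it run on EC2 for a while.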

Answer 1

$data = array();

// file1: id|val1|val2
$fh = fopen('file1', 'r') or die("Unable to open file1");
while(list($id, $val1, $val2) = fgetcsv($fh, 0, '|')) {
   $data[$id]['val1'] = $val1;
   $data[$id]['val2'] = $val2;
}
fclose($fh);

// file2: method||0||id (empty list() slots skip the unused fields)
$fh = fopen('file2', 'r') or die ("Unable to open file2");
while(list($method, , , , $id) = fgetcsv($fh, 0, '|')) {
   $data[$id]['method'] = $method;
}
fclose($fh);

// file3: 0|id|method|hash
$fh = fopen('file3', 'r') or die("Unable to open file3");
while(list(, $id, , $hash) = fgetcsv($fh, 0, '|')) {
   $data[$id]['hash'] = $hash;
}
fclose($fh);

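Tedious, but you should end up with an array containing the data you want. Outputting it as another CSV is left as an exercise for the reader (hint: see fputcsv()).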
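For the output step hinted at above, here is a minimal sketch. It assumes $data has the shape built by the loops in this answer (keys val1, val2, method, hash) and that ids missing from any of the three files should simply be dropped. The helper name writeMerged and the incomplete sample row are made up for illustration, and the output goes to php://memory rather than a real file so the snippet runs on its own.

```php
<?php
// Hypothetical helper: write each complete record as hash|method|val2|val1.
function writeMerged(array $data, $out): int
{
    $written = 0;
    foreach ($data as $id => $row) {
        // Skip ids that did not appear in all three files (assumed policy).
        if (!isset($row['val1'], $row['val2'], $row['method'], $row['hash'])) {
            continue;
        }
        fputcsv($out, [$row['hash'], $row['method'], $row['val2'], $row['val1']], '|', '"', '\\');
        $written++;
    }
    return $written;
}

// First row is the sample from the question; id 99999 is an invented incomplete row.
$data = [
    '10948' => ['val1' => 'Book', 'val2' => 'Type1', 'method' => 'SHA512',
                'hash' => 'c3884fbd7fc122b5273262b7a0398e63'],
    '99999' => ['val1' => 'Book', 'val2' => 'Type2'],
];

$out = fopen('php://memory', 'w+');
$count = writeMerged($data, $out);
rewind($out);
$csv = stream_get_contents($out);
fclose($out);
echo $csv;
```

To write a real file, open it with fopen('merged.csv', 'w') (file name assumed) and pass that handle instead.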

Answer 2

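All three files appear to have a common field (in your example, "10948" is common to all three rows). If you're not worried about using a lot of memory, you can load all three files into arrays, use the common field as the array key, and recombine the three files with loops.

For example: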

$result = array();

// File 1: the id is the first field
$fh = fopen('file1', 'r');

while ( ($data = fgetcsv($fh, 0, '|')) !== FALSE )
  $result[$data[0]] = $data;

fclose($fh);

// File 2: the id is the fifth field (SHA512||0||10948)
$fh = fopen('file2', 'r');

while ( ($data = fgetcsv($fh, 0, '|')) !== FALSE )
  $result[$data[4]] = array_merge($result[$data[4]], $data);

fclose($fh);

// File 3: the id is the second field
$fh = fopen('file3', 'r');

while ( ($data = fgetcsv($fh, 0, '|')) !== FALSE )
  $result[$data[1]] = array_merge($result[$data[1]], $data);

fclose($fh);
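One thing the loops above do not handle is an id that shows up in file2 or file3 but never in file1; array_merge would then receive a missing entry. A possible guard is sketched below, assuming such orphan rows should be ignored (inner-join behaviour). The helper name mergeByKey and the 77777 row are invented for illustration, and the rows are parsed from strings so the snippet is self-contained.

```php
<?php
// Hypothetical helper: append each row's fields to the entry that shares its id.
function mergeByKey(array $result, array $rows, int $keyIndex): array
{
    foreach ($rows as $row) {
        $id = $row[$keyIndex] ?? null;
        // Ignore rows whose id was never seen in file1 (assumed inner-join policy).
        if ($id === null || !isset($result[$id])) {
            continue;
        }
        $result[$id] = array_merge($result[$id], $row);
    }
    return $result;
}

// Rows from the question, parsed from strings instead of files.
$file1 = [str_getcsv('10948|Book|Type1', '|', '"', '\\')];
$file2 = [
    str_getcsv('SHA512||0||10948', '|', '"', '\\'),
    str_getcsv('MD5||0||77777', '|', '"', '\\'), // invented orphan row
];
$file3 = [str_getcsv('0|10948|SHA512|c3884fbd7fc122b5273262b7a0398e63', '|', '"', '\\')];

// Index file1 by its first field, then merge the other two files in.
$result = [];
foreach ($file1 as $row) {
    $result[$row[0]] = $row;
}
$result = mergeByKey($result, $file2, 4);
$result = mergeByKey($result, $file3, 1);

print_r($result);
```

Because array_merge renumbers numeric keys, the merged entry is one flat list: the fields of file1, then file2, then file3.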

Answer 3

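I'd suggest doing a merge sort with basic unix tools:
a) Sort your .CSV files by the column that is common to every file: sort -t'|' -k?,? (where ? is the number of that column).
b) Use the unix 'join' command to output the records common to pairs of .CSV files. The 'join' command only works on 2 files at a time, so you have to "chain" the results for multiple data sources: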

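  # 'x' and 'y' are the numbers of the common field in the file being sorted/joined
  # join needs sorted input; its output starts with the common field (field 1)
  sort -t'|' -kx,x "fileA" > fileA.sorted
  sort -t'|' -ky,y "fileB" > fileB.sorted
  join -t'|' -1 x -2 y "fileA.sorted" "fileB.sorted" > file1

  sort -t'|' -kx,x "fileC" > fileC.sorted
  join -t'|' -1 1 -2 x "file1" "fileC.sorted" > file2

  sort -t'|' -kx,x "fileD" > fileD.sorted
  join -t'|' -1 1 -2 x "file2" "fileD.sorted" > file3
  etc...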

This is very fast, and it lets you filter your .CSV files as if an ad hoc database join were taking place.

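If you have to write your own merge sort in PHP (read: Merge Sort):

The simplest implementation of a merge sort over .CSV files has 2 stages: a) unix sort your files, then b) "merge" all the sources in parallel, reading one record from each and looking for the case where the value in the common field matches across all the other sources (a JOIN in database terminology):
Rule 1) Skip a record whose common value is less than (<) that of ALL other sources.
Rule 2) When a record's common value is equal (==) to that of ALL other sources, you have a match.
Rule 3) When a record's common value is equal (==) to that of only SOME other sources, you can apply "LEFT-JOIN" logic if desired; otherwise skip that record from all sources.

Pseudocode for joining multiple files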

read 1st record from every data source;
while "record exists from all data sources"; do
    for A in each Data-Source ; do
        set cntLower=0       # other sources whose field is greater than A's
        set cntMissMatch=0   # other sources whose field differs from A's
        for B in each Data-Source other than A; do
            if A.field < B.field then
               cntLower+=1
            end if
            if A.field != B.field then
               cntMissMatch+=1
            end if
        end for

        if cntMissMatch == 0 then
            we have a match, process this record;
            read in next record from ALL data-sources ;
            break; # start over again looking for lowest
        else
            if cntLower == cntMissMatch then
                # found record with lowest value that does NOT match
                # ALL data-sources. You can choose to have 'LEFT-JOIN'
                # logic at this point (spit the record out anyway);
                # otherwise just skip it.
                read next record in current Data-source;
                break;  # start over again looking for lowest
            end if
        end if
    end for
done
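The pseudocode above could look roughly like this in PHP. It is only a sketch under some assumptions: each source is already sorted on the common field and held in memory as an array of rows, keys are unique within a source, and keys are compared as byte strings with strcmp (so the files would need to be sorted the same way, e.g. with LC_ALL=C sort). The function name mergeJoin, the source layout, and the second sample row in each source are invented for illustration. It implements a plain inner join, with no LEFT-JOIN branch.

```php
<?php
// Hypothetical helper: inner-join any number of sources sorted by their key.
// Each source is ['key' => index of the common field, 'rows' => sorted rows].
function mergeJoin(array $sources): array
{
    $sources = array_values($sources);
    if ($sources === []) {
        return [];
    }
    $pos = array_fill(0, count($sources), 0); // current row index per source
    $joined = [];
    while (true) {
        // Read the current key of every source; stop once any source runs out.
        $keys = [];
        foreach ($sources as $i => $src) {
            if (!isset($src['rows'][$pos[$i]])) {
                return $joined;
            }
            $keys[$i] = (string) $src['rows'][$pos[$i]][$src['key']];
        }
        // Find the lowest key among the current records.
        $lowest = $keys[0];
        foreach ($keys as $k) {
            if (strcmp($k, $lowest) < 0) {
                $lowest = $k;
            }
        }
        if (count(array_unique($keys)) === 1) {
            // Every source agrees: emit the match and advance all sources.
            $match = [];
            foreach ($sources as $i => $src) {
                $match[] = $src['rows'][$pos[$i]];
                $pos[$i]++;
            }
            $joined[] = $match;
        } else {
            // No full match: skip the record(s) holding the lowest key.
            foreach ($keys as $i => $k) {
                if ($k === $lowest) {
                    $pos[$i]++;
                }
            }
        }
    }
}

// First row of each source is from the question; the second rows are invented.
$sources = [
    ['key' => 0, 'rows' => [['10948', 'Book', 'Type1'], ['20001', 'DVD', 'Type2']]],
    ['key' => 4, 'rows' => [['SHA512', '', '0', '', '10948'], ['MD5', '', '0', '', '30000']]],
    ['key' => 1, 'rows' => [['0', '10948', 'SHA512', 'c3884fbd7fc122b5273262b7a0398e63'],
                            ['0', '20001', 'MD5', 'invented-hash']]],
];

$lines = [];
foreach (mergeJoin($sources) as [$f1, $f2, $f3]) {
    $lines[] = implode('|', [$f3[3], $f2[0], $f1[2], $f1[1]]);
}
echo implode("\n", $lines), "\n";
```

For files of 100k-300k rows the same loop would work with one fgetcsv() handle per source instead of in-memory arrays, which keeps memory use flat.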

Hope that helps.

Reading multiple CSV files
