Question

Suppose I have two CSV files. The first has the format:

id(unique int),owner_id(non-unique int),string

It contains 50-100 million rows, a few GB in size.

The second has the format:

integer,integer

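The second file contains a billion rows. I want to get all rows of file 2 where both the first and the second column values exist somewhere in the second column (owner_id) of the first file.

The most efficient way would be to load the unique owner_id values into memory and do a binary search for each pair in the second file. I don't know whether something like this can be done in BASH. I could do it in Python (a simple script that is given the two files, reads and loads them, and spits out the second file with only the valid pairs).

But I'd rather not add a Python dependency if possible.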

Answer 1

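This may fail because of memory limits. I called the file with 3 columns file1 and the file with the ID pairs file2. Copy and paste the code snippets into a file and edit the names as needed.

Step one: make file 1 as small as possible.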

#!/bin/bash
declare -a Array
Count=0

The first and third columns aren't needed, so drop them, sort the file, and keep only the unique entries.

InitFile ()
{
   # Keep only column 2 (owner_id), sort numerically, drop duplicates.
   while IFS=, read -r ignore1 stuff ignore2; do echo "$stuff"; done < file1 | sort -n | uniq > "$1"
}

Read it into the array:

InitArray ()
{
   # Quote the target so the shell cannot glob-expand "Array[...]".
   while read -r "Array[$Count]"; do
     let Count++
   done < "$1"
}

Binary search for a value in the array:

BinarySearch ()
{
   # Return 0 if $1 is in the sorted Array, 1 otherwise.
   # The midpoint is always taken between the current bounds.
   local val=$1 low=0 high=$((Count - 1)) mid
   while [ "$low" -le "$high" ]; do
      mid=$(( (low + high) / 2 ))
      if [ "${Array[$mid]}" -eq "$val" ]; then return 0
      elif [ "${Array[$mid]}" -lt "$val" ]; then low=$((mid + 1))
      else high=$((mid - 1)); fi
   done
   return 1
}

uniqueOwnerIdFile is created from the first file and then loaded into the array:

InitFile uniqueOwnerIdFile
InitArray uniqueOwnerIdFile

Loop over each line of the second file and look up both values in the owner ID array. Echo each line that is found to linesThatExistFile.

while IFS=, read firstVal secondVal; do
   if BinarySearch $firstVal && BinarySearch $secondVal ; then echo "$firstVal,$secondVal" ; fi
done < file2 > linesThatExistFile
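
If bash 4 or newer is available, an associative array might replace the hand-written binary search, since it gives a direct membership test. The following is only a sketch under that assumption and not part of the answer above. The sample rows and the names owners, tmp and result are made up for illustration, and the same memory caveat applies to a file with tens of millions of distinct owner IDs.

```shell
#!/bin/bash
# Sketch only: needs bash 4+ for "declare -A". Sample data is hypothetical.
tmp=$(mktemp -d)
printf '%s\n' '1,10,foo' '2,20,bar' '3,10,baz' '4,30,qux' > "$tmp/file1"
printf '%s\n' '10,20' '10,99' '30,10' '7,8' '20,20' > "$tmp/file2"

# Record every owner_id (column 2 of file1) as a key of the hash.
declare -A owners
while IFS=, read -r id owner rest; do
   owners[$owner]=1
done < "$tmp/file1"

# Keep the lines of file2 whose two values are both keys of the hash.
result=$(
   while IFS=, read -r a b; do
      [[ -n ${owners[$a]} && -n ${owners[$b]} ]] && echo "$a,$b"
   done < "$tmp/file2"
)
echo "$result"
rm -rf "$tmp"
```

With this sample data the lines 10,20, 30,10 and 20,20 are kept. Empty fields are not handled here, because bash rejects an empty associative array subscript.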

Answer 2

I'm not sure about a pure bash solution, but I can offer one using awk:

awk -F"," 'NR==FNR{col3[$2]++;next;}{ if ($1 in col3 && $2 in col3) print $0} ' File1 File2

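It first reads the second column of the first file into an associative array, then checks for each line of the second file whether both of its values are in that array.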
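
To make the two passes concrete, here is a small run of the same idea on made-up data. The sample rows, the temporary directory and the array name owners are assumptions for illustration. The awk logic is meant to mirror the one-liner above, not to replace it.

```shell
#!/bin/bash
# Hypothetical sample data, made up for illustration only.
tmp=$(mktemp -d)
printf '%s\n' '1,10,foo' '2,20,bar' '3,10,baz' '4,30,qux' > "$tmp/File1"
printf '%s\n' '10,20' '10,99' '30,10' '7,8' '20,20' > "$tmp/File2"

# Pass 1 (NR==FNR is true only while reading File1): remember each owner_id.
# Pass 2 (File2): print the line if both fields were remembered.
result=$(awk -F, 'NR==FNR { owners[$2]; next }
                  ($1 in owners) && ($2 in owners)' "$tmp/File1" "$tmp/File2")
echo "$result"
rm -rf "$tmp"
```

Here the owner IDs are 10, 20 and 30, so the output is the three lines 10,20, 30,10 and 20,20. The lines 10,99 and 7,8 are dropped because 99, 7 and 8 never appear as an owner_id.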

Answer 3

In bash, something like this might work.

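#!/bin/bash

# Wrap the list in newlines so each ID is matched as a whole line;
# a plain regex/substring test would let 12 match 123.
list=$'\n'$(cut -f2 -d, file1.txt | sort -u)$'\n'

while IFS=, read -r a b; do
  [[ $list == *$'\n'"$a"$'\n'* && $list == *$'\n'"$b"$'\n'* ]] && echo "$a,$b"
done <file2.txt >result.txt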

However, I don't know much about how well it performs.

Answer 4

A Perl solution. It remembers all owners from file 1 in a hash, then goes through file 2 and outputs the lines where both owners exist in the hash.

#!/usr/bin/perl
use warnings;
use strict;

open my $F1, '<', 'file1' or die $!;
my %owner;
while (<$F1>) {
    $owner{(split /,/ => $_, 3)[1]} = 1;
}

open my $F2, '<', 'file2' or die $!;
while (my $line = <$F2>) {
    chomp $line;
    print "$line\n" if 2 == grep exists $owner{$_}, split /,/ => $line, 2;
}

A Bash pipeline that gives the same output but is noticeably slower:

cut -d, -f2 file1 \
    | grep -vwFf- <(sed 's/,/\n/' file2) \
    | grep -vwFf- file2
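
The pipeline works by double negation, which can be easier to follow when the intermediate results are written out. The sketch below splits it into three steps on made-up data. The sample rows and the file names numbers and non_owners are assumptions for illustration, and tr with temporary files stands in for the sed call and process substitution used above.

```shell
#!/bin/bash
# Hypothetical sample data, made up for illustration only.
tmp=$(mktemp -d)
printf '%s\n' '1,10,foo' '2,20,bar' '3,10,baz' '4,30,qux' > "$tmp/file1"
printf '%s\n' '10,20' '10,99' '30,10' '7,8' '20,20' > "$tmp/file2"

# Step 1: every number that appears in file2, one per line.
tr ',' '\n' < "$tmp/file2" > "$tmp/numbers"

# Step 2: remove the numbers that are owner_ids; what remains are non-owners.
cut -d, -f2 "$tmp/file1" | grep -vwFf - "$tmp/numbers" > "$tmp/non_owners"

# Step 3: remove the file2 lines that mention any non-owner.
result=$(grep -vwFf "$tmp/non_owners" "$tmp/file2")
echo "$result"
rm -rf "$tmp"
```

With this data, step 2 leaves 99, 7 and 8 as non-owners, so step 3 keeps 10,20, 30,10 and 20,20.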

BASH: filter a huge list of numbers by whether they are contained in another huge list
