Question

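A common task I have to perform is a SQL-like JOIN on two text files: that is, creating a new file from a "left-hand" and a "right-hand" file, using some kind of join on an identifier column shared between them. Sometimes variants such as outer joins are needed.

Of course I could write a simple script to do this in a general way, but is there a Python module, built in or installable, that already does this? Something that can handle large files would be ideal.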

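Edit:

I know about PyTables, but is that the simplest solution for flat text files?
By "huge files" I mean that the "left-hand" file is sometimes too large to hold in memory.
The lack (so far) of Python answers worries me. Am I using the wrong tool/paradigm? The reason I asked for a Python library is to make it easy to add other transformations (validating identifiers, etc.) on each line.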

Answer 1

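[wild idea]

Do these files fit in your system's memory with enough room to spare? If so, you could load them into tables with SQLite and then use SQL itself to join them to your heart's content.

[/wild idea]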

Update

~~Scratch that. The OP says one of the files is too large to fit in memory.~~ See @Dave Kirby's answer: SQLite can work with on-disk databases.
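
To make the on-disk idea concrete, here is a minimal sketch using Python's standard `sqlite3` and `csv` modules. It assumes each input is a tab-separated file with exactly two columns (identifier, value); the table names, the `load_table` and `inner_join` helpers, and the file layout are all hypothetical and would need adapting to real data.

```python
import csv
import sqlite3


def load_table(conn, table, path, delimiter="\t"):
    """Stream a two-column (id, value) delimited file into an SQLite table."""
    conn.execute(f"CREATE TABLE {table} (id TEXT, value TEXT)")
    with open(path, newline="") as fh:
        # csv.reader yields one row at a time, so the file is never
        # held in memory as a whole.
        rows = csv.reader(fh, delimiter=delimiter)
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # An index on the join column keeps the join from scanning repeatedly.
    conn.execute(f"CREATE INDEX {table}_id ON {table} (id)")


def inner_join(db_path, left_path, right_path):
    """Yield (id, left_value, right_value) for identifiers present in both files."""
    conn = sqlite3.connect(db_path)  # a file path, so the data lives on disk
    try:
        load_table(conn, "left_t", left_path)
        load_table(conn, "right_t", right_path)
        conn.commit()
        query = (
            "SELECT l.id, l.value, r.value "
            "FROM left_t AS l JOIN right_t AS r ON l.id = r.id "
            "ORDER BY l.id"
        )
        yield from conn.execute(query)
    finally:
        conn.close()
```

Because `inner_join` is a generator, the result can be written out row by row, for example with `csv.writer(out).writerows(inner_join("join.db", "left.tsv", "right.tsv"))`.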

Answer 2

If you are on a unixy system or cygwin, take a look at the join command; it may do exactly what you want.

[26] % join --help
Usage: join [OPTION]... FILE1 FILE2
For each pair of input lines with identical join fields, write a line to
standard output.  The default join field is the first, delimited
by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.

  -a FILENUM        print unpairable lines coming from file FILENUM, where
                      FILENUM is 1 or 2, corresponding to FILE1 or FILE2
  -e EMPTY          replace missing input fields with EMPTY
  -i, --ignore-case ignore differences in case when comparing fields
  -j FIELD          equivalent to `-1 FIELD -2 FIELD'
  -o FORMAT         obey FORMAT while constructing output line
  -t CHAR           use CHAR as input and output field separator
  -v FILENUM        like -a FILENUM, but suppress joined output lines
  -1 FIELD          join on this FIELD of file 1
  -2 FIELD          join on this FIELD of file 2
      --help     display this help and exit
      --version  output version information and exit

Unless -t CHAR is given, leading blanks separate fields and are ignored,
else fields are separated by CHAR.  Any FIELD is a field number counted
from 1.  FORMAT is one or more comma or blank separated specifications,
each being `FILENUM.FIELD' or `0'.  Default FORMAT outputs the join field,
the remaining fields from FILE1, the remaining fields from FILE2, all
separated by CHAR.

Important: FILE1 and FILE2 must be sorted on the join fields.

Report bugs to <bug-coreutils@gnu.org>.
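
For readers who want the same streaming behaviour from Python, the following is a rough sketch of a merge join over two inputs that are already sorted on their first field, which is the same precondition `join` states above. It is not a port of `join`: the `keyed_groups` and `merge_join` names are made up for illustration, only the inner join is covered, and keys are compared with Python's plain string ordering, so the inputs are assumed to have been sorted in a compatible order (for example with `LC_ALL=C sort`).

```python
from itertools import groupby, product


def keyed_groups(lines, sep=None):
    """Yield (key, list of remaining-field lists) for each run of equal keys."""
    rows = (line.rstrip("\n").split(sep) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        yield key, [row[1:] for row in group]


def merge_join(left_lines, right_lines, sep=None):
    """Inner-join two key-sorted line iterables, reading each only once."""
    left = keyed_groups(left_lines, sep)
    right = keyed_groups(right_lines, sep)
    l = next(left, None)
    r = next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)      # left key has no partner, skip it
        elif l[0] > r[0]:
            r = next(right, None)     # right key has no partner, skip it
        else:
            # Matching keys: pair every left row with every right row.
            for l_rest, r_rest in product(l[1], r[1]):
                yield [l[0]] + l_rest + r_rest
            l = next(left, None)
            r = next(right, None)
```

Only the rows sharing the current key are held in memory, so open file objects can be passed directly as `left_lines` and `right_lines` even when the files themselves are large.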

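If you want something more sophisticated, or you have to do it in Python, consider reading the files into an in-memory SQLite database; you then have the full power of SQL to merge and manipulate the data.

Edit: I just read that the files are too big to fit in memory. You can still use SQLite, but create a temporary on-disk database instead.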
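
A hedged sketch of the temporary-database variant, this time with a left outer join and a per-row hook for the kind of validation the question mentions. It relies on SQLite's documented behaviour that an empty filename requests a private temporary on-disk database that is removed when the connection closes. The `left_outer_join` function, its `transform` parameter, and the two-column (id, value) row shape are assumptions made for illustration.

```python
import sqlite3


def left_outer_join(left_rows, right_rows, transform=lambda row: row):
    """Yield (id, left_value, right_value); right_value is None when unmatched.

    left_rows and right_rows are iterables of (id, value) pairs; transform is
    applied to every row before it is stored (e.g. to normalise identifiers).
    """
    # Empty filename: SQLite creates a private temporary on-disk database.
    conn = sqlite3.connect("")
    try:
        conn.execute("CREATE TABLE left_t (id TEXT, value TEXT)")
        conn.execute("CREATE TABLE right_t (id TEXT, value TEXT)")
        conn.executemany("INSERT INTO left_t VALUES (?, ?)", map(transform, left_rows))
        conn.executemany("INSERT INTO right_t VALUES (?, ?)", map(transform, right_rows))
        conn.execute("CREATE INDEX right_id ON right_t (id)")
        query = (
            "SELECT l.id, l.value, r.value "
            "FROM left_t AS l LEFT JOIN right_t AS r ON l.id = r.id "
            "ORDER BY l.id"
        )
        yield from conn.execute(query)
    finally:
        conn.close()
```

Passing generators that parse each file line by line as `left_rows` and `right_rows` keeps the Python side streaming, while SQLite handles the join itself.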
