Fastest way to reformat millions of lines into CSV

Question

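I have a text file with several million lines that should be imported into a MySQL table as quickly as possible. As far as I understand, LOAD DATA is best suited for this.

The data is formatted as follows, where each uppercase letter in parentheses stands for a string: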

(A)(1-3 tabs)(B)
(3 tabs)(C)
(3 tabs)(D)
(3 tabs)(E)

(F)(1-3 tabs)(G)
(3 tabs)(H)
...

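The data therefore needs to be reformatted as CSV, where the first string of each section must be repeated on every following line until the next section starts: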

(A)(tab)(B)
(A)(tab)(C)
(A)(tab)(D)
(A)(tab)(E)
(F)(tab)(G)
(F)(tab)(H)
...

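I am thinking of writing a C program, but could Bash do it just as quickly (and simply)? Is this perhaps a problem that already has a very efficient and compact solution?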

Answer 1

Try this small awk script:

awk -F\\t+ -v OFS=\\t '$2==""{next}$1!=""{a=$1}{$1=a}1'

It assumes there are no tab characters in the second field.

Taking it piece by piece:

-F\\t+        Set the column separator to a sequence of one or more tabs
-v OFS=\\t    Use a tab to separate columns on output
$2==""{next}  Skip this line if it just has one field.
$1!=""{a=$1}  Save the first field if it is specified
{$1=a}        Replace the first field with the saved one.
              The assignment forces the line to be recomputed using OFS
              to separate columns, so it's needed even if we just did a=$1.
1             awk idiom, equivalent to `{print}` (or `{print $0}`).

Answer 2

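This is a job for a Perl script; here's one for you. It is lightly tested, takes a list of file names as command-line arguments and/or reads from stdin, and writes to stdout. It assumes that the actual number of tabs does not matter, and that there are only ever one or two non-empty fields on a line. (It will complain about, and skip, any line that does not match the expected format.)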

#! /usr/bin/perl

our $left;
while (<>) {
    chomp;
    if (/^([^\t]+)\t+([^\t]+)$/) {
        $left = $1;
        printf("%s\t%s\n", $left, $2);
    } elsif (/^\t+([^\t]+)$/) {
        if (defined $left) {
            printf("%s\t%s\n", $left, $1);
        } else {
            warn "$ARGV:$.: continuation line before leader line\n";
        }
    } else {
        warn "$ARGV:$.: line in unrecognized format\n";
    }
} continue {
    close ARGV if eof; # reset line numbering for each input file
}

A shell script (bash-specific or otherwise) will be orders of magnitude slower.

Answer 3

For completeness, here is a very simple "C" (actually flex) solution, which is probably close to the fastest possible.

File: tsv.l

%option noinput nounput noyywrap nodefault
%x SECOND
%%
  char* saved = NULL;
\t+            BEGIN SECOND;
[^\t\n]+       free(saved); saved = malloc(yyleng + 1); strcpy(saved, yytext);
<*>\n          BEGIN INITIAL;
<SECOND>.*     printf("%s\t%s\n", saved, yytext); BEGIN INITIAL;

Compile:

flex --batch -8 -CF -o tsv.c tsv.l
gcc -O3 -march=native -Wall -o tsv tsv.c -lfl
# On Mac OS X, change -lfl to -ll

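I tested this, the awk script from https://stackoverflow.com/a/39438585/1566221, and the perl script from https://stackoverflow.com/a/39438587/1566221 against a sample input of 1,000,000 non-blank lines in 91,073 sections separated by blank lines. In total, the file was 201,675,114 bytes. Timings on an Ubuntu 14.04 system:

flex: 0.85 seconds
awk: 1.40 seconds
perl: 3.85 seconds

In all cases, this is the user time reported by time prog < test.text > /dev/null, taking the minimum of five runs and rounding to 0.05 seconds.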
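The reduction described above (minimum of five runs, rounded to 0.05 seconds) could be sketched in C roughly as follows. This is only an illustration, not what was actually used for the numbers above: the names round_to_step and min_timing are hypothetical, and it assumes that "rounding" means rounding to the nearest multiple, which the answer does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Round a non-negative value to the nearest multiple of step.
 * Assumption: "rounded to 0.05 seconds" means round-to-nearest. */
double round_to_step(double value, double step)
{
    assert(value >= 0.0 && step > 0.0);
    /* Adding 0.5 and truncating rounds to the nearest whole number of steps. */
    return (double)(long)(value / step + 0.5) * step;
}

/* Return the smallest of n timings; the minimum is the run that was
 * least disturbed by other activity on the machine. */
double min_timing(const double *timings, size_t n)
{
    assert(timings != NULL && n > 0);
    double best = timings[0];
    for (size_t i = 1; i < n; i++) {
        if (timings[i] < best)
            best = timings[i];
    }
    return best;
}
```

With five made-up user times of 2.13, 2.08, 2.21, 2.11 and 2.17 seconds (not measurements from this thread), min_timing picks 2.08 and round_to_step(2.08, 0.05) reports 2.10.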

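I modified the perl script to ignore blank lines by enclosing the body of the loop, after the chomp; command, in an if (length) { ... } conditional. That had very little impact on execution time, but it was necessary to avoid the warnings that the blank lines would otherwise generate.

I don't usually use the "speed" flags on flex programs, but in this case they really did make a significant difference; without them, the flex program took almost 2 seconds, considerably more than the awk script.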

Answer 4

I tried a C implementation. About 1 s for 3 million lines. However,

if (line[0] == '\t') {
    // TODO: remove preceding `\t{3}` and trailing `\r`
    printf("%s\t%s\n", one, line);
} else {
    // TODO: split at \t{1,3} and remove trailing `\r`
    sscanf(line, "%s\t%s", one, two);
    printf("%s\t%s\n", one, two);
}

obviously treats whitespace indiscriminately, and my C is a bit rusty. How can I extract the strings properly without a lot of code?
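One possible way to handle the two TODOs without much code is to lean on strcspn and strspn from <string.h>, which split on tabs only and leave spaces inside the strings alone. The sketch below is an illustration rather than the asker's actual program: split_line, reformat_stream and MAX_LINE are hypothetical names, and it assumes that every line fits in a fixed-size buffer and that the strings contain no embedded tab or carriage-return characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { MAX_LINE = 65536 };  /* assumed upper bound on line length */

/* Split one line in place on a run of tabs.
 * Returns 2 for "leader<tabs>value", 1 for "<tabs>value" (continuation),
 * and 0 for blank lines or lines with a single field. */
int split_line(char *line, char **first, char **second)
{
    assert(line != NULL && first != NULL && second != NULL);
    line[strcspn(line, "\r\n")] = '\0';     /* drop trailing "\n" or "\r\n" */
    size_t lead = strcspn(line, "\t");      /* length of the first field */
    char *rest = line + lead;
    rest += strspn(rest, "\t");             /* skip the run of tabs */
    if (*rest == '\0')
        return 0;                           /* nothing after the tabs */
    line[lead] = '\0';                      /* terminate the first field */
    *first = line;
    *second = rest;
    return lead > 0 ? 2 : 1;
}

/* Copy in to out, repeating the most recent leader on continuation lines.
 * Returns the number of lines written. */
long reformat_stream(FILE *in, FILE *out)
{
    static char line[MAX_LINE], leader[MAX_LINE];
    char *first, *second;
    int have_leader = 0;
    long written = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        int kind = split_line(line, &first, &second);
        if (kind == 2) {                    /* new section: remember its leader */
            strcpy(leader, first);          /* fits: first lies inside line */
            have_leader = 1;
        }
        if (kind == 0 || !have_leader)
            continue;                       /* skip blanks and orphan lines */
        fprintf(out, "%s\t%s\n", leader, second);
        written++;
    }
    return written;
}
```

A main function that simply calls reformat_stream(stdin, stdout) would turn this into a filter that can be used like the awk and perl scripts above. Unlike sscanf with %s, this does not stop at spaces, so strings that contain spaces stay intact.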
