Question

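I have a very large tab-separated file (about 12 million lines) that looks like this:

F1    1
      2
      700
F2    89
      900
      10000
      19
F3    100
      60001

Is there any way I can turn it into this: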

F1    1
F1    2
F1    700
F2    89
F2    900
F2    10000
F2    19
F3    100
F3    60001

I have tried sed scripts, but they take a very long time.

For example:

sed 's/^/F1/' FILE | cut -c3- > FILE1 ; mv FILE1 FILE

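I could do it in Excel using

=IF(a2="",c1,a2)

and dragging it down (assuming I have copied "F1" to C1). But Excel only lets me load a limited number of rows.

Surely this is easier with awk or sed?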

Answer 1

perl -F'\t' -lane'$h = $F[0] ||= $h; print join "\t", @F'

Assignment is right-associative, so

$h = $F[0] ||= $h;

is equivalent to

$h = ( $F[0] ||= $h );

and therefore to

$F[0] ||= $h;
$h = $F[0];

and to

$F[0] = $h if !$F[0];
$h = $F[0];
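
One caveat with the expansion above: ||= tests truth rather than emptiness, so a first field holding the string 0 is treated as blank and overwritten. Below is a minimal sketch of a variant that compares against the empty string explicitly. It is written in awk wrapped in a shell function, and it is not the answer's own code; the name fill_down and the file names are hypothetical.

```shell
#!/bin/sh
# fill_down (hypothetical name): read tab-separated lines on stdin and
# copy the last seen first field into rows whose first field is empty.
fill_down() {
    awk '
        BEGIN { FS = OFS = "\t" }
        $1 == "" { $1 = prev }   # only a truly empty field is filled; "0" is kept
        { prev = $1 }            # remember the (possibly filled) first field
        { print }
    ' "$@"
}

# Usage (file names are placeholders):
#   fill_down < input.tsv > output.tsv
```

If the first column can never be 0, the Perl one-liner and this sketch should produce the same output.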

Answer 2

awk to the rescue!

$ awk 'BEGIN {FS=OFS="\t"} 
             {if($1!="") p=$1; else $1=p}1' file

F1      1
F1      2
F1      700
F2      89
F2      900
F2      10000
F2      19
F3      100
F3      60001

This is the input file I used:

$ cat -A file

F1^I1$
^I2$
^I700$
F2^I89$
^I900$
^I10000$
^I19$
F3^I100$
^I60001$

Answer 3

The Perl command looks like this:

perl -F'\t' -ple '$c1 = $F[0] if $F[0]; $F[0] ||= $c1; $_=join"\t",@F' 40982582.tsv > your_output.tsv

More readable:

#!/usr/bin/perl -pl -F\t

$c1 = $F[0] if $F[0]; # save off the first column if we have one.
$F[0] ||= $c1;        # override empty first-columns.
$_ = join "\t", @F;   # set the topic back to the full line for -p to print

Then run it:

perl yourscript.pl input_file.tsv > output_file.tsv

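(You can also use the "-i" flag to overwrite the file "in place", but that will not save you any run time or disk space.)

Whatever you do, though, the run time will scale with the length of the file.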
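
The point about overwriting "in place" can be made concrete by spelling the steps out by hand: the result is written to a second file, which then replaces the original, so both copies exist on disk for a while. This is a minimal sketch assuming awk and mktemp are available; fill_down_inplace is a hypothetical name, and the awk program is a stand-in for any of the one-liners in this thread.

```shell
#!/bin/sh
# fill_down_inplace (hypothetical name): rewrite FILE so that empty first
# fields take the value from the nearest earlier row that has one.
fill_down_inplace() {
    file=$1
    # Create the temporary copy next to the original so that mv is a
    # rename on the same filesystem; both copies exist until then.
    tmp=$(mktemp "$file.XXXXXX") || return 1
    if awk 'BEGIN { FS = OFS = "\t" } $1 == "" { $1 = prev } { prev = $1 } 1' \
        "$file" > "$tmp"
    then
        mv "$tmp" "$file"      # replace the original only on success
    else
        rm -f "$tmp"           # leave the original untouched on failure
        return 1
    fi
}

# Usage (file name is a placeholder):
#   fill_down_inplace input_file.tsv
```

Note that the replaced file takes on the permissions of the temporary file, which mktemp creates readable by the owner only.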

Answer 4

I suggest:

awk -F '\t' '{OFS=FS; $1==""?$1=b:b=$1}1' file

Answer 5

Here is a sed solution:

sed -r -n '/\w+\s+\w+/{p; s/^(\w+\s+).*/\1/; h};/^\w/!{G;s/^\s+(\w+)\s+(\w+\s+)/\2\1/;p}' file.dat
F1    1
F1    2
F1    700
F2    89
F2    900
F2    10000
F2    19
F3    100
F3    60001

Timing, and a comparison with the other awk and perl solutions

Here is the test code (bash script):

#!/bin/sh

## Input file with data to process
inputfile="bigdata3.txt"

## solutions dir, that contains
## - solution files, and
## - every solution file contains code to evaluate
solutions="solutions/"

file_size_kb=$(du -k "$inputfile" | cut -f1)
echo "Size of input file: $file_size_kb kB"
file_lines_count=$(wc -l $inputfile | sed -r 's/\s*([0-9]+)\s+.*/\1/')
echo "Lines of input file: $file_lines_count"

test_code="time \$code > out.txt"
echo "Test code: '$test_code'"

for solution in $solutions* ; do
    ## output file deletion
    if [ -f out.txt ]; then 
        rm out.txt 
    fi;

    code_content=$(cat $solution)
    code="time $code_content $inputfile > out.txt"
    echo "--------------------------------------------------"
    echo "Solution: $solution"
    echo "Code    : $code"
    res=$(sh -c "cd $PWD; $code")
    echo $res

    ## check correctness of output
    incorrect_lines_count=$(sed -r -n "/^[^a-zA-Z0-9_]+/p" out.txt | wc -l | sed -r 's/\s*([0-9]+)\s*.*/\1/')
    total_lines=$(wc -l out.txt | sed -r 's/\s*([0-9]+)\s+.*/\1/') 
    if [ $incorrect_lines_count -eq 0 ] && [ $total_lines -eq $file_lines_count ]; then
        echo "TEST PASSED"
    else
        echo "INVALID SOLUTION:"
        echo " - not processed lines: $incorrect_lines_count (spaces at line beginning)"
        echo " - total processed lines: $total_lines (expecting: $file_lines_count)"
    fi
done;

And the results (for a 46 MB input file):

Size of input file: 46034 kB
Lines of input file: 8658000
Test code: 'time $code > out.txt'
--------------------------------------------------
Solution: solutions/Cyrus_awk
Code    : time awk -F '\t' '{OFS=FS; $1==""?$1=b:b=$1}1' bigdata3.txt > out.txt

real    0m8.072s
user    0m7.644s
sys     0m0.420s

TEST PASSED
--------------------------------------------------
Solution: solutions/Ed_Morton_awk
Code    : time awk '{sub(/^\t/,p"&");p=$1}1' bigdata3.txt > out.txt

real    0m11.887s
user    0m11.434s
sys     0m0.389s

TEST PASSED
--------------------------------------------------
Solution: solutions/Marek_Nowaczyk_sed
Code    : time sed -r -n '/\w+\s+\w+/{p; s/^(\w+\s+).*/\1/; h};/^\w/!{G;s/^\s+(\w+)\s+(\w+\s+)/\2\1/;p}' bigdata3.txt > out.txt

real    0m30.239s
user    0m29.577s
sys     0m0.545s

TEST PASSED
--------------------------------------------------
Solution: solutions/Tanktalus_perl
Code    : time perl -F'\t' -ple '$c1 = $F[0] if $F[0]; $F[0] ||= $c1; $_=join"\t",@F'  bigdata3.txt > out.txt

real    0m6.992s
user    0m6.692s
sys     0m0.281s

TEST PASSED
--------------------------------------------------
Solution: solutions/ikeagami_perl
Code    : time perl -F'\t' -lane'$h = $F[0] ||= $h; print join "\t", @F' bigdata3.txt > out.txt

real    0m12.977s
user    0m12.463s
sys     0m0.483s

TEST PASSED
--------------------------------------------------
Solution: solutions/karakfa_awk
Code    : time awk 'BEGIN {FS=OFS="\t"} {if($1!="") p=$1; else $1=p}1'  bigdata3.txt > out.txt

real    0m7.545s
user    0m6.832s
sys     0m0.498s

TEST PASSED
--------------------------------------------------
Solution: solutions/slitvinov_awk
Code    : time awk 'BEGIN   { FS = OFS = "\t" } NF == 1 { print  pre,       $1 } NF == 2 { print (pre = $1), $2 }' bigdata3.txt > out.txt

real    0m8.333s
user    0m7.908s
sys     0m0.404s

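INVALID SOLUTION:
 - not processed lines: 5772000 (spaces at line beginning)
 - total processed lines: 8658000 (expecting: 8658000)

Conclusion

@Tanktalus's perl solution performed best (6.992s real), but the awk solutions from @karakfa and @Cyrus also did well (7.545s and 8.072s).

Off-topic

This sed solution had the best performance on a smaller file (the 8 kB file from this example), but it is very slow on larger data.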
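
The correctness check in the test script flags output lines that begin with a non-word character. A narrower check, sketched below under the assumption that the data is strictly tab-separated, counts only the lines whose first field is still empty. count_unfilled is a hypothetical helper and was not part of the benchmark.

```shell
#!/bin/sh
# count_unfilled (hypothetical name): print how many lines of FILE still
# have an empty first tab-separated field.
count_unfilled() {
    # n + 0 forces a numeric 0 when no line matched
    awk -F '\t' '$1 == "" { n++ } END { print n + 0 }' "$1"
}

# Usage (file name is a placeholder):
#   count_unfilled out.txt
```

A result of 0, together with an unchanged line count, would correspond to "TEST PASSED" in the script above.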

Answer 6

$ cat pre.awk
BEGIN   { FS = OFS = "\t" }
NF == 1 { print  pre,       $1 }
NF == 2 { print (pre = $1), $2 }

Usage:

$ awk -f pre.awk file.dat

Answer 7

$ awk '{sub(/^\t/,p"&");p=$1}1' file
F1      1
F1      2
F1      700
F2      89
F2      900
F2      10000
F2      19
F3      100
F3      60001

Fill in blank fields
