I have a very large tab-delimited file (about 12 million lines) that looks like this:
F1 1
2
700
F2 89
900
10000
19
F3 100
60001
Is there any way I can turn it into this:
F1 1
F1 2
F1 700
F2 89
F2 900
F2 10000
F2 19
F3 100
F3 60001
I tried using a sed script, but it takes a very long time.
For example:
sed 's/^/F1/' FILE | cut -c3- > FILE1 ; mv FILE1 FILE
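I can do this in Excel with
=IF(A2="",C1,A2)
and then dragging the formula down (assuming I have already copied "F1" into C1). But Excel only lets me load a limited number of rows.
Surely this is easier with awk or sed?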
Answer 0 (score: 6)
perl -F'\t' -lane'$h = $F[0] ||= $h; print join "\t", @F'
Assignment is right-associative, so
$h = $F[0] ||= $h;
is equivalent to
$h = ( $F[0] ||= $h );
which in turn is equivalent to
$F[0] ||= $h;
$h = $F[0];
and to
$F[0] = $h if !$F[0];
$h = $F[0];
Answer 1 (score: 4)
awk to the rescue!
$ awk 'BEGIN {FS=OFS="\t"}
{if($1!="") p=$1; else $1=p}1' file
F1 1
F1 2
F1 700
F2 89
F2 900
F2 10000
F2 19
F3 100
F3 60001
This is the input file I used:
$ cat -A file
F1^I1$
^I2$
^I700$
F2^I89$
^I900$
^I10000$
^I19$
F3^I100$
^I60001$
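For readers who prefer the logic spelled out, here is the same fill-down idea as a sketch with one rule per step. It assumes, as in the `cat -A` listing above, that continuation lines begin with a tab. The names `fill_down` and `last` are illustrative only.

```shell
# Hypothetical wrapper around the same awk logic, one rule per step.
fill_down() {
    awk '
        BEGIN { FS = OFS = "\t" }   # read and write tab-separated fields
        $1 != "" { last = $1 }      # non-empty label: remember it
        $1 == "" { $1 = last }      # empty label: reuse the remembered one
        { print }
    ' "$@"
}

printf 'F1\t1\n\t2\n\t700\nF2\t89\n' | fill_down
```

Assigning to `$1` makes awk rebuild the record with `OFS`, which is why `OFS` has to be set to a tab as well.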
Answer 2 (score: 2)
The Perl command would look like this:
perl -F'\t' -ple '$c1 = $F[0] if $F[0]; $F[0] ||= $c1; $_=join"\t",@F' 40982582.tsv > your_output.tsv
More readably:
#!/usr/bin/perl -pl -F\t
$c1 = $F[0] if $F[0]; # save off the first column if we have one.
$F[0] ||= $c1; # override empty first-columns.
$_ = join "\t", @F; # set the topic back to the full line for -p to print
Then run:
perl yourscript.pl input_file.tsv > output_file.tsv
(You could also use the "-i" flag to overwrite the file "in place", but that does not save you any time or disk space at run time.)
Either way, the run time grows with the length of the file.
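If the goal is simply to end up with the original file name, one common pattern is to write to a temporary file and rename it on success. The sketch below uses awk for the fill-down step. The function name `fill_file` is hypothetical, and it assumes `mktemp` is available.

```shell
# Hypothetical helper: rewrite a TSV file with column 1 filled down.
fill_file() {
    file=$1
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if awk 'BEGIN { FS = OFS = "\t" } { if ($1 != "") p = $1; else $1 = p } 1' "$file" > "$tmp"
    then
        mv "$tmp" "$file"        # replace the original only on success
    else
        rm -f "$tmp"             # leave the original untouched on failure
        return 1
    fi
}
```

It would be called as `fill_file input_file.tsv`. The temporary copy needs roughly as much free disk space as the original, and the file is still read once from start to finish.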
Answer 3 (score: 2)
I suggest:
awk -F '\t' '{OFS=FS; $1==""?$1=b:b=$1}1' file
Answer 4 (score: 2)
Here is a sed solution:
sed -r -n '/\w+\s+\w+/{p; s/^(\w+\s+).*/\1/; h};/^\w/!{G;s/^\s+(\w+)\s+(\w+\s+)/\2\1/;p}' file.dat
F1 1
F1 2
F1 700
F2 89
F2 900
F2 10000
F2 19
F3 100
F3 60001
Time consumed, compared with the other awk and perl solutions
Here is the test code (a shell script):
#!/bin/sh
## Input file with data to process
inputfile="bigdata3.txt"
## solutions dir, that contains
## - solution files, and
## - every solution file contains code to evaluate
solutions="solutions/"
file_size_kb=$(du -k "$inputfile" | cut -f1)
echo "Size of input file: $file_size_kb kB"
file_lines_count=$(wc -l $inputfile | sed -r 's/\s*([0-9]+)\s+.*/\1/')
echo "Lines of input file: $file_lines_count"
test_code="time \$code > out.txt"
echo "Test code: '$test_code'"
for solution in $solutions* ; do
    ## output file deletion
    if [ -f out.txt ]; then
        rm out.txt
    fi;
    code_content=$(cat $solution)
    code="time $code_content $inputfile > out.txt"
    echo "--------------------------------------------------"
    echo "Solution: $solution"
    echo "Code : $code"
    res=$(sh -c "cd $PWD; $code")
    echo $res
    ## check correctness of output
    incorrect_lines_count=$(sed -r -n "/^[^[a-zA-Z0-9_]+/p" out.txt | wc -l | sed -r 's/\s*([0-9]+)\s*.*/\1/')
    total_lines=$(wc -l out.txt | sed -r 's/\s*([0-9]+)\s+.*/\1/')
    if [ $incorrect_lines_count -eq 0 ] && [ $total_lines -eq $file_lines_count ]; then
        echo "TEST PASSED"
    else
        echo "INVALID SOLUTION:"
        echo " - not processed lines: $incorrect_lines_count (spaces at line beginning)"
        echo " - total processed lines: $total_lines (expecting: $file_lines_count)"
    fi
done;
And the results (for a 46 MB input file):
Size of input file: 46034 kB
Lines of input file: 8658000
Test code: 'time $code > out.txt'
--------------------------------------------------
Solution: solutions/Cyrus_awk
Code : time awk -F '\t' '{OFS=FS; $1==""?$1=b:b=$1}1' bigdata3.txt > out.txt
real 0m8.072s
user 0m7.644s
sys 0m0.420s
TEST PASSED
--------------------------------------------------
Solution: solutions/Ed_Morton_awk
Code : time awk '{sub(/^\t/,p"&");p=$1}1' bigdata3.txt > out.txt
real 0m11.887s
user 0m11.434s
sys 0m0.389s
TEST PASSED
--------------------------------------------------
Solution: solutions/Marek_Nowaczyk_sed
Code : time sed -r -n '/\w+\s+\w+/{p; s/^(\w+\s+).*/\1/; h};/^\w/!{G;s/^\s+(\w+)\s+(\w+\s+)/\2\1/;p}' bigdata3.txt > out.txt
real 0m30.239s
user 0m29.577s
sys 0m0.545s
TEST PASSED
--------------------------------------------------
Solution: solutions/Tanktalus_perl
Code : time perl -F'\t' -ple '$c1 = $F[0] if $F[0]; $F[0] ||= $c1; $_=join"\t",@F' bigdata3.txt > out.txt
real 0m6.992s
user 0m6.692s
sys 0m0.281s
TEST PASSED
--------------------------------------------------
Solution: solutions/ikeagami_perl
Code : time perl -F'\t' -lane'$h = $F[0] ||= $h; print join "\t", @F' bigdata3.txt > out.txt
real 0m12.977s
user 0m12.463s
sys 0m0.483s
TEST PASSED
--------------------------------------------------
Solution: solutions/karakfa_awk
Code : time awk 'BEGIN {FS=OFS="\t"} {if($1!="") p=$1; else $1=p}1' bigdata3.txt > out.txt
real 0m7.545s
user 0m6.832s
sys 0m0.498s
TEST PASSED
--------------------------------------------------
Solution: solutions/slitvinov_awk
Code : time awk 'BEGIN { FS = OFS = "\t" } NF == 1 { print pre, $1 } NF == 2 { print (pre = $1), $2 }' bigdata3.txt > out.txt
real 0m8.333s
user 0m7.908s
sys 0m0.404s
INVALID SOLUTION:
- not processed lines: 5772000 (spaces at line beginning)
- total processed lines: 8658000 (expecting: 8658000)
Conclusion
The perl solution from @Tanktalus performs best, but the awk solutions from @karakfa and @Cyrus also perform well.
Off-topic
This sed solution has the best performance on smaller files (in this test, an 8k file), but it is very slow on larger data.
Answer 5 (score: 1)
$ cat pre.awk
BEGIN { FS = OFS = "\t" }
NF == 1 { print pre, $1 }
NF == 2 { print (pre = $1), $2 }
Usage:
$ awk -f pre.awk file.dat
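The benchmark in the earlier answer reports this script as leaving lines unprocessed. A likely reason, assuming the continuation lines in the real file start with a tab (as the `cat -A` listing in another answer shows), is that such lines already have two fields, the first one empty, so the `NF == 1` rule never fires. The quick check below illustrates this. The helper name `count_fields` is hypothetical.

```shell
# Hypothetical helper: print the number of fields awk sees on each
# line when the field separator is a single tab, as in pre.awk.
count_fields() {
    awk 'BEGIN { FS = "\t" } { print NF }' "$@"
}

printf '\t2\n' | count_fields   # continuation line with a leading tab
printf '2\n'   | count_fields   # continuation line without a leading tab
```

The first call prints 2 and the second prints 1, so pre.awk fills the label only for input where continuation lines have no leading tab.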
Answer 6 (score: 1)
$ awk '{sub(/^\t/,p"&");p=$1}1' file
F1 1
F1 2
F1 700
F2 89
F2 900
F2 10000
F2 19
F3 100
F3 60001