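I have a bit of a problem with some fairly large CSV files. I can write simple bash/awk scripts, but this problem is beyond my limited awk/bash programming experience.
The problem:
All my files are in one folder. The folder contains an even number of CSV files, which need to be trimmed in pairs (I'll explain what I mean by that). The files are named as follows: f1L, f1R, f2L, f2R, f3L, f3R, ..., fnL, fnR.
The files need to be read in pairs, i.e. f1L with f1R, f2L with f2R, and so on.
Each file has two comma-separated fields. The start and end of f1L and f1R look like this:
f1L (START)
1349971210, -0.984375
1349971211, -1.000000
f1R (START)
1349971206, -0.015625
1349971207, 0.000000
f1L (END)
1350230398, 0.500000
1350230399, 0.515625
f1R (END)
1350230402, 0.484375
1350230403, 0.515625
What I would like to do with awk is:
I was wondering whether you have any suggestions for a small bash/awk script to get the job done.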
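Answer 0: (score: 1)
A naive way to achieve this in bash. No attempt at efficiency, and no error checking (well, only the bare minimum).
Name this script myscript. It takes two arguments (the files fxL and fxR).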
#!/bin/bash
tmpL='' tmpR='' # paths of the temporary files, removed by on_exit
die() {
echo >&2 "$@"
exit 1
}
on_exit() {
[[ -f $tmpL ]] && rm -- "$tmpL"
[[ -f $tmpR ]] && rm -- "$tmpR"
}
last_non_blank_line() {
sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}
(($#==2)) || die "script takes two arguments"
fL=$1
fR=$2
[[ -r "$fL" && -w "$fL" ]] || die "problem with file \`$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file \`$fR'"
# read record1, line1 of fL and fR
IFS=, read min _ < "$fL"
[[ $min =~ ^[[:digit:]]+$ ]] || die "first line of \`$fL' has a bad record"
IFS=, read t _ < "$fR"
[[ $t =~ ^[[:digit:]]+$ ]] || die "first line of \`$fR' has a bad record"
((t>min)) && ((min=t))
# read record1, last line of fL and fR
IFS=, read max _ < <( last_non_blank_line "$fL")
[[ $max =~ ^[[:digit:]]+$ ]] || die "last line of \`$fL' has a bad record"
IFS=, read t _ < <(last_non_blank_line "$fR")
[[ $t =~ ^[[:digit:]]+$ ]] || die "last line of \`$fR' has a bad record"
((t<max)) && ((max=t))
# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"
trap 'on_exit' EXIT
# Read fL line by line, and only keep those
# the first record of which is between min and max
while IFS=, read a b; do
[[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fL" > "$tmpL"
mv -- "$tmpL" "$fL"
# Same with fR:
while IFS=, read a b; do
[[ $a =~ ^[[:digit:]]+$ ]] && ((a<=max)) && ((a>=min)) && echo "$a,$b"
done < "$fR" > "$tmpR"
mv -- "$tmpR" "$fR"
and call it as:
$ myscript f1L f1R
Try it on scratch files first! No warranty! Use at your own risk!
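One cautious way to do that is to work on a throwaway copy of the data directory. The sketch below is only an illustration: the helper names `scratch_copy` and `for_each_pair` are hypothetical (they are not part of the script above), and it assumes the pairs are named fNL / fNR as in the question.

```shell
#!/bin/bash
# Hypothetical helpers, not part of myscript.

# Copy the contents of directory $1 into a fresh temporary directory
# and print the path of that directory.
scratch_copy() (
    dst=$(mktemp -d) || exit 1
    cp -R -- "$1"/. "$dst"/ || exit 1
    printf '%s\n' "$dst"
)

# Run a command (default: myscript) on every fNL/fNR pair found in directory $1.
for_each_pair() (
    dir=$1
    cmd=${2:-myscript}
    for fL in "$dir"/f*L; do
        fR=${fL%L}R                              # f1L -> f1R
        [ -f "$fL" ] && [ -f "$fR" ] || continue # skip incomplete pairs
        "$cmd" "$fL" "$fR"
    done
)
```

With these, something like `work=$(scratch_copy /path/to/data) && for_each_pair "$work"` would trim only the copies; `/path/to/data` is a placeholder.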
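Caveat. Since the script uses bash arithmetic for the comparisons, it assumes that the first field of every line in each file is an integer within the range that bash can handle.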
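If you are unsure whether a file meets that assumption, a quick pre-check might look like the sketch below. The helper name `all_int_timestamps` is hypothetical; it only tests that the first field consists of digits and does not detect values too large for bash's fixed-width integers.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the first field of every non-empty
# line of file $1 consists of decimal digits only.
all_int_timestamps() (
    [ -r "$1" ] || exit 1
    while IFS=, read -r ts _ || [ -n "$ts" ]; do
        case $ts in
            '') continue ;;          # empty line
            *[!0-9]*) exit 1 ;;      # sign, dot, space, letter, ...
        esac
    done < "$1"
    exit 0
)
```

A file whose first field looks like 1349971210.5 fails this check, and that is exactly the case the bash-arithmetic version cannot handle.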
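Edit. Since your first field is a floating-point number, you can't use the method above, which relies on bash arithmetic. A fun alternative is to let bash do all the housekeeping (getting the first line, the last line, opening the files, ...) and use bc for the arithmetic. With this you are not limited by the size of the numbers (bc uses arbitrary precision), and floats are welcome! For example: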
#!/bin/bash
tmpL='' tmpR='' # paths of the temporary files, removed by on_exit
die() {
echo >&2 "$@"
exit 1
}
on_exit() {
[[ -f $tmpL ]] && rm -- "$tmpL"
[[ -f $tmpR ]] && rm -- "$tmpR"
}
last_non_blank_line() {
sed -n -e $'/^$/ !h\n$ {x;p;}' "$1"
}
(($#==2)) || die "script takes two arguments"
fL=$1
fR=$2
[[ -r "$fL" && -w "$fL" ]] || die "problem with file \`$fL'"
[[ -r "$fR" && -w "$fR" ]] || die "problem with file \`$fR'"
# read record1, line1 of fL and fR
IFS=, read a _ < "$fL"
IFS=, read b _ < "$fR"
min=$(bc <<< "if($b>$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $min ]] && die "problem in first line of files \`$fL' or \`$fR'"
# read record1, last line of fL and fR
IFS=, read a _ < <( last_non_blank_line "$fL")
IFS=, read b _ < <(last_non_blank_line "$fR")
max=$(bc <<< "if($b<$a) { print \"$b\" } else { print \"$a\" }" 2> /dev/null)
[[ -z $max ]] && die "problem in last line of files \`$fL' or \`$fR'"
# create tmp files
tmpL=$(mktemp --tmpdir) || die "can't create tmp file"
tmpR=$(mktemp --tmpdir) || die "can't create tmp file"
trap 'on_exit' EXIT
# Read fL line by line, and only keep those
# the first record of which is between min and max
while read l; do
[[ $l =~ ^[[:space:]]*$ ]] && continue
r=${l%%,*}
printf "if(%s>=$min && %s<=$max) { print \"%s\n\" }\n" "$r" "$r" "$l"
done < "$fL" | bc > "$tmpL" || die "Error in bc while doing file \`$fL'"
# Same with fR:
while read l; do
[[ $l =~ ^[[:space:]]*$ ]] && continue
r=${l%%,*}
printf "if(%s>=$min && %s<=$max) { print \"%s\n\" }\n" "$r" "$r" "$l"
done < "$fR" | bc > "$tmpR" || die "Error in bc while doing file \`$fR'"
mv -- "$tmpL" "$fL"
mv -- "$tmpR" "$fR"
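The comparison step can also be tried in isolation. The sketch below wraps it in two hypothetical helpers, `fmax` and `fmin` (names not from the script above). It assumes bc is installed, that both arguments are plain decimal numbers, and that the bc in use accepts a comparison as a stand-alone expression (GNU bc does).

```shell
#!/bin/bash
# Hypothetical helpers: print the larger / smaller of two decimal numbers.
# bc does the comparison, so fractions and very large values are fine;
# the chosen argument is printed back unchanged.
fmax() {
    if [ "$(echo "$1 > $2" | bc)" = 1 ]; then echo "$1"; else echo "$2"; fi
}

fmin() {
    if [ "$(echo "$1 < $2" | bc)" = 1 ]; then echo "$1"; else echo "$2"; fi
}
```

For instance, `fmax 1349971210.25 1349971206.75` prints 1349971210.25 and `fmin 1350230398.5 1350230402` prints 1350230398.5, which are the lower and upper bounds the script would keep for such a pair.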
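Answer 1: (score: 1)
I tried to include all necessary sanity checks and to minimize disk I/O (assuming your files are large enough that reading them is the time-limiting factor). In addition, the files never have to be held in memory in full (assuming your files may be larger than the available RAM).
However, this has only been tried on very basic dummy input, so please test it and report any problems.
First I wrote a script that trims one pair (identified by the f...L file name):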
#!/bin/bash
#############
# trim_pair #
#-----------#############################
# given fXL file path, trim fXL and fXR #
#########################################
#---------------#
# sanity checks #
#---------------#
# error function
error(){
echo >&2 "$@"
exit 1
}
# argument given?
[[ $# -eq 1 ]] || \
error "usage: $0 <file>"
LFILE="$1"
# argument format valid?
[[ `basename "$LFILE" | egrep '^f[[:digit:]]+L$'` ]] || \
error "invalid file name: $LFILE (has to match /^f[[:digit:]]+L$/)"
RFILE="${LFILE%L}R" # POSIX parameter expansion: strip the trailing L, append R
# files exists?
[[ -e "$LFILE" ]] || \
error "file does not exist: $LFILE"
[[ -e "$RFILE" ]] || \
error "file does not exist: $RFILE"
# files readable?
[[ -r "$LFILE" ]] || \
error "file not readable: $LFILE"
[[ -r "$RFILE" ]] || \
error "file not readable: $RFILE"
# files writable?
[[ -w "$LFILE" ]] || \
error "file not writable: $LFILE"
[[ -w "$RFILE" ]] || \
error "file not writable: $RFILE"
#------------------#
# create tmp files #
# & ensure removal #
#------------------#
# cleanup function
cleanup(){
[[ -e "$LTMP" ]] && rm -- "$LTMP"
[[ -e "$RTMP" ]] && rm -- "$RTMP"
}
# cleanup on exit
trap 'cleanup' EXIT
#create tmp files
LTMP=`mktemp --tmpdir` || \
error "tmp file creation failed"
RTMP=`mktemp --tmpdir` || \
error "tmp file creation failed"
#----------------------#
# process both files #
# prepended by their #
# first and last lines #
#----------------------#
# extract first and last lines without reading the whole files twice
{
head -q -n1 "$LFILE" "$RFILE" # no need to read the whole files
tail -q -n1 "$LFILE" "$RFILE" # no need to read the whole files
} | awk -F, '
NF!=2{
print "incorrect file format: record "FNR" in file "FILENAME > "/dev/stderr"
exit 1
}
FILENAME!="-"&&NR<5{ # head/tail must supply 4 records before the first file
print "too few lines in input" > "/dev/stderr"
exit 1
}
NR==1{ # read record 1,
x1=$1 # field 1 of L,
next # then read
}
NR==2{ # record 1 of R,
x1=$1>x1?$1:x1 # field 1 & take the max,
next # then
}
NR==3{ # read last record,
x2=$1 # field 1 of L,
next # then
}
NR==4{ # last record of R,
x2=$1<x2?$1:x2 # field 1 & take the min (with the max nothing would be cut at the end)
next
}
FNR==1{
outfile=FILENAME~/L$/?"'"$LTMP"'":"'"$RTMP"'"
}
$1>=x1&&$1<=x2{
print > outfile
}
' - "$LFILE" "$RFILE" || \
error "error while trimming"
#-----------------------#
# re-save trimmed files #
# under the same names #
#-----------------------#
mv -- "$LTMP" "$LFILE" || \
error "cannot re-save $LFILE"
mv -- "$RTMP" "$RFILE" || \
error "cannot re-save $RFILE"
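As you can see, the main idea is to use head and tail to prepend the important lines to the input, and then to process everything with awk as you requested.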
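The prepending trick can be seen on its own in the sketch below. `overlap_bounds` is a hypothetical helper (it is not part of trim_pair) that prints the latest first timestamp and the earliest last timestamp of a pair. It assumes both files are non-empty, sorted by time, and end with a newline and no trailing blank lines.

```shell
#!/bin/bash
# Hypothetical helper: print "<lower bound> <upper bound>" for the pair $1, $2.
overlap_bounds() {
    {
        head -n 1 "$1"; head -n 1 "$2"   # first record of L, then of R
        tail -n 1 "$1"; tail -n 1 "$2"   # last record of L, then of R
    } | awk -F, '
        NR == 1 { lo = $1 }                   # first timestamp of L
        NR == 2 { if ($1 > lo) lo = $1 }      # keep the later start
        NR == 3 { hi = $1 }                   # last timestamp of L
        NR == 4 { if ($1 < hi) hi = $1 }      # keep the earlier end
        END     { if (NR != 4) exit 1; print lo, hi }
    '
}
```

With the sample data from the question, `overlap_bounds f1L f1R` prints `1349971210 1350230399`. Only four lines go through awk here, so the cost does not depend on the size of the files.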
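To call the script for all files in a directory, you can use the following script (not as elaborate as the one above, but I guess you could come up with something similar yourself):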
#!/bin/bash
############
# trim all #
#----------###################################
# find L files in current or given directory #
# and trim the corresponding file pairs #
##############################################
TRIM_PAIR="trim_pair" # path to the trim script for one pair
if [[ $# -eq 1 ]]
then
WD="$1"
else
WD="`pwd`"
fi
find "$WD" \
-type f \
-readable \
-writable \
-regextype posix-egrep \
-regex "^$WD/"'f[[:digit:]]+L' \
-exec "$TRIM_PAIR" "{}" \;
Note that the trim_pair script has to be on your PATH; otherwise adjust the TRIM_PAIR variable in the trim_all script.
Answer 2: (score: 1)
Using perl:
use warnings;
use strict;
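my $dir = $ARGV[0]; # directory is argument
defined $dir or die "usage: $0 <directory>\n";
for my $fnL (glob "$dir/f[0-9]*L") {
    (my $fnR = $fnL) =~ s/L$/R/; # matching R file in the same directory
    my ($dL, $dR) = (loadfile($fnL), loadfile($fnR));
    next unless @$dL && @$dR; # nothing to compare in an empty file
    # field 0 is the timestamp; keep only the range covered by both files:
    # latest first timestamp .. earliest last timestamp
    my ($min, $max) = (max($dL->[0][0], $dR->[0][0]),
                       min($dL->[-1][0], $dR->[-1][0]));
    trimfile($fnL, $dL, $min, $max);
    trimfile($fnR, $dR, $min, $max);
}
# read a file into an array of [timestamp, value] records
sub loadfile {
    my ($fname, @d) = (shift);
    open(my $fh, "<", $fname) or die "$fname: $!";
    while (<$fh>) {
        chomp;
        push(@d, [ split(/[, ]+/, $_) ]) if /\S/; # skip blank lines
    }
    close $fh;
    return \@d;
}
# rewrite the file, keeping only records with $min <= timestamp <= $max
sub trimfile {
    my ($fname, $data, $min, $max) = @_;
    open(my $fh, ">", $fname) or die "$fname: $!";
    for my $rec (@$data) {
        next if $rec->[0] < $min || $rec->[0] > $max;
        print $fh join(", ", @$rec), "\n";
    }
    close $fh;
}
sub min { my ($x, $y) = @_; return $x < $y ? $x : $y; }
sub max { my ($x, $y) = @_; return $x > $y ? $x : $y; }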