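I have two files, one with 17k lines and the other with 4k lines. I want to compare positions 115 to 125 of each line in the first file against every line in the second file and, if there is a match, write the entire line from the first file to a new file. I came up with a solution that reads the file with 'cat $filename | while read LINE', but it takes about 8 minutes to complete. Is there another way, such as using 'awk', to reduce the processing time?
My code: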
cat $filename | while read LINE
do
#read 115 to 125 and then remove trailing spaces and leading zeroes
vid=`echo "$LINE" | cut -c 115-125 | sed 's,^ *,,; s, *$,,' | sed 's/^[0]*//'`
exist=0
#match vid with entire line in id.txt
exist=`grep -x "$vid" $file_dir/id.txt | wc -l`
if [[ $exist -gt 0 ]]; then
echo "$LINE" >> $dest_dir/id.txt
fi
done
Answer 0 (score: 2)
How about this:
FNR==NR {                      # FNR == NR is only true in the first file
    s = substr($0,115,11)      # Columns 115-125 inclusive (11 characters, like cut -c 115-125)
    sub(/^[ \t]+/,"",s)        # Remove any leading whitespace (portable; \s is gawk-only)
    sub(/[ \t]+$/,"",s)        # Remove any trailing whitespace
    sub(/^0+/,"",s)            # Remove leading zeroes, as the original loop does
    lines[s] = ((s in lines) ? lines[s] ORS : "") $0   # Collect every line sharing this key
    next                       # Get next line in first file
}
{                              # Now in second file
    for(i in lines)            # For each key in the array
        if (i~$0) {            # If the key matches the current line (treated as a regex)
            print lines[i]     # Print the matching line(s) from file1
            next               # Get next line in second file
        }
}
Save it to a script, script.awk,
and run it like this:
$ awk -f script.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"
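This will still be slow, because for each line in the second file you need to look through roughly 50% of the unique keys from the first file (assuming most lines do find a match). You can improve on this significantly if you can confirm that each line in the second file must equal the extracted substring in full, rather than merely match part of it.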
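To make the distinction concrete, here is a small sketch of how a regex match (`~`) and an exact array lookup (`in`) can disagree. The function name `match_demo` and the sample values are made up for illustration only:

```shell
# Hypothetical demo: compare a regex match with an exact array-key lookup.
match_demo() {
    awk 'BEGIN {
        key = "12345"                    # pretend this was extracted from file1
        id  = "234"                      # pretend this is a line from file2

        # "~" treats id as a regular expression, so a partial match succeeds.
        r = (key ~ id) ? "match" : "no match"
        print "regex: " r

        # "in" is a hash lookup that only succeeds when the whole key is equal.
        seen[key]
        r = (id in seen) ? "match" : "no match"
        print "in: " r
    }'
}

match_demo
```

The regex form has to be tried against every stored key, while the `in` test is a single lookup per line, which is where the speed-up comes from.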
For whole-line matches, this should be faster:
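FNR==NR {                      # FNR == NR is only true in the first file
    s = substr($0,115,11)      # Columns 115-125 inclusive (11 characters, like cut -c 115-125)
    sub(/^[ \t]+/,"",s)        # Remove any leading whitespace (portable; \s is gawk-only)
    sub(/[ \t]+$/,"",s)        # Remove any trailing whitespace
    sub(/^0+/,"",s)            # Remove leading zeroes, as the original loop does
    lines[s] = ((s in lines) ? lines[s] ORS : "") $0   # Collect every line sharing this key
    next                       # Get next line in first file
}
($0 in lines) {                # Now in second file: exact lookup of the whole line
    print lines[$0]            # Print the matching line(s) from file1
}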
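If the output should keep the first file's order, as the original shell loop does, one possible variation is to load the smaller ID file into the array and stream the large file through it. This is only a sketch under a few assumptions: the lines of id.txt carry no padding or leading zeroes, the ID file is not empty (otherwise `FNR == NR` would also be true for the data file), and the function name `filter_by_id` is hypothetical:

```shell
# Hypothetical helper: print the lines of data file $1 whose columns 115-125
# (trimmed, leading zeroes removed) equal some whole line of ID file $2.
filter_by_id() {
    awk '
    FNR == NR { ids[$0]; next }      # ID file: remember each whole line as a key
    {
        key = substr($0, 115, 11)    # columns 115-125 inclusive (11 characters)
        sub(/^ +/, "", key)          # trim leading spaces
        sub(/ +$/, "", key)          # trim trailing spaces
        sub(/^0+/, "", key)          # strip leading zeroes
        if (key in ids) print        # exact lookup; prints in data-file order
    }' "$2" "$1"
}
```

It could then be called along the lines of `filter_by_id "$filename" "$file_dir/id.txt" > "$dest_dir/id.txt"`, using the variable names from the question.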