I need to insert a new field containing the MD5 Hash value of the first field for each line of an 80 GB csv file.
For small projects, I have been able to do this in Excel by passing the field value to
=WEBSERVICE(CONCATENATE("https://helloacm.com/api/md5/?s="&ENCODEURL(A1)))
However, with the 80 GB file, that is not an option.
Via AWK, is it possible to pull the first field of each row in this massive CSV, calculate the md5 of that field's content, and insert that value back into the same line?
Example line:
Original:
"value001","value002","Value003","Value004","Value005","Value006","Value007"
Revised Example line with md5ofvalue001 field inserted:
"value001","md5ofvalue001","value002","Value003","Value004","Value005","Value006","Value007"
Answer 0 (score: 3)
awk to the rescue!
Here is a proof of concept for you
Answer 1 (score: 0)
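awk is great, but for your problem it may be too slow if it has to use system() to compute the md5. awk may also be a poor fit for the task if the first field contains any embedded commas.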
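To make the embedded-comma concern concrete, here is a minimal sketch, assuming a POSIX shell and any awk. The function name first_field_naive and the sample line are made up for illustration. It shows that splitting on a bare comma cuts a quoted first field in two, so any md5 computed from $1 would cover only part of the value.

```shell
# Hypothetical helper: print what awk sees as the first field
# when the field separator is a plain comma.
first_field_naive() {
    awk -F',' '{ print $1 }'
}

# The first CSV field here is "Smith, John", but awk stops at the comma.
printf '%s\n' '"Smith, John","value002"' | first_field_naive
# prints: "Smith
```

If the first field can never contain a comma, plain comma splitting is fine; otherwise a CSV-aware parser is the safer choice.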
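In any case, here is a fast (or at least much faster) solution using php, which I have found to have excellent support for CSV of all stripes and hues. You should be able to run it as a script on a Mac or a Linux-like platform.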
#!/usr/bin/env php
<?php
# Syntax: $0 [PATHNAME]
# A filter that expects its input to have the CSV format.
# Input is taken from STDIN if PATHNAME is - or not specified.
# Output is the same CSV but with the md5 of the first field tacked on.

$file = ($argc > 1 && $argv[1] != "") ? $argv[1] : 'php://stdin';
if ($file == "-") { $file = 'php://stdin'; }

$handle = @fopen($file, "r");
$sep = ",";
if ($handle) {
    # A length of 0 lets fgetcsv() read lines of any size.
    while (($data = fgetcsv($handle, 0, $sep)) !== FALSE) {
        # A blank input line yields array(null), so cast before hashing.
        $data[] = md5((string) $data[0]);
        fputcsv(STDOUT, $data, $sep);
    }
    fclose($handle);
} else {
    # Report the failure on STDERR so it does not end up in the CSV output.
    fwrite(STDERR, "{$argv[0]}: unable to fopen $file\n");
    exit(1);
}
?>
If you want to keep the input lines unchanged, you could read each line literally and use str_getcsv() to parse it, and so on.
Answer 2 (score: 0)
Since you asked how to do this in awk, and assuming echo val | md5sum is how you would calculate the md5sum, here is the awk script:
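$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # Shell out once per line; echo appends a newline, which md5sum hashes too.
    cmd = "echo " $1 " | md5sum"
    if ( (cmd | getline md5) > 0 ) {
        sub(/ .*/,"",md5)    # keep only the hash, dropping the trailing "  -"
    }
    else {
        printf "Warning: Failed to calculate md5sum of %s at input line %d\n", $1, NR | "cat>&2"
        md5 = "N/A"
    }
    close(cmd)
    $1 = $1 OFS "\"" md5 "\""    # insert the quoted hash right after field 1
    print
}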
$ awk -f tst.awk file
"value001","c36a5b774bfb2fd236331ac5ebef4266","value002","Value003","Value004","Value005","Value006","Value007"
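As stated elsewhere, because you jump in and out of a shell on every line, this will be slower than a tool that does the md5sum calculation internally.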
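One way to cut down on those shell round trips while staying in awk is to remember hashes already computed. The following is only a sketch under stated assumptions: md5sum is on the PATH, the first field never contains a comma, and first-field values repeat often enough for a cache to pay off (with all-unique values it saves nothing, and the cache grows with the number of distinct values). The name add_md5_cached is hypothetical. Unlike echo val | md5sum, it hashes the value without a trailing newline and quotes it for the shell, so its hashes differ from the output shown above.

```shell
# Hypothetical helper: insert the md5 of field 1 as a new second field,
# calling md5sum only once per distinct first-field value.
add_md5_cached() {
    awk '
    BEGIN { FS = OFS = ","; q = "\047" }      # \047 is the single-quote character
    {
        key = $1
        gsub(/^"|"$/, "", key)                # strip the surrounding double quotes
        if (!(key in cache)) {
            esc = key
            gsub(q, q "\\" q q, esc)          # make the value safe inside single quotes
            cmd = "printf %s " q esc q " | md5sum"
            if ((cmd | getline out) > 0) {
                sub(/ .*/, "", out)           # keep only the hash
            } else {
                out = "N/A"
            }
            close(cmd)
            cache[key] = out
        }
        $1 = $1 OFS "\"" cache[key] "\""      # insert the quoted hash after field 1
        print
    }' "$@"
}

# Usage: both lines share the first field, so md5sum runs only once.
printf '%s\n' '"hello","x"' '"hello","y"' | add_md5_cached
```

For an 80 GB file with mostly unique first fields, a tool that computes the hash in-process remains the faster route.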