我正在寻找从值中提取字符串并将其作为日期填充在另一列中的可能性。另外,我在这里有几种不同的情况。
场景1: 下面是用csv逗号分隔的内容。在这里,列文件名中的日期是字符串格式。因此,我需要grep该特定字符串,然后将其转换并填充为具有确切日期格式的新列。
filename filesize data_received_dt tname createdt
ccaa/01APR2018-revised/ 0 2019-01-17T06:16:59.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01APR2018/content_01APR2018-00000.csv 115814528 2018-12-05T23:38:10.000Z live 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv 116584541 2018-12-05T23:38:09.000Z test 2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/10JUL2018/content_10JUL2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
输出应如下所示。另外,根据要求,end_dt列的值与start_dt的值相同。
filename start_dt end_dt filesize data_received_dt name createdt
ccaa/01APR2018-revised/ 1-Apr-18 1-Apr-18 0 2019-01-17T06:16:59.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00000.csv 1-Apr-18 1-Apr-18 115814528 2018-12-05T23:38:10.000Z live 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv 1-Apr-18 1-Apr-18 116584541 2018-12-05T23:38:09.000Z test 2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv 1-Jun-18 1-Jun-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01JUL2018-revised/content_10JUL2018-00002.csv 10-Jul-18 10-Jul-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01AUG2018-revised/content_21AUG2018-00002.csv 21-Aug-18 21-Aug-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
方案2:
在这种情况下,文件名中的字符串格式已完全更改,格式为YYYYMM。
filename size date tname
ccaa/201802/ 0 2019-01-17T06:16:34.000Z sample
ccaa/201802/Feb2018000000_0.csv 32602738 2018-09-11T04:05:38.000Z live
ccaa/201802/Feb2018000001_0.csv 32602738 2018-09-11T04:05:38.000Z test
ccaa/201802/Feb2018000002_0.csv 32602738 2018-09-11T04:05:38.000Z sample
ccaa/201802/Feb2018000003_0.csv 32602187 2018-09-11T04:05:38.000Z sample
这里,棘手的事情之一是基于YYYYMM格式,需要用30天的日期范围填充start_dt和end_dt列。请参阅下面的内容
filename start_dt end_dt size date tname
ccaa/201802/ 0 2019-01-17T06:16:34.000Z sample
ccaa/201803/March2018000000_0.csv 1-Mar-18 31-Mar-18 32602738 2018-09-11T04:05:38.000Z live
ccaa/201804/Apr2018000001_0.csv 1-Apr-18 30-Apr-18 32602738 2018-09-11T04:05:38.000Z test
ccaa/201805/May2018000002_0.csv 1-May-18 31-May-18 32602738 2018-09-11T04:05:38.000Z sample
ccaa/201808/Aug2018000003_0.csv 1-Aug-18 30-Aug-18 32602187 2018-09-11T04:05:38.000Z sample
方案3:
另一种情况是获取字符串(2018_Q1)。并且需要根据关键字Q1,Q2,Q3和Q4每季度填充一次。
输出如下所示
filename start_dt end_dt size date tname
ccll/2018_Q1/ 0 2019-01-17T06:16:34.000Z sample
ccll/2018_Q1/March2018000000_0.csv 1-Jan-18 31-Mar-18 32602738 2018-09-11T04:05:38.000Z live
ccll/2018_Q2/Apr2018000001_0.csv 1-Apr-18 30-Jun-18 32602738 2018-09-11T04:05:38.000Z test
ccll/2018_Q3/May2018000002_0.csv 1-Jul-18 30-Sep-18 32602738 2018-09-11T04:05:38.000Z sample
答案 0 :(得分:0)
代码注释:
#!/bin/bash
# Scenario 1
echo 'filename filesize data_received_dt tname createdt
ccaa/01APR2018-revised/ 0 2019-01-17T06:16:59.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01APR2018/content_01APR2018-00000.csv 115814528 2018-12-05T23:38:10.000Z live 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv 116584541 2018-12-05T23:38:09.000Z test 2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/10JUL2018/content_10JUL2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
' |
# remove first line with headers
tail -n +2 |
# for each line
while
IFS=' ' read -r name size received_dt tname create_dt &&
# stop on empty lines
[ -n "$name" ]
do
# get the second directory name from the name
# this is a smarty way of getting the last second field from the right
dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)
# if the 2nd dir doesn't end with -revised, add -revised
# (I think this could be just one sed command)
if ! <<<"$dir" grep -q -- "-revised$"; then
dir2=$(dirname "$(dirname "$name")")
dir="${dir}-revised"
name=$dir2/$dir/$(basename "$name")
fi
# extract date data from the dir
day_from_dir=${dir:0:2}
month_from_dir="${dir:2:1}$(<<<"${dir:3:2}" tr [:upper:] [:lower:])"
year_from_dir="20${dir:7:2}"
# get start and end dates
start_dt=$(
LC_ALL=C date \
--date="${day_from_dir} ${month_from_dir} ${year_from_dir} 00:00:00" \
+%-d-%b-%g
)
end_dt=$start_dt
# printf ouput
printf "%s %s %s %s %s %s %s\n" \
"$name" \
"$start_dt" \
"$end_dt" \
"$size" \
"$received_dt" \
"$tname" \
"$create_dt"
done |
# format the output - left justify and set column names
column -t -s ' ' -o ' ' -N \
"filename,start_dt,end_dt,filesize,data_received_dt,name,createdt"
# Scenario 2
echo 'filename size date tname
ccaa/201802/ 0 2019-01-17T06:16:34.000Z sample
ccaa/201802/Feb2018000000_0.csv 32602738 2018-09-11T04:05:38.000Z live
ccaa/201804/Feb2018000001_0.csv 32602738 2018-09-11T04:05:38.000Z test
ccaa/201805/Feb2018000002_0.csv 32602738 2018-09-11T04:05:38.000Z sample
ccaa/201806/Feb2018000003_0.csv 32602187 2018-09-11T04:05:38.000Z sample
' |
tail -n +2 |
while
IFS=' ' read -r name size date tname && [ -n "$name" ]
do
# get the second directory name from the name
# this is a smarty way of getting the last second field from the right
dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)
# extract date from dir
year=${dir:0:4}
month=${dir:4:2}
ts="${year}-${month}-01T00:00:00-00:00"
# set date format
start_dt=$(
LC_ALL=C date \
--date="$ts" \
+%-d-%b-%g
)
# we want last month day - add 1 month and subtract 1 day
end_dt=$(
LC_ALL=C date \
--date="$ts +1 month -1 day" \
+%-d-%b-%g
)
# and output
printf "%s %s %s %s %s %s\n" \
"$name" \
"$start_dt" \
"$end_dt" \
"$size" \
"$date" \
"$tname"
done |
column -t -s ' ' -o ' ' -N \
"filename,start_dt,end_dt,size,date,tname"
# Scenario 3 - same as 2 but different naming scheme or smth
echo 'filename size date tname
ccaa/201802/ 0 2019-01-17T06:16:34.000Z sample
ccaa/201802/Feb2018000000_0.csv 32602738 2018-09-11T04:05:38.000Z live
ccaa/201804/Feb2018000001_0.csv 32602738 2018-09-11T04:05:38.000Z test
ccaa/201805/Feb2018000002_0.csv 32602738 2018-09-11T04:05:38.000Z sample
ccaa/201808/Feb2018000003_0.csv 32602187 2018-09-11T04:05:38.000Z sample
' |
tail -n +2 |
while
IFS=' ' read -r name size date tname &&
[ -n "$name" ]
do
# get the second directory name from the name
# this is a smarty way of getting the last second field from the right
dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)
#extract date from dir
year=${dir:0:4}
month=${dir:4:2}
ts="${year}-${month}-01T00:00:00-00:00"
quarter=$(
LC_ALL=C date \
--date="$ts" \
+%q
)
#rename the file with the year-Qq
tmp="$(dirname "$(dirname "$name")")/${year}-Q${quarter}/"
if ! <<<"$name" grep -q "/$"; then
name="${tmp}$(basename "$name")"
fi
# set date format
start_dt=$(
LC_ALL=C date \
--date="$ts" \
+%-d-%b-%g
)
# we want last month day - add 1 month and subtract 1 day
end_dt=$(
LC_ALL=C date \
--date="$ts +1 month -1 day" \
+%-d-%b-%g
)
# and output
printf "%s %s %s %s %s %s\n" \
"$name" \
"$start_dt" \
"$end_dt" \
"$size" \
"$date" \
"$tname"
done |
# add create nice looking table with header names
column -t -s ' ' -o ' ' -N \
"filename,start_dt,end_dt,size,date,tname"
jdoodle的输出:
filename start_dt end_dt filesize data_received_dt name createdt
ccaa/01APR2018-revised/ 1-Apr-18 1-Apr-18 0 2019-01-17T06:16:59.000Z sample 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00000.csv 1-Apr-18 1-Apr-18 115814528 2018-12-05T23:38:10.000Z live 2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv 1-Apr-18 1-Apr-18 116584541 2018-12-05T23:38:09.000Z test 2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv 1-Jun-18 1-Jun-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/10JUL2018-revised/content_10JUL2018-00002.csv 10-Jul-18 10-Jul-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv 21-Aug-18 21-Aug-18 117363985 2018-12-05T23:38:09.000Z sample 2018-03-15T09:51:36.000Z
filename start_dt end_dt size date tname
ccaa/201802/ 1-Feb-18 28-Feb-18 0 2019-01-17T06:16:34.000Z sample
ccaa/201802/Feb2018000000_0.csv 1-Feb-18 28-Feb-18 32602738 2018-09-11T04:05:38.000Z live
ccaa/201804/Feb2018000001_0.csv 1-Apr-18 30-Apr-18 32602738 2018-09-11T04:05:38.000Z test
ccaa/201805/Feb2018000002_0.csv 1-May-18 31-May-18 32602738 2018-09-11T04:05:38.000Z sample
ccaa/201806/Feb2018000003_0.csv 1-Jun-18 30-Jun-18 32602187 2018-09-11T04:05:38.000Z sample
filename start_dt end_dt size date tname
ccaa/201802/ 1-Feb-18 28-Feb-18 0 2019-01-17T06:16:34.000Z sample
ccaa/2018-Q1/Feb2018000000_0.csv 1-Feb-18 28-Feb-18 32602738 2018-09-11T04:05:38.000Z live
ccaa/2018-Q2/Feb2018000001_0.csv 1-Apr-18 30-Apr-18 32602738 2018-09-11T04:05:38.000Z test
ccaa/2018-Q2/Feb2018000002_0.csv 1-May-18 31-May-18 32602738 2018-09-11T04:05:38.000Z sample
ccaa/2018-Q3/Feb2018000003_0.csv 1-Aug-18 31-Aug-18 32602187 2018-09-11T04:05:38.000Z sample
$( ... )
。