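Is it possible to extract a string from a value and convert it to a date value?

Time: 2019-01-18 20:08:30

Tags: shell

I am looking for a way to extract a string from a value and populate it as a date in another column. I also have several different scenarios.

Scenario 1: Below is the content of a comma-separated CSV file. The date in the filename column is embedded as a string, so I need to grep that specific string, convert it, and populate new columns with it in a proper date format.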

filename    filesize    data_received_dt    tname   createdt
ccaa/01APR2018-revised/ 0   2019-01-17T06:16:59.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/01APR2018/content_01APR2018-00000.csv  115814528   2018-12-05T23:38:10.000Z    live    2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv  116584541   2018-12-05T23:38:09.000Z    test    2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/10JUL2018/content_10JUL2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z

The output should look like this. Also, per the requirement, the end_dt column has the same value as start_dt.

filename                                            start_dt    end_dt      filesize    data_received_dt            name    createdt
ccaa/01APR2018-revised/                             1-Apr-18    1-Apr-18    0           2019-01-17T06:16:59.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00000.csv  1-Apr-18    1-Apr-18    115814528   2018-12-05T23:38:10.000Z    live    2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv  1-Apr-18    1-Apr-18    116584541   2018-12-05T23:38:09.000Z    test    2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv  1-Jun-18    1-Jun-18    117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/10JUL2018-revised/content_10JUL2018-00002.csv  10-Jul-18   10-Jul-18   117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv  21-Aug-18   21-Aug-18   117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z

Scenario 2:

In this scenario the string format in the filename is completely different: it is YYYYMM.

filename                        size        date                        tname
ccaa/201802/                    0           2019-01-17T06:16:34.000Z    sample
ccaa/201802/Feb2018000000_0.csv 32602738    2018-09-11T04:05:38.000Z    live
ccaa/201802/Feb2018000001_0.csv 32602738    2018-09-11T04:05:38.000Z    test
ccaa/201802/Feb2018000002_0.csv 32602738    2018-09-11T04:05:38.000Z    sample
ccaa/201802/Feb2018000003_0.csv 32602187    2018-09-11T04:05:38.000Z    sample

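The tricky part here is that, based on the YYYYMM value, the start_dt and end_dt columns need to be populated with a date range of roughly 30 days (the first and last day of that month). See below: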

filename                            start_dt    end_dt      size        date                        tname
ccaa/201802/                                                0           2019-01-17T06:16:34.000Z    sample
ccaa/201803/March2018000000_0.csv   1-Mar-18    31-Mar-18   32602738    2018-09-11T04:05:38.000Z    live
ccaa/201804/Apr2018000001_0.csv     1-Apr-18    30-Apr-18   32602738    2018-09-11T04:05:38.000Z    test
ccaa/201805/May2018000002_0.csv     1-May-18    31-May-18   32602738    2018-09-11T04:05:38.000Z    sample
ccaa/201808/Aug2018000003_0.csv     1-Aug-18    30-Aug-18   32602187    2018-09-11T04:05:38.000Z    sample

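Scenario 3:

Another scenario is a string such as 2018_Q1. Here the dates need to be populated per quarter, based on the keywords Q1, Q2, Q3 and Q4.

The output should look like this: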

filename                            start_dt    end_dt          size    date                        tname
ccll/2018_Q1/                                                   0       2019-01-17T06:16:34.000Z    sample
ccll/2018_Q1/March2018000000_0.csv  1-Jan-18    31-Mar-18   32602738    2018-09-11T04:05:38.000Z    live
ccll/2018_Q2/Apr2018000001_0.csv    1-Apr-18    30-Jun-18   32602738    2018-09-11T04:05:38.000Z    test
ccll/2018_Q3/May2018000002_0.csv    1-Jul-18    30-Sep-18   32602738    2018-09-11T04:05:38.000Z    sample

1 answer:

Answer 0 (score: 0)

The explanation is in the code comments:

#!/bin/bash

# Scenario 1

echo 'filename    filesize    data_received_dt    tname   createdt
ccaa/01APR2018-revised/ 0   2019-01-17T06:16:59.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/01APR2018/content_01APR2018-00000.csv  115814528   2018-12-05T23:38:10.000Z    live    2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv  116584541   2018-12-05T23:38:09.000Z    test    2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/10JUL2018/content_10JUL2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv  117363985   2018-12-05T23:38:09.000Z    sample  2018-03-15T09:51:36.000Z
' |
# remove first line with headers
tail -n +2 |
# for each line
while 
    IFS=' ' read -r name size received_dt tname create_dt &&
    # stop on empty lines
    [ -n "$name" ]
do

    # get the parent directory name (second field from the right)
    # trick: reverse the string, take field 2, then reverse it back
    dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)

    # if the 2nd dir doesn't end with -revised, add -revised 
    # (I think this could be just one sed command)
    if ! <<<"$dir" grep -q -- "-revised$"; then
        dir2=$(dirname "$(dirname "$name")")
        dir="${dir}-revised"
        name=$dir2/$dir/$(basename "$name")
    fi

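    # extract date data from the dir (layout: DDMMMYYYY, e.g. 01APR2018)
    day_from_dir=${dir:0:2}
    # capitalize the month abbreviation: APR -> Apr
    month_from_dir="${dir:2:1}$(<<<"${dir:3:2}" tr '[:upper:]' '[:lower:]')"
    year_from_dir=${dir:5:4}

    # get start and end dates
    # (%y is the calendar year; %g is the ISO week-based year and can differ
    # around New Year)
    start_dt=$(
        LC_ALL=C date \
        --date="${day_from_dir} ${month_from_dir} ${year_from_dir} 00:00:00" \
        +%-d-%b-%y
    )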
    end_dt=$start_dt

    # print the output row
    printf "%s %s %s %s %s %s %s\n" \
        "$name" \
        "$start_dt" \
        "$end_dt" \
        "$size" \
        "$received_dt" \
        "$tname" \
        "$create_dt"

done |
# format the output - left justify and set column names
# (the -o and -N options need column from util-linux)
column -t -s ' ' -o '  ' -N \
    "filename,start_dt,end_dt,filesize,data_received_dt,name,createdt"


# Scenario 2

echo 'filename                        size        date                        tname
ccaa/201802/                    0           2019-01-17T06:16:34.000Z    sample
ccaa/201802/Feb2018000000_0.csv 32602738    2018-09-11T04:05:38.000Z    live
ccaa/201804/Feb2018000001_0.csv 32602738    2018-09-11T04:05:38.000Z    test
ccaa/201805/Feb2018000002_0.csv 32602738    2018-09-11T04:05:38.000Z    sample
ccaa/201806/Feb2018000003_0.csv 32602187    2018-09-11T04:05:38.000Z    sample
' |
tail -n +2 |
while 
    IFS=' ' read -r name size date tname && [ -n "$name" ]
do
    # get the parent directory name (second field from the right)
    # trick: reverse the string, take field 2, then reverse it back
    dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)

    # extract date from dir (layout: YYYYMM)
    year=${dir:0:4}
    month=${dir:4:2}
    ts="${year}-${month}-01T00:00:00-00:00"

    # set date format
    # --utc keeps the result on the same calendar day as the UTC timestamp,
    # whatever the local time zone is
    start_dt=$(
        LC_ALL=C date --utc \
        --date="$ts" \
        +%-d-%b-%y
    )
    # we want the last day of the month - add 1 month and subtract 1 day
    end_dt=$(
        LC_ALL=C date --utc \
        --date="$ts +1 month -1 day" \
        +%-d-%b-%y
    )

    # and output
    printf "%s %s %s %s %s %s\n" \
        "$name" \
        "$start_dt" \
        "$end_dt" \
        "$size" \
        "$date" \
        "$tname"

done |
column -t -s ' ' -o '  ' -N \
"filename,start_dt,end_dt,size,date,tname"

# Scenario 3 - same as 2, but the YYYYMM directory is renamed to YYYY-Qn

echo 'filename                        size        date                        tname
ccaa/201802/                    0           2019-01-17T06:16:34.000Z    sample
ccaa/201802/Feb2018000000_0.csv 32602738    2018-09-11T04:05:38.000Z    live
ccaa/201804/Feb2018000001_0.csv 32602738    2018-09-11T04:05:38.000Z    test
ccaa/201805/Feb2018000002_0.csv 32602738    2018-09-11T04:05:38.000Z    sample
ccaa/201808/Feb2018000003_0.csv 32602187    2018-09-11T04:05:38.000Z    sample
' |
tail -n +2 |
while 
    IFS=' ' read -r name size date tname &&
    [ -n "$name" ]
do
    # get the parent directory name (second field from the right)
    # trick: reverse the string, take field 2, then reverse it back
    dir=$(<<<"$name" rev | cut -d'/' -f2 | rev)


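    # extract date from dir (layout: YYYYMM)
    year=${dir:0:4}
    month=${dir:4:2}
    ts="${year}-${month}-01T00:00:00-00:00"
    # --utc keeps the result in the same month as the UTC timestamp,
    # whatever the local time zone is
    quarter=$(
        LC_ALL=C date --utc \
        --date="$ts" \
        +%q
    )

    # rename the directory part of the file path to YEAR-Qn
    # (entries that are directories themselves are left unchanged)
    tmp="$(dirname "$(dirname "$name")")/${year}-Q${quarter}/"
    if ! <<<"$name" grep -q "/$"; then
        name="${tmp}$(basename "$name")"
    fi

    # set date format
    # (%y is the calendar year; %g is the ISO week-based year)
    start_dt=$(
        LC_ALL=C date --utc \
        --date="$ts" \
        +%-d-%b-%y
    )
    # we want the last day of the month - add 1 month and subtract 1 day
    end_dt=$(
        LC_ALL=C date --utc \
        --date="$ts +1 month -1 day" \
        +%-d-%b-%y
    )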

    # and output
    printf "%s %s %s %s %s %s\n" \
        "$name" \
        "$start_dt" \
        "$end_dt" \
        "$size" \
        "$date" \
        "$tname"

done |
# create a nice-looking table with header names
column -t -s ' ' -o '  ' -N \
"filename,start_dt,end_dt,size,date,tname"

Output from jdoodle:

filename                                            start_dt   end_dt     filesize   data_received_dt          name    createdt
ccaa/01APR2018-revised/                             1-Apr-18   1-Apr-18   0          2019-01-17T06:16:59.000Z  sample  2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00000.csv  1-Apr-18   1-Apr-18   115814528  2018-12-05T23:38:10.000Z  live    2018-03-15T09:51:36.000Z
ccaa/01APR2018-revised/content_01APR2018-00001.csv  1-Apr-18   1-Apr-18   116584541  2018-12-05T23:38:09.000Z  test    2018-03-15T09:51:36.000Z
ccaa/01JUN2018-revised/content_01JUN2018-00002.csv  1-Jun-18   1-Jun-18   117363985  2018-12-05T23:38:09.000Z  sample  2018-03-15T09:51:36.000Z
ccaa/10JUL2018-revised/content_10JUL2018-00002.csv  10-Jul-18  10-Jul-18  117363985  2018-12-05T23:38:09.000Z  sample  2018-03-15T09:51:36.000Z
ccaa/21AUG2018-revised/content_21AUG2018-00002.csv  21-Aug-18  21-Aug-18  117363985  2018-12-05T23:38:09.000Z  sample  2018-03-15T09:51:36.000Z
filename                         start_dt  end_dt     size      date                      tname
ccaa/201802/                     1-Feb-18  28-Feb-18  0         2019-01-17T06:16:34.000Z  sample
ccaa/201802/Feb2018000000_0.csv  1-Feb-18  28-Feb-18  32602738  2018-09-11T04:05:38.000Z  live
ccaa/201804/Feb2018000001_0.csv  1-Apr-18  30-Apr-18  32602738  2018-09-11T04:05:38.000Z  test
ccaa/201805/Feb2018000002_0.csv  1-May-18  31-May-18  32602738  2018-09-11T04:05:38.000Z  sample
ccaa/201806/Feb2018000003_0.csv  1-Jun-18  30-Jun-18  32602187  2018-09-11T04:05:38.000Z  sample
filename                          start_dt  end_dt     size      date                      tname
ccaa/201802/                      1-Feb-18  28-Feb-18  0         2019-01-17T06:16:34.000Z  sample
ccaa/2018-Q1/Feb2018000000_0.csv  1-Feb-18  28-Feb-18  32602738  2018-09-11T04:05:38.000Z  live
ccaa/2018-Q2/Feb2018000001_0.csv  1-Apr-18  30-Apr-18  32602738  2018-09-11T04:05:38.000Z  test
ccaa/2018-Q2/Feb2018000002_0.csv  1-May-18  31-May-18  32602738  2018-09-11T04:05:38.000Z  sample
ccaa/2018-Q3/Feb2018000003_0.csv  1-Aug-18  31-Aug-18  32602187  2018-09-11T04:05:38.000Z  sample
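
The third scenario in the question uses directory names such as 2018_Q1 and asks for the first and last day of the quarter, while the script above derives the quarter from a YYYYMM directory and prints month boundaries. The following is only a minimal sketch of how the quarter range could be computed, assuming GNU date. The function name quarter_range is hypothetical and not part of the original answer.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer):
# print "start_dt end_dt" for a directory token such as 2018_Q1.
# Assumes GNU date (--date, --utc and the %-d no-padding flag).
quarter_range() {
    local token=$1 year q first_month first

    # accept only YYYY_Q1 .. YYYY_Q4
    case $token in
        [0-9][0-9][0-9][0-9]_Q[1-4]) ;;
        *) return 1 ;;
    esac

    year=${token%_Q*}
    q=${token#*_Q}

    # quarter n starts in month 3n-2: Q1 -> 01, Q2 -> 04, Q3 -> 07, Q4 -> 10
    first_month=$(printf '%02d' $(( q * 3 - 2 )))
    first="${year}-${first_month}-01"

    # last day of the quarter = first day + 3 months - 1 day
    # %y is the calendar year (%g would be the ISO week-based year)
    printf '%s %s\n' \
        "$(LC_ALL=C date --utc --date="$first" +%-d-%b-%y)" \
        "$(LC_ALL=C date --utc --date="$first +3 months -1 day" +%-d-%b-%y)"
}

quarter_range 2018_Q1   # 1-Jan-18 31-Mar-18
quarter_range 2018_Q4   # 1-Oct-18 31-Dec-18
```

Inside the read loop, the two values could be picked up with something like read -r start_dt end_dt < <(quarter_range "$dir"). A non-zero return status signals that the directory name does not follow the YYYY_Qn layout.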
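  • Do not use backticks (`) for command substitution: they are awkward to nest, hard to read, and considered legacy syntax. Use $( ... ) instead.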
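
To illustrate the nesting point, here is a small sketch that computes the top-level directory of one of the sample paths in both styles. The variable names top and top_legacy are made up for this example.

```shell
#!/bin/bash
name='ccaa/10JUL2018/content_10JUL2018-00002.csv'

# $( ... ) nests without any escaping and can be quoted at every level
top=$(dirname "$(dirname "$name")")

# with backticks, the inner pair has to be escaped with backslashes;
# the inner result is left unquoted here, so paths with spaces would break
top_legacy=`dirname \`dirname "$name"\``

echo "$top"          # ccaa
echo "$top_legacy"   # ccaa
```

Both forms give the same result for this path, but each further level of nesting in the backtick form needs another round of backslashes.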