Question

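I need to read data from a file and write it into multiple files, each smaller than 3 MB (the file sizes may differ). The important constraint is that the records for one agent must not be split across files. I am doing all of this in a while loop in a UNIX bash script.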

Input.csv
        Src,AgentNum,PhoneNum
        DWH,Agent_1234,phone1  
        NULL,NULL,phone2  
        NULL,NULL,phone3 
        DWH,Agent_5678,phone1 
        NULL,NULL,phone2 
        NULL,NULL,phone3
        DWH,Agent_9999,phone1 
        NULL,NULL,phone2 
        NULL,NULL,phone3

Desired Output -

Output1.csv (less than 3MB)
        Src,AgentNum,PhoneNum
        DWH,Agent_1234,phone1  
        NULL,NULL,phone2  
        NULL,NULL,phone3

Output2.csv (less than 3MB)
        Src,AgentNum,PhoneNum
        DWH,Agent_5678,phone1 
        NULL,NULL,phone2 
        NULL,NULL,phone3
        DWH,Agent_9999,phone1 
        NULL,NULL,phone2 
        NULL,NULL,phone3

Bash shell script

#!/bin/bash
BaseFileName=$(basename $FileName | cut -d. -f1)
Header=`head -1 $FileName`
MaxFileSize=$(( 3 * 1024 * 1024 ))

    sed 1d $FileName | 
    while read -r line
    do
        echo $line >> ${BaseFileName}_${FileSeq}.csv

        MatchCount=`echo $line | grep -c -E '^.DWH'`

        if [[ $MatchCount -eq 1 ]]
        then
            FileSizeBytes=`du -b ${BaseFileName}_${FileSeq}.csv | cut -f1`
            if [[ $FileSizeBytes -gt $MaxFileSize ]] 
            then
                #Add a header record to each file
                sed -i "1i ${Header}" ${BaseFileName}_${FileSeq}.csv
                FileSeq=$((FileSeq + 1))
            fi
        fi
    done

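This almost works, except that 1) it does not split the records as expected (some of an agent's records end up spread over multiple files), 2) it inserts the header record only into the first output file, and 3) it is too slow: a 10 MB file took 3 minutes, and my real file is 3 GB.

Can someone point out where I am going wrong? Is there a better way to handle this?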

Answer 1

This is a rough attempt. It is not as fast as a pure awk solution would be, but it is much, much faster than what you already have:

#!/bin/bash

# two external parameters: input file name, and max size in bytes (default to 3MB)
InputFile=$1
MaxFileSize=${2:-$(( 3 * 1024 * 1024 ))}

BaseName=${InputFile%.*} # strip extension
Ext=${InputFile##*.}     # store extension
FileSeq=0                # start output file at sequence 0

# redirect stdin from the input file, stdout to the first output file
exec <"$InputFile" || exit
exec >"${BaseName}.${FileSeq}.${Ext}" || exit

# read the header; copy it to the first output file, and initialize CurFileSize
IFS= read -r Header || exit
printf '%s\n' "$Header" || exit
CurFileSize=$(( ${#Header} + 1 ))

# ...then loop over our inputs, and copy appropriately
while IFS= read -r line; do
  if [[ $line = DWH,* ]] && (( CurFileSize > MaxFileSize )); then
    (( FileSeq++ ))
    exec >"${BaseName}.${FileSeq}.${Ext}" || exit
    printf '%s\n' "$Header" || exit
    CurFileSize=$(( ${#Header} + 1 ))
  fi
  printf '%s\n' "$line" || exit
  (( CurFileSize += ${#line} + 1 ))
done

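Notable changes:

No external tools are called at all: no sed, no basename, no du, no grep. Any time you write $() or ``, there is a nontrivial performance cost; these constructs should not be used inside tight loops unless there is no way to avoid them, and when using the ksh or bash extensions to the POSIX sh standard, it is rare to have no way to avoid them.
Redirections are performed only when a new output file needs to be opened. Instead of using >>"$filename" every time we write a line, we use exec >"$filename" each time we need to start a new output file.
Parameter expansions are always quoted, except in contexts where the syntax itself explicitly suppresses string splitting and globbing. Failing to quote can corrupt your files (for example, replacing a * with a list of the files in the current directory, or replacing a tab with a space). When in doubt, quote more.
printf '%s\n' is used because its behavior is better defined by the POSIX standard than that of echo; see the standard definition for echo, especially the APPLICATION USAGE section.
We do explicit error handling. One could also use set -e, but there are substantial caveats to its use.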
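
As a small illustration of the first and third points, the sketch below derives a base name, an extension, and a line size using only shell built-ins. The sample path and the variable names are made up for this example and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical sample path (note the embedded space), for illustration only.
path='/data/in/agents list.csv'

# Parameter expansion does the work of basename and cut without forking.
file=${path##*/}   # drop the longest prefix ending in "/"   -> agents list.csv
base=${file%.*}    # drop the shortest suffix starting at "." -> agents list
ext=${file##*.}    # keep only what follows the last "."      -> csv

# ${#var} is the length in characters, so no du or wc call is needed.
# It equals the byte count only for single-byte text (or under LC_ALL=C).
line='DWH,Agent_1234,phone1'
size=$(( ${#line} + 1 ))   # +1 for the trailing newline

# Quoted expansions keep the embedded space intact; printf prints them verbatim.
printf 'base=%s ext=%s size=%d\n' "$base" "$ext" "$size"
```

Because ${#line} counts characters rather than bytes, the size tracked by the script above is only approximate for multibyte data unless the script runs under LC_ALL=C.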

The test procedure and its output follow:

$ cat >input.csv <<'EOF'
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
EOF

$ ./splitCSV input.csv 100  ## split at first boundary after 100 bytes

$ cat input.0.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3

$ cat input.1.csv
Src,AgentNum,PhoneNum
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
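
For comparison, the pure awk approach mentioned at the top of this answer might look roughly like the sketch below. It is not a tested replacement for the script above: split_csv is a hypothetical function name, and it assumes, as the bash version does, that every agent's group starts with a line beginning with DWH, and that the file name contains no backslashes (awk -v interprets escape sequences in the value).

```shell
#!/bin/bash
# Sketch only: split_csv is a hypothetical helper, not part of the answer above.
# Usage: split_csv input.csv [max_bytes]
split_csv() {
  local input=$1 max=${2:-$(( 3 * 1024 * 1024 ))}
  local base=${input%.*} ext=${input##*.}

  # LC_ALL=C makes length() count bytes rather than characters.
  LC_ALL=C awk -v max="$max" -v base="$base" -v ext="$ext" '
    BEGIN { seq = 0 }
    # Close the current output file (if any), open the next one, write the header.
    function start_file() {
      if (out != "") close(out)
      out = base "." seq "." ext
      seq++
      print header > out
      size = length(header) + 1
    }
    NR == 1 { header = $0; start_file(); next }
    # Switch files only on an agent boundary, once the size limit is exceeded.
    /^DWH,/ && size > max { start_file() }
    { print > out; size += length($0) + 1 }
  ' "$input"
}
```

Called as split_csv input.csv 100 on the sample above, this follows the same splitting rule as the bash loop: a new file starts at the first DWH line seen after the current file has grown past the limit.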

Problem with a while loop in a bash script (splitting a file into multiple files)