Reading a tab-delimited file in bash without collapsing empty fields

Date: 2011-01-07 03:49:55

Tags: bash

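I'm trying to read a multi-line tab-separated file in bash. The format is one in which empty fields are expected. Unfortunately, the shell collapses adjacent field separators into one, like so: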

# IFS=$'\t'
# read one two three <<<$'one\t\tthree'
# printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <three> <>

...rather than the desired output of <one> <> <three>.

Can this be solved without resorting to a separate language (such as awk)?

6 answers:

Answer 0 (score: 11)

Sure:


IFS=,
echo $'one\t\tthree' | tr \\11 , | (
  read one two three
  printf '<%s> ' "$one" "$two" "$three"; printf '\n'
)

I rearranged the example a bit, but only to make it work in any POSIX shell.

Update: yes, it seems that whitespace is special, at least in IFS. See the second half of this paragraph from bash(1):

   The shell treats each character of IFS as a delimiter, and  splits  the
   results of the other expansions into words on these characters.  If IFS
   is unset, or its value is exactly <space><tab><newline>,  the  default,
   then  any  sequence  of IFS characters serves to delimit words.  If IFS
   has a value other than the default, then sequences  of  the  whitespace
   characters  space  and  tab are ignored at the beginning and end of the
   word, as long as the whitespace character is in the value  of  IFS  (an
   IFS whitespace character).  Any character in IFS that is not IFS white-
   space, along with any adjacent IFS whitespace  characters,  delimits  a
   field.   A  sequence  of IFS whitespace characters is also treated as a
   delimiter.  If the value of IFS is null, no word splitting occurs.
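
The rules quoted above can be checked with a few one-liners. This is only a minimal sketch: the variable names a through i are arbitrary placeholders, and the third case is an added illustration of the "adjacent IFS whitespace" sentence rather than something taken from the question.

```bash
#!/usr/bin/env bash
line=$'one\t\tthree'

# Tab is IFS whitespace: the two adjacent tabs act as a single delimiter.
IFS=$'\t' read -r a b c <<< "$line"
printf 'tab:   <%s> <%s> <%s>\n' "$a" "$b" "$c"

# Comma is not IFS whitespace: every comma delimits a field, so the empty one survives.
IFS=, read -r d e f <<< 'one,,three'
printf 'comma: <%s> <%s> <%s>\n' "$d" "$e" "$f"

# A non-whitespace delimiter absorbs adjacent IFS whitespace: ' , ' is one delimiter.
IFS=' ,' read -r g h i <<< 'one , three'
printf 'mixed: <%s> <%s> <%s>\n' "$g" "$h" "$i"
```

Because IFS is set only as a prefix to each read, the shell's global IFS is left untouched.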

Answer 1 (score: 4)

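There's no need to use tr, but IFS must be a non-whitespace character (otherwise runs of delimiters get collapsed into one, as you've seen).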

$ IFS=, read -r one two three <<<'one,,three'
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>

$ var=$'one\t\tthree'
$ var=${var//$'\t'/,}
$ IFS=, read -r one two three <<< "$var"
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>

$ idel=$'\t' odel=','
$ var=$'one\t\tthree'
$ var=${var//$idel/$odel}
$ IFS=$odel read -r one two three <<< "$var"
$ printf '<%s> ' "$one" "$two" "$three"; printf '\n'
<one> <> <three>
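
Since the question is about a multi-line file, the same substitution can be applied line by line in a loop. The following is a rough sketch, not part of the original answer: print_tsv_rows and sep are hypothetical names, and it assumes the ASCII unit separator (0x1f) never appears in the data.

```bash
#!/usr/bin/env bash
# print_tsv_rows is a hypothetical helper: it reads tab-separated lines from
# stdin and prints three fields per line, keeping empty fields intact.
print_tsv_rows() {
    local tab=$'\t' sep=$'\x1f'   # assumption: 0x1f never occurs in the input
    local line one two three
    while IFS= read -r line; do
        # Swap each tab for the non-whitespace separator, then split on it.
        IFS=$sep read -r one two three <<< "${line//$tab/$sep}"
        printf '<%s> <%s> <%s>\n' "$one" "$two" "$three"
    done
}

print_tsv_rows <<< $'one\t\tthree\n\tmiddle\t'
```

With more fields than variables, the last variable would receive the remainder including the separator characters, so the variable list should match the expected column count.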

Answer 2 (score: 3)

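I've written a function that works around this issue. This particular implementation is specific to tab-separated columns and newline-separated rows, but removing that limitation is a straightforward exercise: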

read_tdf_line() {
    local default_ifs=$' \t\n'
    local n line element at_end old_ifs
    old_ifs="${IFS:-${default_ifs}}"
    IFS=$'\n'

    if ! read -r line ; then
        IFS="$old_ifs"   # restore IFS before the early return as well
        return 1
    fi
    at_end=0
    while read -r element; do
        if (( $# > 1 )); then
            printf -v "$1" '%s' "$element"
            shift
        else
            if (( at_end )) ; then
                # replicate read behavior of assigning all excess content
                # to the last variable given on the command line
                printf -v "$1" '%s\t%s' "${!1}" "$element"
            else
                printf -v "$1" '%s' "$element"
                at_end=1
            fi
        fi
    done < <(tr '\t' '\n' <<<"$line")

    # if other arguments exist on the end of the line after all
    # input has been eaten, they need to be blanked
    if ! (( at_end )) ; then
        while (( $# )) ; do
            printf -v "$1" '%s' ''
            shift
        done
    fi

    # reset IFS to its original value (or the default, if it was
    # formerly unset)
    IFS="$old_ifs"
}

Usage is as follows:

# read_tdf_line one two three rest <<<$'one\t\tthree\tfour\tfive'
# printf '<%s> ' "$one" "$two" "$three" "$rest"; printf '\n'
<one> <> <three> <four       five>

Answer 3 (score: 3)

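Here's an approach with some niceties:

  • Input data is available from anywhere in the main code (avoids the common problem of data being available only within one stage of a pipeline).
  • No use of awk, tr, or other external programs
  • A get/put accessor pair to hide the hairier syntax
  • Uses parameter matching rather than IFS=
  • Works on tab-delimited lines

The code. file_data and file_input exist only to generate input, as though it came from an external command called by the script. data and cols could be passed as parameters to the get and put calls, and so on, but this script doesn't go that far.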

#!/bin/bash

file_data=( $'\t\t'       $'\t\tbC'     $'\tcB\t'     $'\tdB\tdC'   \
            $'eA\t\t'     $'fA\t\tfC'   $'gA\tgB\t'   $'hA\thB\thC' )
file_input () { printf '%s\n' "${file_data[@]}" ; }  # simulated input file
delim=$'\t'

# the IFS=$'\n' has a side-effect of skipping blank lines; acceptable:
OIFS="$IFS" ; IFS=$'\n' ; oset="$-" ; set -f
lines=($(file_input))                    # read the "file"
# cleanup: set -"$oset" cannot switch noglob back off, so do that explicitly
case "$oset" in *f*) ;; *) set +f ;; esac ; IFS="$OIFS" ; unset oset

# the read-in data has (rows * cols) fields, with cols as the stride:
data=()
cols=0
get () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; echo "${data[$i]}" ; }
put () { local r=$1 c=$2 i ; (( i = cols * r + c )) ; data[$i]="$3" ; }

# convert the lines from input into the pseudo-2D data array:
i=0 ; row=0 ; col=0
for line in "${lines[@]}" ; do
    line="$line$delim"
    while [ -n "$line" ] ; do
        case "$line" in
            *${delim}*) data[$i]="${line%%${delim}*}" ; line="${line#*${delim}}" ;;
            *)          data[$i]="${line}"            ; line=                     ;;
        esac
        (( ++i ))
    done
    [ 0 = "$cols" ] && (( cols = i )) 
done
rows=${#lines[@]}

# output the data array as a matrix, using the get accessor
for    (( row=0 ; row < rows ; ++row )) ; do
   printf 'row %2d: ' $row
   for (( col=0 ; col < cols ; ++col )) ; do
       printf '%5s ' "$(get $row $col)"
   done
   printf '\n'
done

Output:

$ ./tabtest 
row  0:                   
row  1:                bC 
row  2:          cB       
row  3:          dB    dC 
row  4:    eA             
row  5:    fA          fC 
row  6:    gA    gB       
row  7:    hA    hB    hC 
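
The parameter-expansion splitting at the heart of the script above can be isolated into a small function. This is a sketch under assumptions, not the answer's own code: split_on_tab and the global array fields are hypothetical names, and the delimiter is hard-coded to a tab.

```bash
#!/usr/bin/env bash
# split_on_tab (hypothetical name) splits its first argument on tabs and
# stores the pieces, empty ones included, in the global array "fields".
split_on_tab() {
    local rest=$1 tab=$'\t'
    fields=()
    rest+=$tab                          # sentinel so the last field is terminated too
    while [ -n "$rest" ]; do
        fields+=("${rest%%"$tab"*}")    # text before the first tab
        rest=${rest#*"$tab"}            # drop everything through the first tab
    done
}

split_on_tab $'eA\t\teC'
printf '%d fields:' "${#fields[@]}"; printf ' <%s>' "${fields[@]}"; printf '\n'
```

Appending the sentinel tab mirrors the line="$line$delim" step in the script, which is what lets a trailing empty field be counted.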

Answer 4 (score: 3)

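Here's a quick and simple function I use that avoids calling external programs or restricting the range of input characters. It works in bash only (I guess).

If it needs to allow for more variables than fields, though, it has to be modified along the lines of Charles Duffy's answer.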

# Substitute for `read -r' that doesn't merge adjacent delimiters.
myread() {
        local input
        IFS= read -r input || return $?
        while [[ "$#" -gt 1 ]]; do
                IFS= read -r "$1" <<< "${input%%[$IFS]*}"
                input="${input#*[$IFS]}"
                shift
        done
        IFS= read -r "$1" <<< "$input"
}
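
One possible way to handle the case noted above, where there are more variables than fields, is to blank the leftover variables. The following is only a sketch of such a modification: myread_pad is a hypothetical name, and unlike myread it hard-codes the tab delimiter instead of using $IFS.

```bash
#!/usr/bin/env bash
# myread_pad (hypothetical): like myread, but variables beyond the last field
# are set to the empty string instead of repeating the final field.
myread_pad() {
    local input tab=$'\t'
    (( $# )) || return 0                  # nothing to assign
    IFS= read -r input || return          # propagate read's failure status
    while (( $# > 1 )); do
        printf -v "$1" '%s' "${input%%"$tab"*}"
        case $input in
            *"$tab"*) input=${input#*"$tab"} ;;   # more fields remain
            *)        input= ;;                   # input exhausted
        esac
        shift
    done
    printf -v "$1" '%s' "$input"          # last variable gets the remainder
}

myread_pad one two three four <<< $'one\t\tthree'
printf '<%s> ' "$one" "$two" "$three" "$four"; printf '\n'
```

Caller variables named input or tab would collide with the function's locals, so real use would call for less common local names.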

Answer 5 (score: 0)

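To prevent empty fields from collapsing, you can use any delimiter other than the IFS "whitespace" characters.

Examples of the behavior with different delimiters: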

#!/bin/bash

for delimiter in  $'\t'  ','  '|'  $'\377'  $'\x1f'  ;do
  line="one${delimiter}${delimiter}three"
  IFS=$delimiter read one two three <<<"$line"
  printf '<%s> ' "$one" "$two" "$three"; printf '\n'
done

<one> <three> <>
<one> <> <three>
<one> <> <three>
<one> <> <three>
<one> <> <three>

Or, using the OP's original example:

IFS='|' read one two three <<<$(tr '\t' '|' <<<$'one\t\tthree')
printf '<%s> ' "$one" "$two" "$three"; printf '\n'

<one> <> <three>