How to remove extra newline characters from a pipe-delimited file, except the newline that ends each record?

Time: 2017-05-17 09:10:30

Tags: shell unix awk sed scripting

I have a sample file containing the following data:

No|Name|sal  
1|abc|4500  
2|gkdjkh|554  
3|fgh  
cvb|678  
4|tyu|789  
5|ghl  
tyu|5677  
6|yyui  
tyui  
uui|780  
7|tpo|567  

I need the output to look like this:

No|Name|sal  
1|abc|4500  
2|gkdjkh|554  
3|fgh cvb|678  
4|tyu|789  
5|ghl tyu|5677  
6|yyui tyui uui|780  
7|tpo|567  

4 Answers:

Answer 0 (score: 0)

Perl rather than sed seems to work fine in my tests, and it handles this better than sed:

$ perl -pe 's/^[0-9]+[|]/\0$&/g; s/\n/ /g; s/^\0/\n/g' file
No|Name|sal 
1|abc|4500 
2|gkdjkh|554 
3|fgh cvb|678 
4|tyu|789 
5|ghl tyu|5677 
6|yyui tyui uui|780 
7|tpo|567 

Answer 1 (score: 0)

An awk solution (based on processing each next line of the input file):

The rearrange_fields.awk script:

#!/bin/awk -f
BEGIN{ FS="|" } 
{
    if (NR == 1) {print $0}  # print the first header line as is
    else {
        if (NF == 3) { print $0 }
        else { 
            if (prepend) {                 # a partial line was carried over from the previous record
                $0 = prepend" "$0          # prepend it exactly once
                prepend = ""
            }
            while ((getline nl) > 0) {     # processing each next line
                if (nl !~ /^[0-9]+\|/) {   # if it's not a regular line (with starting order digit i.e. `1|`)
                    $0 = $0" "nl;          # append to previous line 
                    gsub(/[[:space:]]+/," ",$0)  # remove redundant spaces
                } 
                else {
                    if (nl !~ /.+\|.+\|.+/) { # if a loop ends up with line which starts with order number 
                                              # but hasn't enough fields
                        prepend = nl
                        print $0
                    } 
                    else {
                        prepend = ""
                        print $0 RS nl        # next line is a regular valid line
                    } 
                    break
                }
            }
        }
    }
}

Usage:

awk -f rearrange_fields.awk yourfile

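Answer 2 (score: 0)

A gawk-only solution, using a regular expression for RS together with gawk's built-in RT variable. (For a different number of fields, change the {2} to one less than the number of fields.)

$ gawk -v RS="[^|]+([|][^|]+){2}\n" '{ gsub("\n", " ", RT); print RT}' f
No|Name|sal
1|abc|4500
2|gkdjkh|554
3|fgh cvb|678
4|tyu|789
5|ghl tyu|5677
6|yyui tyui uui|780
7|tpo|567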
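The advice about adjusting {2} can be automated. The following is only a sketch, assuming gawk is installed and that the first line of the file is a complete header; the function name join_with_rt is made up for illustration and is not part of the answer. It counts the header's fields, builds the same RS pattern from that count, and additionally trims the whitespace left at the end of each joined record.

```shell
# join_with_rt FILE  (hypothetical helper name, not from the answer)
join_with_rt() {
    # Assumption: the first line is a complete header, so its field count
    # is the field count of every record.
    nf=$(head -n 1 "$1" | awk -F'|' '{ print NF }')

    # Same RS pattern as in the answer, with the repetition count derived from the header.
    gawk -v RS="[^|]+([|][^|]+){$((nf - 1))}\n" '{
        gsub(/[[:space:]]*\n/, " ", RT)   # fold each newline (and blanks before it) into one space
        sub(/ $/, "", RT)                 # drop the space that replaced the final newline
        print RT
    }' "$1"
}
```

Called as join_with_rt f, this should behave like the one-liner for a three-field file while adapting to files whose header has a different number of fields.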

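Answer 3 (score: 0)

awk is well suited to this problem, but I found a solution using sed and grep.
The hard part is handling the lines that have no | delimiter. You can join those lines to the previous line (\d008 and \r are characters that do not occur in the input):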

sed 's/^[^|]*$/\d008&\d008/' inputfile | tr '\n' '\r' |
   sed -r "s/\r\d008([^\d008]*)\d008/\1/g" |
   tr '\r' '\n'

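Now we can join all lines into one long string (replacing \n with a marker that the next grep needs) and extract the desired substrings. Use -P so that grep understands the special character \r: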

sed 's/^[^|]*$/\d008&\d008/' inputfile | tr '\n' '\r' |
   sed -r "s/\r\d008([^\d008]*)\d008/\1/g" |
   grep -Po "([^|]*\|){2}[^|\r]*" |
   tr -d '\r'

The solution above was too slow for the OP (and it is complicated as well), but it is much faster than using a while loop:

while IFS= read -r line; do
   # process header, determine nr of pipes
   if [ -z "${slashes}" ]; then           
      slashes=${line//[^|]}               
      n_slashes=${#slashes}               
      printf "%s\n" "${line}"             
      lastslashes=0                       
      continue
   fi
   # You have to print previous line when you have the required fields
   # and the next line has new fields
   new_slashes=${line//[^|]}
   n_new_slashes=${#new_slashes}
   if (( ${n_new_slashes} + ${lastslashes} > ${n_slashes} )); then
      printf "%s\n" "${last}"
      last="${line}"
      lastslashes=${n_new_slashes}
   else
      # Append new line to last one
      last="${last}${line}"
      ((lastslashes+=n_new_slashes))
   fi
done < inputfile
echo "${last}"

The prototype above can serve as inspiration for an awk solution:

awk -F '|' 'NR==1 {
        nfields=NF;
        lastfields=1;   # the empty "last" buffer counts as one (empty) field
        print
        next
        }
   NF+lastfields-1 > nfields { print last;last=$0; lastfields=NF; next }
   {lastfields+=NF-1} # Joining merges two fields into one, so subtract 1
   {last=last $0}
   END {print last}
   ' inputfile
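The same field-counting idea can also produce exactly the layout asked for in the question, with a single space at each join. This is a sketch under a few assumptions that are not in the answer above: trailing whitespace on every line is treated as noise, empty lines are skipped, and the function name join_records is invented for illustration.

```shell
# join_records FILE...  (hypothetical helper name, not from the answer)
join_records() {
    awk -F'|' '
        { sub(/[ \t\r]+$/, "") }                 # trim trailing blanks; this re-splits the line
        NR == 1 { nfields = NF; print; next }    # header: remember how many fields a record has
        $0 == "" { next }                        # assumption: empty lines carry no data
        !have { buf = $0; have = NF; next }      # first piece of the first record
        have + NF - 1 > nfields {                # this line would overflow the record, so it starts a new one
            print buf; buf = $0; have = NF; next
        }
        { buf = buf " " $0; have += NF - 1 }     # continuation: joining merges two fields into one
        END { if (have) print buf }              # flush the last record
    ' "$@"
}
```

It is called as join_records inputfile. Like the awk program above, it relies on the header line being complete, because the header defines how many fields each record must have.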