Question

I have some Unicode / UTF-8 text files produced by third-party Windows software, each containing about ten columns of data.

The header row is tab-delimited. The remaining rows, however, are space-delimited (not tab-delimited!), as seen when opening a file in Notepad++ or TextWrangler.

Here are the first four lines of one file, as an example:

x       y   z(ns)       z(cm)   z-abs(cm)   longitude-  E   latitude-   N   type_of_object  description
 728243.03     5993753.83    0             0             0             143.537779835969           -36.1741232463362           linestart     DRIVEWAYGRAVEL
 728242.07     5993756.02    0             0             0             143.537768534943           -36.1741037476109           line          DRIVEWAYGRAVEL
 728242.26     5993756.11    0             0             0             143.537770619485           -36.1741028922293           linestart     DRIVEWAYGRAVEL

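(N.B. the space at the start of each line, except the header line.)

I am trying to write a Bash script to reformat the data for import into another Windows program.

(I know I could do this on the Windows command line, but I have no experience with that, so I would rather copy the files to my Debian machine and write the script in Bash. This means the input and output files need to be Windows-compatible, but the script itself can obviously run on Linux.)

I need to do the following:

1. Extract the first two columns (the x and y coordinates), comma-delimited, but only for rows that contain "rectangle" in the second-to-last column.
2. Append a 1 or a 0 to the end of each line. The first line should get 1, lines 2-4 should get 0, line 5 should get 1, lines 6-8 should get 0, and so on. That is, every fourth line (starting from the first) should get a 1 and all other lines a 0.

So the output file should look like this: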

728257.89,5993759.24,1
728254.83,5993758.54,0
728251.82,5993762.4,0
728242.45,5993765.07,0

I tried the answer to this question. For example:

awk '
NR==1{
    for(i=1;i<=NF;i++)
        if($i!="z(ns)")
            cols[i]
}
{
    for(i=1;i<=NF;i++)
        if(i in cols)
            printf "%s ",$i
    printf "\n"
}' input.file > output.file

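...to delete the third column (and then variations of this to eliminate the other unwanted columns). However, all I was left with was an empty output file.

I also tried hacking a solution together using grep and awk: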

touch output.txt
count=0
IFS=$'\n'
set -f #disable globbing
for i in $( grep "rectangle" $inputFile )
do
    Xcoord=$(awk 'BEGIN { FS=" " } { print $1 }' $i )
    printf "$Xcoord" >> output.txt
    echo ","
    Ycoord=$(awk 'BEGIN { FS=" " } { print $2 }' $i )
    printf "$Ycoord" >> output.txt
    printf ","
    count=$((count+1))
    if [[ count = "1" ]]
    then
        printf "$count\n" >> output.txt
    else
        printf "0\n" >> output.txt
    fi
done
set +f #re-enable globbing for future use of the terminal.

...the idea behind this being: for each line in $inputFile that contains "rectangle":

1. Append the first column (variable "Xcoord") to output.txt
2. Append a comma to output.txt
3. Append the second column (variable "Ycoord") to output.txt
4. Append another comma to output.txt
5. Append the 1 or 0 as per the if test based on the value of the variable "count", along with a new line.

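This idea failed. Instead of saving the data to the file, it printed all of the file's columns to stdout, with the first column replaced by the text "(No such file or directory)":

...and output.txt was left almost empty:

How can I fix this?
Do I need to do anything to make the resulting output.txt file Windows-formatted?

Thanks in advance...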

Answer 1

I think awk can do everything you need in one line:

 awk -F '[[:space:]][[:space:]]+' 'BEGIN{OFS = ","} {if ($8 == "rectangle") print $1, $2 }' a.txt | awk 'BEGIN{OFS = ","}{if((NR+3)%4) print $0,0;else print $0,1}'

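You set the delimiter between entries to "at least two whitespace characters" with

-F '[[:space:]][[:space:]]+'

set the output separator to a comma with

BEGIN{OFS = ","}

check the rectangle condition on the second-to-last column with

if ($8 == "rectangle")

and print the columns you want in the output with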

print $1, $2
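As a quick check of this first stage, here is a minimal sketch run on three made-up rows. The coordinates are borrowed from the question, but the other column values and the variable names `sample` and `stage1` are assumptions for illustration. One thing it highlights: because the separator is a regex rather than awk's default, the single leading space on each row is not stripped and stays attached to `$1`, so the sketch trims it with `sub()`. It also assumes an awk that understands POSIX character classes such as `[[:space:]]` (gawk does).

```shell
# Made-up rows in the question's layout: one leading space, runs of spaces between fields.
sample=$(printf ' %s     %s    0    0    0    143.5    -36.1    %s     DEMO\n' \
  728257.89 5993759.24 rectangle \
  728243.03 5993753.83 linestart \
  728254.83 5993758.54 rectangle)

# Same first stage as above, plus a sub() that trims the leading blank from $1.
stage1=$(printf '%s\n' "$sample" | awk -F '[[:space:]][[:space:]]+' '
  BEGIN { OFS = "," }
  $8 == "rectangle" { sub(/^[[:space:]]+/, "", $1); print $1, $2 }')

printf '%s\n' "$stage1"
```

Only the two rectangle rows should come out, as x,y pairs with no leading space.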

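To add the 0/1 pattern as the third output column, this command runs awk a second time on the filtered result, so that the line numbers refer to the output lines rather than to the original input lines. The awk variable NR holds the current line number, starting at 1.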

(NR+3)%4

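For line numbers 1, 5, 9, ... this expression (% is the modulo operator) evaluates to 0, which awk treats as false. So the if branch, taken for every other line, prints the complete line (the variable $0) followed by 0, and the else branch prints it followed by 1: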

print $0,0;else print $0,1
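To see the pattern in isolation, the sketch below feeds eight dummy lines (the letters a to h, chosen purely for illustration) through the same second stage. The variable name `flags` is an assumption.

```shell
# Eight placeholder lines; only their position (NR) matters here.
flags=$(printf '%s\n' a b c d e f g h |
  awk 'BEGIN { OFS = "," } { if ((NR+3) % 4) print $0, 0; else print $0, 1 }')

printf '%s\n' "$flags"
```

Lines a and e (positions 1 and 5) get a 1, and the other six get a 0.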

Hope this is what you wanted.
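For comparison, the same result can probably be had in a single awk pass by counting matching rows in a separate variable instead of relying on NR. This is an alternative sketch, not the command above. It assumes awk's default field splitting (which treats any run of blanks or tabs as one separator and ignores leading blanks), that type_of_object is always the 8th field, and made-up sample rows. The names `rows`, `result`, `n` and `flag` are illustrative.

```shell
# Made-up input: a tab-separated header, then space-separated rows with a leading space.
rows=$(
  printf 'x\ty\tz(ns)\tz(cm)\tz-abs(cm)\tlongitude-E\tlatitude-N\ttype_of_object\tdescription\n'
  printf ' %s     %s    0    0    0    143.5    -36.1    %s     DEMO\n' \
    728257.89 5993759.24 rectangle \
    728243.03 5993753.83 linestart \
    728254.83 5993758.54 rectangle \
    728251.82 5993762.4  rectangle \
    728242.45 5993765.07 rectangle \
    728240.00 5993770.00 rectangle
)

result=$(printf '%s\n' "$rows" | awk '
  BEGIN { OFS = "," }
  $8 == "rectangle" {
    flag = (n % 4 == 0) ? 1 : 0   # n counts matching rows seen so far
    n++
    print $1, $2, flag
  }')

printf '%s\n' "$result"
```

With a real file, the awk program would read the file directly and redirect to the output file rather than going through shell variables.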

Answer 2

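I worked out a solution:

1. Delete the header line.
2. Use grep to keep only the lines containing the word "rectangle".
3. Replace spaces with commas to make the lines easier to process.
4. Loop over each line and write the required fields to the output file.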

#!/bin/bash
#Code here to retrieve the file from command arguments and set it as $inputFile (removed for brevity)
# Remove the header line and any leading spaces, without modifying the input file itself.
sed '1d; s/^ *//' "$inputFile" > work.txt
tr -s ' ' <work.txt | tr ' ' ',' >work2.txt #Switch spaces for commas.
grep "rectangle" work2.txt > work3.txt #Print all lines containing "rectangle" in them to new file.
rm -f lineout.txt #Delete output file in case script was run previously (-f: no error if it does not exist).
touch lineout.txt
count=0
while IFS='' read -r line || [[ -n "$line" ]]; do
    printf '%s\n' "$line" > line.txt   # '%s' keeps any % in the data from being read as a format
    awk 'BEGIN { FS="," } { printf "%s", $1  >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    awk 'BEGIN { FS="," } { printf "%s", $2  >> "lineout.txt" }' line.txt
    printf "," >> lineout.txt
    count=$((count + 1))
    if [[ $count = "1" ]]
    then
        printf "$count\n" >> lineout.txt
    else
        printf "0\n" >> lineout.txt
        if [[ $count = "4" ]]
        then
            count=0
        fi
    fi
done < work3.txt
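The script above writes Unix (LF) line endings, and the question about Windows format is left open. If the target program turns out to need CRLF, one option is a final conversion step, sketched below. The scratch directory and the output name lineout_win.txt are assumptions for illustration; the two sample lines stand in for the script's real output.

```shell
tmp=$(mktemp -d)   # scratch directory so the demo does not touch real files
printf '728257.89,5993759.24,1\n728254.83,5993758.54,0\n' > "$tmp/lineout.txt"

# Strip any carriage return already present, then end every line with CR LF.
awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$tmp/lineout.txt" > "$tmp/lineout_win.txt"

# od -c makes the \r \n pairs at the line ends visible.
od -c "$tmp/lineout_win.txt" | head -n 3
```

Where the dos2unix package is installed, its unix2dos tool does the same job.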

Answer 3

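This can easily be formatted with an advanced text editor that offers the following features:

Multiple selection
Vertical (column) selection
Search and replace of text using bash-like expressions

I'm not trying to advertise Sublime Text, but this tool has certainly solved most of my text-editing problems.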

Extracting a few space-delimited fields from a file with mixed delimiters into another file in Bash
