Question

I have a text file in this format:

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375 Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375 aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375 abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

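I call the first string, the one before the first space, the word (for example abacası).

The string that starts after the first space and ends with a number is the definition (for example Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875).

What I want to do: if a line contains more than one definition (the first line has one, the second has two, the third has three), insert a line break before each additional definition and put the first string (the word) at the beginning of the new line. Expected output: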

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

My text file has about 1,500,000 lines, and the number of definitions per line is not fixed. It can be anywhere from 1 to 5.

Answer 1

A small Python script does the job. The input is in input.txt and the output goes to output.txt.

import re

rf = re.compile(r'(\S+)\s')            # the word before the first space
r = re.compile(r'(\S+\s:\s\d+\.\d+)')  # one "key : value" definition

with open("input.txt", "r") as f:
    text = f.read()

with open("output.txt", "w") as f:
    for l in text.split('\n'):
        match = rf.match(l)
        if not match:
            continue  # blank line, nothing to split
        first = match.group(1)
        offset = match.end()
        while True:
            match = r.search(l, offset)
            if not match:
                break
            # Resume after this match; match.end() also accounts for the
            # separating space that precedes each later definition.
            offset = match.end()
            f.write(first + " " + match.group(1) + "\n")

Answer 2

I am assuming the following format:

word definitionkey : definitionvalue [definitionkey : definitionvalue …]

None of these elements contains a space, and they are always separated by a single space.

The following code should work:

awk '{ for (i=2; i<=NF; i+=3) print $1, $i, $(i+1), $(i+2) }' file

Explanation (this is the same code, but with comments and more whitespace):

awk '
  # match any line
  {
    # iterate over each "key : value"
    for (i=2; i<=NF; i+=3)
      print $1, $i, $(i+1), $(i+2)  # prints each "word key : value"
  }
' file

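awk has a few tricks you may not be familiar with. It works line by line. Each block has an optional condition (awk 'NF >=4 {…}' would make sense here, since a line with fewer than four fields cannot hold a complete definition). NF is the number of fields, and the dollar sign ($) means we want the value of the given field, so $1 is the value of the first field, $NF is the value of the last field, and $(i+1) is the value of the third field (assuming i=2). By default, print puts a space between its arguments and adds a newline at the end (otherwise we would need printf "%s %s %s %s\n", $1, $i, $(i+1), $(i+2), which is a bit harder to read).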

Answer 3

Using perl:

perl -a -F'[^]:]\K\h' -ne 'chomp(@F);$p=shift(@F);print "$p ",shift(@F),"\n" while(@F);' yourfile.txt

Using bash:

while read -r line
do
    pre=${line%% *}
    echo "$line" | sed 's/\([0-9]\) /\1\n'$pre' /g'
done < "yourfile.txt"

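This script reads the file line by line. For each line, the prefix (everything up to the first space) is extracted with parameter expansion, and sed then replaces every space that follows a digit with a newline followed by the prefix.

Edit: as tripleee suggested, doing everything in sed is much faster: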

sed -i.bak ':a;s/^\(\([^ ]*\).*[0-9]\) /\1\n\2 /;ta' yourfile.txt
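
For readers who prefer Python, the same idea (a space preceded by a digit becomes a newline plus the word) can be sketched roughly as follows. The function name and the shortened sample line are made up for illustration, and the sketch assumes, as the shell versions do, that a digit followed by a space only ever occurs at the end of a definition.

```python
import re


def split_after_numbers(line):
    """Replace each space that follows a digit with newline + word + space."""
    # The word is everything before the first space (like ${line%% *} in bash).
    word = line.split(" ", 1)[0]
    # (?<=[0-9]) is a lookbehind: the digit must be there but is not consumed.
    # A function is used as the replacement so that the word is inserted
    # literally, whatever characters it contains.
    return re.sub(r"(?<=[0-9]) ", lambda m: "\n" + word + " ", line)


# Hypothetical, shortened sample line in the question's format.
sample = "abacı Abaç[Noun]+[A3sg] : 16.3 Aba[Noun]-CH[Noun+Agt] : 23.0"
print(split_after_numbers(sample))
```

Note that a trailing space after the last number would produce an extra line holding only the word, just as it would with the sed substitution.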

Answer 4

Assuming each definition always consists of 4 space-separated words:

awk '{for (i=1; i<NF; i+=4) print $i, $(i+1), $(i+2), $(i+3)}' file

Or, if you want to split after the floating-point number:

perl -pe 's/\b\d+\.\d+\K\s+(?=\S)/\n/g' file

(This is equivalent to Avinash's answer.)

Answer 5

Bash and grep:

#!/bin/bash

while IFS=' ' read -r in1 in2 in3 in4; do
    if [[ -n $in4 ]]; then
        prepend="$in1"
        echo "$in1 $in2 $in3 $in4"
    else
        echo "$prepend $in1 $in2 $in3"
    fi
done < <(grep -o '[[:alnum:]][^:]\+ : [[:digit:].]\+' "$1")

The output of grep -o puts every definition on a separate line, but definitions that come from the same input line are missing the "word" at the beginning:

abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

The while loop now iterates over this output, using a space as the input field separator. If in4 is a zero-length string, we are on a line that is missing the "word", so we prepend it.

The script takes the input file name as its argument, and the output can be saved to a file with a simple redirection.
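
The "remember the last word and prepend it" logic can also be sketched in Python. This is only an illustration under the same assumptions as the grep pattern: the function name, the regular expression (an approximation of the grep one) and the shortened sample text are mine.

```python
import re

# Roughly the grep -o pattern: an alphanumeric character, then anything
# without a colon (or newline), then " : " and a run of digits and dots.
DEFINITION = re.compile(r"[^\W_][^:\n]+ : [\d.]+")


def carry_word(text):
    """Return one "word key : value" string per definition found in text."""
    prepend = ""
    result = []
    for match in DEFINITION.findall(text):
        parts = match.split(" ")
        if len(parts) == 4:
            # Four fields: the match starts with the word, so remember it.
            prepend = parts[0]
            result.append(match)
        else:
            # Three fields: the word is missing, so put the last one in front.
            result.append(prepend + " " + match)
    return result


# Hypothetical, shortened sample in the question's format.
sample = "abacası Abaca[Noun] : 20.1\nabacı Abaç[Noun] : 16.3 Aba[Noun] : 23.0\n"
for line in carry_word(sample):
    print(line)
```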

Answer 6

Using perl:

$ perl -nE 'm/([^ ]*) (.*)/; my $word=$1; $_=$2; say $word . " " . $_ for / *(.*?[0-9]+\.[0-9]+)/g;' < input.log

Output:
abacası Abaca[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 20.1748046875
abacı Abaç[Noun]+[Prop]+[A3sg]+SH[P3sg]+[Nom] : 16.3037109375
abacı Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+[A3sg]+[Pnon]+[Nom] : 23.0185546875
abacılarla Aba[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 27.8974609375
abacılarla aba[Noun]+[A3sg]+[Pnon]+[Nom]-CH[Noun+Agt]+lAr[A3pl]+[Pnon]+YlA[Ins] : 23.3427734375
abacılarla abacı[Noun]+lAr[A3pl]+[Pnon]+YlA[Ins] : 19.556640625

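Explanation:

Split the line, separating off the first field as the word.
Then split the rest of the line using the regular expression .*?[0-9]+\.[0-9]+.
Print the word joined with each match of that regular expression.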
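
The three steps above translate fairly directly to Python's re module. The following is a rough sketch rather than a drop-in replacement; the function name and the shortened sample line are hypothetical.

```python
import re


def expand(line):
    """Return one "word definition" string per definition on the line."""
    # Step 1: split off the first field as the word.
    m = re.match(r"([^ ]*) (.*)", line)
    if m is None:
        return []  # no space on the line, so nothing to split
    word, rest = m.group(1), m.group(2)
    # Step 2: each match is the shortest run of text that ends in a decimal
    # number; the leading " *" swallows the space between definitions.
    definitions = re.findall(r" *(.*?[0-9]+\.[0-9]+)", rest)
    # Step 3: join the word with every match.
    return [word + " " + d for d in definitions]


# Hypothetical, shortened sample line in the question's format.
for out in expand("abacı Abaç[Noun] : 16.3 Aba[Noun] : 23.0"):
    print(out)
```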

Answer 7

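I would go with one of the excellent Awk answers here; but I am posting a Python solution to point out some oddities and problems in the currently accepted answer:

It reads the entire input file into memory before processing it. This is harmless for small inputs, but the OP mentions that the real-world input is large.
It needlessly uses re when simple whitespace tokenization appears to be sufficient.

I would also prefer a tool that prints to standard output, so that I can redirect it wherever I want from the shell; but to stay compatible with the earlier solution, this hard-codes output.txt as the destination file.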

with open('input.txt', 'r') as infile:
  with open('output.txt', 'w') as outfile:
    for line in infile:
      tokens = line.split()
      if not tokens:
        continue  # skip blank lines
      word = tokens[0]
      # Python 3: range, not xrange; each definition is three tokens long.
      for idx in range(1, len(tokens), 3):
        print(word, ' '.join(tokens[idx:idx+3]), file=outfile)
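
As a sketch of the standard-output variant mentioned above, the same loop can take any iterable of lines and any writable stream. The function name, the guard against short lines and the in-memory demo input are assumptions made for illustration.

```python
import io
import sys


def expand_lines(lines, out=None):
    """Write one "word key : value" line per definition to out."""
    if out is None:
        out = sys.stdout  # resolved at call time, so redirection works
    for line in lines:
        tokens = line.split()
        if len(tokens) < 4:
            continue  # blank or incomplete line: no full definition
        word = tokens[0]
        # Each definition is three tokens long: key, ":" and value.
        for idx in range(1, len(tokens), 3):
            print(word, " ".join(tokens[idx:idx + 3]), file=out)


# Demo on an in-memory, shortened sample; a real script would pass sys.stdin.
demo = io.StringIO("abacı Abaç[Noun] : 16.3 Aba[Noun] : 23.0\n")
expand_lines(demo)
```

With expand_lines(sys.stdin) as the entry point, the shell decides where the data comes from and goes, and the input is still processed one line at a time.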

If you really, really want to do this in pure Bash, I suppose you could:

set -f  # disable globbing: the analyses contain [ and ] characters
while read -r word analyses; do
    set -- $analyses
    while [ $# -gt 0 ]; do
        printf "%s %s %s %s\n" "$word" "$1" "$2" "$3"
        shift; shift; shift
    done
done <input.txt >output.txt

Answer 8

Please find the bash code below:

    #!/bin/bash
    # read.sh
    while read variable
    do
            for i in "$variable"
            do
                    var=`echo "$i" |wc -w`
                    array_1=( $i )
                    counter=0
                    for((j=1 ; j < $var ; j++))
                    do
                            if [ $counter = 0 ]  #1
                            then
                                    echo -ne ${array_1[0]}' '
                            fi #1
                            echo -ne ${array_1[$j]}' '
                            counter=$(expr $counter + 1)
                            if [ $counter = 3 ] #2
                            then
                                    counter=0
                                    echo
                            fi #2
                    done
            done
    done

I have tested it and it works. To try it, enter the following command at the bash shell prompt:

     $ ./read.sh < input.txt > output.txt

where read.sh is the script, input.txt is the input file, and output.txt is where the output is written.

Answer 9

Here is sed in action:

sed -r '/^indirger(ken|di)/{s/([0-9]+[.][0-9]+ )(indirge)/\1\n\2/g}' my_file

Output:

indirgerdi indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]+YDH[Past] : 22.2626953125 
indirge[Verb]+[Pos]+Hr[Aor]+YDH[Past]+[A3sg] : 18.720703125
indirgerken indirge[Verb]+[Pos]+Hr[Aor]+[A3sg]-Yken[Adv+While] : 19.6201171875

Copy the first string to the second line.
