Question

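I am formatting a language corpus as text input for a phrase-generation model. Right now the corpus is essentially one long text file whose relevant lines look like this: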

*EXP:   I didn't understand what you said .
*CHI:   I know [!] &=laugh (.) .

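I can already use grep to get all the lines that start with '*'. What I want to do is strip the 5-character + tab header from each of those lines (removing *EXP: or *CHI: or whatever) and delete all non-alphabetic characters such as brackets, parentheses, and periods. The only exception is the apostrophe: I need to convert apostrophes to the '@' symbol just for this model. I would also like to get rid of tokens that start with '&', since they are non-word utterances. So my target output would look like this: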

I didn@t understand what you said

I know

I am new to Unix text manipulation, so I would appreciate any help!

Answer 1

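You can remove the prefix with cut (9 matches the sample as displayed, where the tag is followed by three spaces; if the tag is followed by a single tab, use 7 instead), for example: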

$ cat corpus.txt | cut -c 9-
I didn't understand what you said .
I know [!] &=laugh (.) .

Then, to remove the non-word tokens, you can use sed like this:

$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g'
I didn't understand what you said .
I know [!]  (.) .

Finally, to remove the non-alphabetic symbols and convert apostrophes to @, you can pipe the result through sed in two further steps:

$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g' | sed "s/[^a-zA-Z ']//g" | sed "s/'/@/g"
I didn@t understand what you said
I know
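
The steps above can be folded into one reusable function. This is only a sketch: `clean_corpus` is a hypothetical name, the grep filter is the one the question already mentions, and the tag pattern is an assumption that accepts either a tab or spaces after the colon instead of a fixed column count. It also squeezes and trims the leftover spaces, which the pipeline above leaves in place.

```shell
# Hypothetical helper: reads corpus lines on stdin, writes cleaned utterances.
clean_corpus() {
  grep '^\*' |                       # keep only the speaker lines
    sed 's/^[^:]*:[[:space:]]*//' |  # drop the "*EXP:" style tag and the whitespace after it
    sed 's/&[^ ]*//g' |              # drop &-prefixed non-word tokens
    sed "s/[^a-zA-Z ']//g" |         # keep only letters, spaces, and apostrophes
    sed "s/'/@/g" |                  # apostrophe -> @
    sed 's/  */ /g; s/ *$//'         # squeeze runs of spaces, trim the end
}
```

Usage would be something like `clean_corpus < corpus.txt > model_input.txt`, where both file names are placeholders.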

Answer 2

Using perl:

perl -lne '
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1;
        s/&\S+//g;
        s/[^\w\s\047]//g;
        s/\047/@/g;
        print
    }
' file

Explanation:

perl -lne ' # using -n is like while (<>) {}
    # regex to match the criteria, with a capturing group for
    # the interesting trailing part:
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1; # assign the captured group to the default variable $_
        s/&\S+//g; # drop the &-prefixed non-word tokens
        s/[^\w\s\047]//g; # delete everything except word chars, whitespace, and single quotes
        s/\047/@/g; # replace single quotes with @
        print # print the modified line
    }
' file

Output (trailing spaces are left in place):

I didn@t understand what you said
I know

Answer 3

This might work for you (GNU sed):

sed 's/^.....\t//;s/&\S\+//g;y/'\''/\n/;s/[[:punct:]]//g;y/\n/@/' file

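Strip the front of the line, remove the &-prefixed utterance tokens, replace single quotes with newlines, delete punctuation, and then replace the newlines with @.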
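
The detour through newlines matters because `[[:punct:]]` matches both the apostrophe and `@`, so converting first and stripping afterwards would lose the marker. A small sketch of the difference, using two hypothetical helper names and only the `y` and `s` commands:

```shell
# Hypothetical helper names; both read text on stdin.

# Naive order: apostrophes become @, then [[:punct:]] deletes the @ again.
naive_strip() {
  sed "s/'/@/g; s/[[:punct:]]//g"
}

# Placeholder order: park apostrophes as newlines (a character that cannot
# occur inside a single input line), strip punctuation, then turn the
# parked newlines into @.
placeholder_strip() {
  sed -e "y/'/\n/" -e 's/[[:punct:]]//g' -e 'y/\n/@/'
}
```

Feeding `(don't stop.)` to each helper shows the effect: the naive version drops the apostrophe entirely, while the placeholder version keeps it as `@`.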

Answer 4

GNU awk 4.1

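#!/usr/bin/awk -f
# join() comes from the join library file shipped with gawk
@include "join"
# only process lines that start with a literal *
/^\*/ {
  # apostrophes become @ before punctuation is stripped
  gsub(/'/, "@")
  # drop &= tokens such as &=laugh
  gsub(/&=\S+/, "")
  # keep only alphanumerics, blanks, and @
  gsub(/[^[:alnum:][:blank:]@]/, "")
  # field 1 is now the bare tag (e.g. CHI); print fields 2..NF
  split($0, foo)
  print join(foo, 2, NF)
}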
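
For an awk that lacks gawk's `@include` and `\S`, the same idea can be sketched without the join library. Everything here is an assumption rather than part of the answer above: `clean_with_awk` is a hypothetical name, the tag is removed with `sub()` instead of by skipping the first field, the apostrophe is passed in through `-v` to avoid quoting problems, and only letters (not digits) are kept.

```shell
# Hypothetical helper: reads corpus lines on stdin, writes cleaned utterances.
clean_with_awk() {
  awk -v q="'" '
    /^\*/ {
      sub(/^[^:]*:[ \t]*/, "")     # drop the "*EXP:" style tag
      gsub(q, "@")                 # apostrophe -> @
      gsub(/&[^ \t]*/, "")         # drop &-prefixed tokens
      gsub(/[^a-zA-Z @]/, "")      # keep letters, spaces, and @
      gsub(/  +/, " ")             # squeeze runs of spaces
      sub(/ +$/, "")               # trim trailing spaces
      print
    }'
}
```

Removing the tag explicitly means the utterance keeps its original word spacing until the final squeeze, rather than depending on how fields are re-split.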

Answer 5

sed -n "
# select lines that start with the pattern *AAA:<Tab>
/^\*[A-Z]\{3\}:\t/ {
# remove the tag; the empty regex reuses the last search pattern
   s///
# change apostrophes to @
   y/'/@/
# remove &= tokens
   s/\&=[^ ]*//g
# remove everything that is not a letter, @, or space (maybe digits should be kept as well?)
   s/[^a-zA-Z@ ]//g
# print only those lines
   p
   }" YourFile

POSIX version (so use --posix with GNU sed). It can be turned into a one-liner by removing the comments and replacing the newlines with ; where needed.
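
A sketch of what that one-line form might look like, wrapped in a hypothetical function named `clean_one_line`. Two details are assumptions on my part: `[[:space:]]*` stands in for `\t` so the tag may be followed by a tab or spaces, and a space is included in the final bracket expression so that words stay separated.

```shell
# Hypothetical helper: the commented script collapsed onto one line.
clean_one_line() {
  sed -n "/^\*[A-Z]\{3\}:[[:space:]]*/{s///;y/'/@/;s/&=[^ ]*//g;s/[^a-zA-Z@ ]//g;p;}"
}
```

It reads from standard input, so it could be used as `clean_one_line < YourFile`. Lines without a `*AAA:` tag are not printed because of `-n`.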

Using grep / awk / sed to remove a regex and non-alphabetic characters
