Using grep / awk / sed to remove regex matches and non-alphabetic characters

Time: 2014-11-17 21:42:52

Tags: regex unix awk sed grep

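I am formatting a language corpus as text input for a phrase-generation model. Right now the corpus is essentially one long text file, and the relevant lines look like this: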

*EXP:   I didn't understand what you said .
*CHI:   I know [!] &=laugh (.) .

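I can already use grep to get all the lines that start with '*'. What I want to do is strip the header (five characters plus a tab, i.e. *EXP: or *CHI: or whatever it is) from each of those lines and remove all non-alphabetic characters such as brackets, parentheses and periods. The one exception is the apostrophe: for this model only, I need apostrophes converted to the '@' symbol. I would also like to get rid of the tokens that begin with '&', because they are non-word utterances. So my target output would look like this: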

I didn@t understand what you said

I know

I am new to Unix text manipulation, so I would appreciate any help!

5 answers:

Answer 0: (score: 1)

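You can use cut to remove the prefix. The offset 9 matches the sample as displayed, where the header occupies eight columns; if the header is really five characters followed by a single tab, as the question describes, cut -f 2- (which splits on tabs by default) is the safer choice. For example: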

$ cat corpus.txt | cut -c 9-
I didn't understand what you said .
I know [!] &=laugh (.) .

Then, to remove the non-word tokens, you can use sed like this:

$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g'
I didn't understand what you said .
I know [!]  (.) .

Finally, to remove the non-alphabetic symbols and convert the apostrophes to @, you can pipe the result through sed in two further steps:

$ cat corpus.txt | cut -c 9- | sed 's/\&[^ ]*//g' | sed "s/[^a-zA-Z ']//g" | sed "s/'/@/g"
I didn@t understand what you said
I know
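
The three steps above can also be folded into one pipeline with a single sed call. The following is only a sketch under a few assumptions: clean_corpus is a hypothetical helper name, the header is assumed to be separated from the text by a real tab (hence cut -f 2-), and the last two sed expressions, which squeeze repeated spaces and trim a trailing space, are an addition that the answer itself does not perform.

```shell
#!/bin/sh
# clean_corpus is a hypothetical helper; it reads corpus lines on stdin.
clean_corpus() {
    grep '^\*' |    # keep only the speaker lines
    cut -f 2- |     # drop the header, assuming a tab follows it
    sed -e 's/&[^ ]*//g' \
        -e "s/[^a-zA-Z ']//g" \
        -e "s/'/@/g" \
        -e 's/  */ /g' \
        -e 's/ $//'
}

# Two sample lines taken from the question, built with a real tab.
printf '%s\t%s\n' \
    '*EXP:' "I didn't understand what you said ." \
    '*CHI:' 'I know [!] &=laugh (.) .' | clean_corpus
```

On a real file this would be called as clean_corpus < corpus.txt. Merging the sed calls saves two processes but does not change the order of operations: the & tokens still have to go before the punctuation is stripped, otherwise the word after &= would survive.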

Answer 1: (score: 1)

Using Perl:

perl -lne '
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1;
        s/[^\w\s\047]//g;
        s/\047/@/g;
        print
    }
' file

Explanation:

perl -lne ' # -n wraps the code in an implicit while (<>) { ... } loop
    # match the lines of interest and capture the part
    # after the header in group 1:
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1; # assign the captured group to the default variable $_
        s/[^\w\s\047]//g; # delete everything except word characters, whitespace and the single quote
        s/\047/@/g; # replace each single quote with @
        print # print the modified line
    }
' file

Output (note that the &=laugh token only loses its punctuation, so the word laugh survives):

I didn@t understand what you said 
I know  laugh 

Answer 2: (score: 1)

This might work for you (GNU sed):

sed 's/^.....\t//;s/&\S\+//g;y/'\''/\n/;s/[[:punct:]]//g;y/\n/@/' file

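This removes the front of the line, removes the utterance tokens, replaces single quotes with newlines, removes punctuation, and finally replaces the newlines with @. The newline serves as a temporary placeholder: @ is itself a punctuation character, so converting the quotes to @ directly would let the [[:punct:]] step delete them again.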
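
To see why the detour through a newline matters, the small sketch below runs the same line through both orderings. The variable names (line, naive, safe) are made up for illustration, and the sample text is the first line from the question with its header already removed.

```shell
#!/bin/sh
line="I didn't understand what you said ."

# Naive order: @ is itself punctuation, so the second command deletes it.
naive=$(printf '%s\n' "$line" | sed "s/'/@/g; s/[[:punct:]]//g")

# Placeholder order: park the apostrophe as a newline, which [[:punct:]]
# does not match, then turn the newline into @ at the very end.
safe=$(printf '%s\n' "$line" | sed "y/'/\n/; s/[[:punct:]]//g; y/\n/@/")

printf 'naive: %s\nsafe:  %s\n' "$naive" "$safe"
```

The naive ordering yields didnt, while the placeholder ordering yields didn@t. A newline is a convenient placeholder because sed reads input one line at a time, so the pattern space never contains one unless the script puts it there.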

Answer 3: (score: 0)

GNU awk 4.1

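#!/usr/bin/awk -f
# join() comes from the join library shipped with gawk
@include "join"
# only process lines that start with a literal *
/^\*/ {
  gsub(/'/, "@")                          # apostrophes become @
  gsub(/&=\S+/, "")                       # drop the &= tokens
  gsub(/[^[:alnum:][:blank:]@]/, "")      # drop the remaining punctuation
  split($0, foo)                          # split the cleaned line into words
  print join(foo, 2, NF)                  # skip the header word, print the rest
}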
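
If gawk is not available, roughly the same idea can be sketched without the join library or the \S shorthand. This variant makes its own assumptions: clean_corpus_awk is a hypothetical name, the header is assumed to be separated from the text by a tab so that the text is field 2, digits are dropped along with punctuation, and the final two substitutions tidy up leftover spaces.

```shell
#!/bin/sh
# clean_corpus_awk is a hypothetical helper; it reads corpus lines on stdin.
clean_corpus_awk() {
    awk -F '\t' '/^\*/ {
        text = $2                          # everything after the header tab
        gsub(/&[^ ]*/, "", text)           # drop tokens that start with &
        gsub("\047", "@", text)            # \047 is the apostrophe
        gsub(/[^a-zA-Z@ ]/, "", text)      # keep letters, @ and spaces only
        gsub(/ +/, " ", text)              # squeeze runs of spaces
        sub(/ $/, "", text)                # trim one trailing space
        print text
    }'
}

printf '%s\t%s\n' \
    '*EXP:' "I didn't understand what you said ." \
    '*CHI:' 'I know [!] &=laugh (.) .' | clean_corpus_awk
```

Writing the apostrophe as the octal escape \047 keeps the awk program free of literal single quotes, which would otherwise end the shell's quoted string.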

Answer 4: (score: 0)

sed -n "
# select lines that start with the header pattern *AAA:<tab>
/^\*[A-Z]\{3\}:\t/ {
# remove the header; an empty pattern reuses the last regex
   s///
# change each quote to @
   y/'/@/
# remove the &= tokens
   s/\&=[^ ]*//g
# remove everything except letters, spaces and @ (digits may need to be kept too)
   s/[^a-zA-Z@ ]//g
# print only those lines
   p
   }" YourFile

POSIX version (so use --posix with GNU sed). It can be turned into a one-liner by removing the comments and replacing the newlines with ; where needed.
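
As a rough illustration of that one-line form, the sketch below wraps it in a hypothetical strip_headers function and runs it on two sample lines from the question. It assumes the interval is closed as \{3\} and that a space is kept in the final bracket expression so the words stay separated; it also relies on \t in the address, which GNU sed understands but which other sed implementations may not.

```shell
#!/bin/sh
# strip_headers is a hypothetical wrapper around the one-line sed script.
strip_headers() {
    sed -n "/^\*[A-Z]\{3\}:\t/{s///;y/'/@/;s/&=[^ ]*//g;s/[^a-zA-Z@ ]//g;p;}" "$@"
}

# Build a small sample file with real tabs after the headers.
tmp=$(mktemp)
printf '%s\t%s\n' \
    '*EXP:' "I didn't understand what you said ." \
    '*CHI:' 'I know [!] &=laugh (.) .' > "$tmp"

strip_headers "$tmp"
rm -f "$tmp"
```

The spaces that surrounded the deleted tokens are left in place, so the output lines can end in one or more blanks; a further s/ *$// before the p would trim them if that matters for the model input.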