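I am formatting a language corpus to use as text input for a phrase-generation model. Right now the corpus is essentially one long text file whose relevant lines look like this: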
*EXP: I didn't understand what you said .
*CHI: I know [!] &=laugh (.) .
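I can already use grep to get all the lines that begin with '*'. What I want to do is take all of those lines, remove the header of 5 characters plus a tab (that is, remove *EXP: or *CHI: or whatever the label is), and delete all non-alphabetic characters such as brackets, parentheses, and periods. The only exception is the apostrophe: I need to convert apostrophes to '@' signs, just for this model. I would also like to get rid of tokens that begin with '&', since they are non-word utterances. So my target output would look like this: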
I didn@t understand what you said
I know
I am new to Unix text manipulation, so I would appreciate any help!
Answer 0 (score: 1)
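You can use cut to remove the prefix (the 5-character label plus the separator after it, 6 characters in total), for example:
$ cat corpus.txt | cut -c 7-
I didn't understand what you said .
I know [!] &=laugh (.) .
Then, to remove the non-word tokens, you can use sed like this:
$ cat corpus.txt | cut -c 7- | sed 's/\&[^ ]*//g'
I didn't understand what you said .
I know [!] (.) .
Finally, to remove the non-alphabetic symbols and convert apostrophes to @, you can pipe the result through two more sed steps:
$ cat corpus.txt | cut -c 7- | sed 's/\&[^ ]*//g' | sed "s/[^a-zA-Z ']//g" | sed "s/'/@/g"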
I didn@t understand what you said
I know
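The three steps above can also be folded into a single helper. The following is only a sketch under stated assumptions: the name clean_corpus is hypothetical, the header is assumed to end with a tab (as the question describes), and the extra whitespace clean-up at the end is an addition that the pipeline above does not perform.

```shell
#!/bin/sh
# clean_corpus: hypothetical helper; reads transcript lines on stdin.
# Assumption: the "*EXP:" style label is followed by a tab character.
clean_corpus() {
    grep '^\*' |    # keep only the speaker lines
    cut -f 2- |     # drop the label, i.e. everything up to the first tab
    sed -e 's/&[^ ]*//g' \
        -e "s/[^a-zA-Z ']//g" \
        -e "s/'/@/g" \
        -e 's/  */ /g' \
        -e 's/^ //' -e 's/ $//'
}

# Example: prints "I know"
printf '*CHI:\tI know [!] &=laugh (.) .\n' | clean_corpus
```

Using cut -f instead of a fixed character count avoids having to count the header width, at the cost of relying on the tab separator being present.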
Answer 1 (score: 1)
Using perl:
perl -lne '
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1;
        s/&\S*//g;
        s/[^\w\s\047]//g;
        s/\047/@/g;
        print
    }
' file
Explanation:
perl -lne '                   # -n wraps the code in while (<>) { }
    # match the speaker header and capture the rest of the line
    # in a capturing group:
    /^\*\w{3}:\s+(.*)/ and do {
        $_ = $1;              # assign the captured group to the default variable $_
        s/&\S*//g;            # drop tokens that start with &, such as &=laugh
        s/[^\w\s\047]//g;     # remove everything except word characters, whitespace and \047 (the apostrophe)
        s/\047/@/g;           # replace the apostrophe with @
        print                 # print the modified line
    }
' file
Output:
I didn@t understand what you said
I know
Answer 2 (score: 1)
This might work for you (GNU sed):
sed 's/^.....\t//;s/&\S\+//g;y/'\''/\n/;s/[[:punct:]]//g;y/\n/@/' file
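Remove the front of the line, remove the utterance tokens, replace single quotes with newlines, remove punctuation, and then replace the newlines with @.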
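The newline placeholder works because a line being edited by sed normally contains no newline of its own, so the apostrophes can be parked there while [[:punct:]] is stripped. The sketch below isolates just that step; the function name strip_punct_keep_apostrophe is hypothetical and not part of the answer above.

```shell
#!/bin/sh
# strip_punct_keep_apostrophe: hypothetical helper; reads lines on stdin.
strip_punct_keep_apostrophe() {
    # 1. park each apostrophe as a newline (never present inside a line)
    # 2. delete all punctuation
    # 3. turn the parked newlines into @
    sed -e "y/'/\n/" -e 's/[[:punct:]]//g' -e 'y/\n/@/'
}

# Example: prints "don@t stop ok"
printf '%s\n' "don't stop. (ok)" | strip_punct_keep_apostrophe
```

Converting the apostrophe to @ first would not work here, because @ is itself matched by [[:punct:]] and would be deleted in the second step.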
Answer 3 (score: 0)
GNU awk 4.1
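#!/usr/bin/awk -f
@include "join"
# process only lines that start with a literal *
/^\*/ {
    gsub(/'/, "@")                        # apostrophes become @
    gsub(/&=\S+/, "")                     # drop &= tokens such as &=laugh
    gsub(/[^[:alnum:][:blank:]@]/, "")    # drop the remaining punctuation
    split($0, foo)
    print join(foo, 2, NF)                # skip field 1, the speaker label
}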
Answer 4 (score: 0)
sed -n "
# select lines that start with the pattern *AAA:<Tab>
/^\*[A-Z]\{3\}:\t/ {
    # remove the header (an empty pattern reuses the last regex)
    s///
    # replace apostrophes with @
    y/'/@/
    # remove & tokens
    s/\&=[^ ]*//g
    # remove everything except letters, spaces and @ (digits may need to be kept too)
    s/[^a-zA-Z@ ]//g
    # print only these lines
    p
}" YourFile
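This is the POSIX version (so use --posix with GNU sed). It can be turned into a one-liner by removing the comments and replacing the newlines with ; where needed.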
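As a rough illustration of that one-line form, here is a sketch wrapped in a shell function. The name one_line_clean is hypothetical. Two details are assumptions of this sketch rather than something the answer states: the interval is written with a closing \}, and a space is kept in the final character class so that words stay separated. The tab is passed in literally through printf instead of relying on \t support in sed.

```shell
#!/bin/sh
# one_line_clean: hypothetical wrapper; takes file names as arguments or reads stdin.
one_line_clean() {
    tab=$(printf '\t')   # literal tab, so the script does not depend on \t in sed
    sed -n "/^\*[A-Z]\{3\}:${tab}/{s///;y/'/@/;s/&=[^ ]*//g;s/[^a-zA-Z@ ]//g;p;}" "$@"
}

# Example: prints "I didn@t understand what you said " (with a trailing space)
printf '*EXP:\tI didn'\''t understand what you said .\n' | one_line_clean
```

The semicolon before the closing brace is kept on purpose, since some sed implementations require a ; or a newline before }.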