Question

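I have been using Python's NLTK for general language parsing, and recently I wanted to create a corpus specifically for translation. I have not been able to make sense of NLTK's corpus options and structure for translation.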

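There is plenty of material on how to read or use corpus resources, but I cannot find any details on creating a translation-style corpus. Browsing the corpus reference shows that there is a wide variety of styles and types, but I cannot seem to find any examples or documentation for translation-specific corpora.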

Answer 1

For translation-like datasets, NLTK can read a corpus of word-aligned sentences using AlignedCorpusReader. The files must have the following format:

first source sentence
first target sentence 
first alignment
second source sentence
second target sentence
second alignment

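This means that tokens are assumed to be separated by whitespace and that each sentence starts on its own line. For example, suppose you have the following directory structure: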

reader.py
data/en-es.txt
data/en-pt.txt

where the file contents are:

# en-es.txt
This is an example
Esto es un ejemplo
0-0 1-1 2-2 3-3

and

# en-pt.txt
This is an example
Esto é um exemplo
0-0 1-1 2-2 3-3
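
Because the format is just three lines per sentence pair, such a file can also be generated from Python. The following is a minimal sketch and not part of NLTK: the helper name write_aligned_corpus and the in-memory layout of (source tokens, target tokens, alignment pairs) are assumptions made for illustration. It also checks that every alignment index points at an existing token.

```python
from pathlib import Path


def write_aligned_corpus(path, sentence_pairs):
    """Write (source_tokens, target_tokens, alignment) triples in the
    three-line layout described above.

    `alignment` is a list of (source_index, target_index) pairs.
    This helper is hypothetical; it is not an NLTK function.
    """
    lines = []
    for source, target, alignment in sentence_pairs:
        for i, j in alignment:
            # Reject indices that do not point at an existing token.
            if not (0 <= i < len(source) and 0 <= j < len(target)):
                raise ValueError(f"alignment {i}-{j} is out of range")
        lines.append(" ".join(source))
        lines.append(" ".join(target))
        lines.append(" ".join(f"{i}-{j}" for i, j in alignment))
    path = Path(path)
    # Create the data directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


pairs = [
    (["This", "is", "an", "example"],
     ["Esto", "es", "un", "ejemplo"],
     [(0, 0), (1, 1), (2, 2), (3, 3)]),
]
write_aligned_corpus("data/en-es.txt", pairs)
```

Tokens are joined with single spaces so that a whitespace-based tokenizer recovers them unchanged.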

You can load this toy example with the following script:

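# reader.py
from nltk.corpus.reader.aligned import AlignedCorpusReader

# The second argument is a regular expression over file names; encoding is
# passed by keyword because the third positional parameter is `sep`.
reader = AlignedCorpusReader('./data', r'.*\.txt', encoding='utf-8')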

for sentence in reader.aligned_sents():
    print(sentence.words)
    print(sentence.mots)
    print(sentence.alignment)

Output

['This', 'is', 'an', 'example']
['Esto', 'es', 'un', 'ejemplo']
0-0 1-1 2-2 3-3
['This', 'is', 'an', 'example']
['Esto', 'é', 'um', 'exemplo']
0-0 1-1 2-2 3-3
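
For a translation corpus, the alignment is usually consumed as pairs of words rather than pairs of indices. The sketch below shows that lookup with plain lists standing in for sentence.words, sentence.mots and sentence.alignment; it assumes the alignment iterates as (source index, target index) pairs, as the printed 0-0 1-1 form suggests. The helper name aligned_word_pairs is hypothetical.

```python
def aligned_word_pairs(words, mots, alignment):
    """Return (source_word, target_word) tuples for each alignment link.

    `alignment` is any iterable of (source_index, target_index) pairs.
    """
    return [(words[i], mots[j]) for i, j in sorted(alignment)]


# Plain-Python stand-ins for sentence.words, sentence.mots and
# sentence.alignment from the first sentence of the toy corpus.
words = ["This", "is", "an", "example"]
mots = ["Esto", "es", "un", "ejemplo"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}

for source_word, target_word in aligned_word_pairs(words, mots, alignment):
    print(source_word, "->", target_word)
```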

The line that constructs the reader creates an AlignedCorpusReader instance that reads the files under './data' whose names match the regular expression given as the second argument, and sets the file encoding to 'utf-8'. AlignedCorpusReader also accepts word_tokenizer and sent_tokenizer parameters, which default to WhitespaceTokenizer() and RegexpTokenizer('\n', gaps=True) respectively.
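
To make the effect of those two tokenizers concrete, here is a rough plain-Python approximation. It is an illustration under the assumption that newline splitting and whitespace splitting behave as described above, not NLTK's actual implementation.

```python
block = "This is an example\nEsto es un ejemplo\n0-0 1-1 2-2 3-3\n"

# Approximates RegexpTokenizer('\n', gaps=True): split on newlines and
# drop empty pieces.
lines = [line for line in block.split("\n") if line]

# Approximates WhitespaceTokenizer(): split on runs of whitespace.
source_tokens = lines[0].split()
target_tokens = lines[1].split()

# The third line holds "i-j" links between token positions.
links = [tuple(int(n) for n in link.split("-")) for link in lines[2].split()]

print(source_tokens)
print(target_tokens)
print(links)
```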

More information can be found in the documentation (1 and 2).

How do I build a translation corpus for Python NLTK?
