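I am writing an NER tagger with nltk and Mallet. I have to convert input data between two formats, neither of which I can change.
The data is essentially words with associated tags for supervised learning, but it is split into sentences, hence the list of lists.
The first format is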
tuple(list(list(word)),list(list(tag)))
and the second format is
list(list(tuple(word,tag)))
Currently I convert it like this (format 2 => format 1):
([[tup[0] for tup in sent] for sent in train_set],
[[tup[1] for tup in sent] for sent in train_set])
Sample data:
[[('Steve','PERSON'),('runs','NONE'),('Apple','ORGANIZATION')],[('Today','NONE'),('is','NONE'),('June','DATETIME'),('27th','DATETIME')]]
and the expected output:
([['Steve', 'runs', 'Apple' ],['Today','is','June','27th']],
[['PERSON','NONE','ORGANIZATION'],['NONE','NONE','DATETIME','DATETIME']])
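I need to convert in both directions.
Edit: I don't necessarily want it to be shorter. Please suggest a better (and more readable) way in Python 2.7, with code examples.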
Answer 0 (score: 2)
Converting tuple(list(list(word)),list(list(tag))) to list(list(tuple(word,tag))) is very simple:
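def convert(data_structure):
    # data_structure is format 1: (sentences, tags)
    sentences, tags = data_structure
    container = []
    for i in xrange(len(sentences)):
        # Pair each word with its tag; zip returns a list in Python 2.7
        container.append(zip(sentences[i], tags[i]))
    return container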
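The index-based loop above can also be written by walking both outer lists in parallel. This is only a minimal sketch: the name `to_tagged_sentences` is hypothetical (it is not part of the answer), and `list(zip(...))` is used so the snippet should behave the same on Python 2.7 and Python 3.

```python
def to_tagged_sentences(data_structure):
    """Format 1 -> format 2: (sentences, tags) -> [[(word, tag), ...], ...]."""
    sentences, tags = data_structure
    # Walk both outer lists in parallel, then pair words with tags per sentence.
    return [list(zip(words, sentence_tags))
            for words, sentence_tags in zip(sentences, tags)]


columns = ([['Steve', 'runs', 'Apple'], ['Today', 'is', 'June', '27th']],
           [['PERSON', 'NONE', 'ORGANIZATION'],
            ['NONE', 'NONE', 'DATETIME', 'DATETIME']])
tagged = to_tagged_sentences(columns)
```

Keep in mind that `zip` silently truncates to the shorter input, so a sentence whose word and tag lists differ in length would lose items without raising an error.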
Converting in the other direction takes slightly longer code, but it is not complicated if you just use nested for loops:
def convert(data_structure):
    sentences = []
    tags = []
    for sentence in data_structure:
        sentence_words = []
        sentence_tags = []
        for word, tag in sentence:
            sentence_words.append(word)
            sentence_tags.append(tag)
        sentences.append(sentence_words)
        tags.append(sentence_tags)
    return (sentences, tags)
The code could probably be shortened, but hopefully the general principle is clear.
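One possible way to shorten the inner loop is to let `zip(*sentence)` split each sentence into its words and its tags. The sketch below is an illustration rather than the answer's own code: `to_columns` is a hypothetical name, and the guard for empty sentences is an assumption about inputs the question does not describe.

```python
def to_columns(tagged_sentences):
    """Format 2 -> format 1: [[(word, tag), ...], ...] -> (sentences, tags)."""
    sentences, tags = [], []
    for sentence in tagged_sentences:
        # zip(*sentence) transposes [(word, tag), ...] into (words, tags).
        # An empty sentence has nothing to transpose, so fall back to two
        # empty tuples (assumed behaviour, not specified in the question).
        words, sentence_tags = zip(*sentence) if sentence else ((), ())
        sentences.append(list(words))
        tags.append(list(sentence_tags))
    return sentences, tags


sample = [[('Steve', 'PERSON'), ('runs', 'NONE'), ('Apple', 'ORGANIZATION')],
          [('Today', 'NONE'), ('is', 'NONE'),
           ('June', 'DATETIME'), ('27th', 'DATETIME')]]
sentences, tags = to_columns(sample)
```

The outer loop is kept explicit on purpose, since the question asks for readability rather than brevity.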
Answer 1 (score: 1)
You can convert the inner tuples to iterators (using iter) and then call next inside a nested list comprehension:
lis = [[('Steve','PERSON'),('runs','NONE'),('Apple','ORGANIZATION')],
[('Today','NONE'),('is','NONE'),('June','DATETIME'),('27th','DATETIME')]]
it = [[iter(y) for y in x] for x in lis]
n = len(lis[0][0])  # Number of fields per tuple (word and tag), i.e. passes needed.
print [[[next(x) for x in i] for i in it] for _ in range(n)]
Output:
[[['Steve', 'runs', 'Apple'], ['Today', 'is', 'June', '27th']],
[['PERSON', 'NONE', 'ORGANIZATION'], ['NONE', 'NONE', 'DATETIME', 'DATETIME']]]
Answer 2 (score: 0)
I think the correct solution would be this one:
>>> data = [[('Steve','PERSON'),('runs','NONE'),('Apple','ORGANIZATION')],[('Today','NONE'),('is','NONE'),('June','DATETIME'),('27th','DATETIME')]]
>>> tuple([ map(list, (zip(*x))) for x in data ])
([['Steve', 'runs', 'Apple'], ['PERSON', 'NONE', 'ORGANIZATION']], [['Today', 'is', 'June', '27th'], ['NONE', 'NONE', 'DATETIME', 'DATETIME']])
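Note that the result shown above is grouped per sentence: each element holds one sentence's words followed by its tags. That differs from the expected output in the question, which puts all word lists first and all tag lists second. If that format is needed, transposing once more across sentences appears to be enough. A sketch that should run on both Python 2.7 and Python 3; the names `per_sentence` and `result` are illustrative only.

```python
data = [[('Steve', 'PERSON'), ('runs', 'NONE'), ('Apple', 'ORGANIZATION')],
        [('Today', 'NONE'), ('is', 'NONE'),
         ('June', 'DATETIME'), ('27th', 'DATETIME')]]

# First transpose: each sentence becomes [words, tags].
per_sentence = [[list(column) for column in zip(*sentence)]
                for sentence in data]

# Second transpose: collect all word lists together and all tag lists together.
result = tuple(list(group) for group in zip(*per_sentence))
```

For an empty `data` list this yields an empty tuple rather than a pair of empty lists, so that case would need separate handling.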