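At one stage of my code I have a function that receives a problem made up of two or three nested lists. Each nested list holds entries of the form [word_form:word_tag]. An example of a two-sentence problem is:
P1 (input):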
['italian:JJ', ['an:DT'], ['became:VB', ['world:NN', ['the:DT'],
['s:PO', ['tenor:NN', ['greatest:JJ']]]]], ['.:.']]
H (input):
['was:VX', ['there:EX'], ['an:DT'], ['italian:JJ'], ['became:VB',
['who:WP'], ['world:NN', ['the:DT'], ['s:PO', ['tenor:NN',
['greatest:JJ']]]]], ['.:.']]
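For every nested list whose tag is one of ['NN', 'VB', 'JJ'], I want to replace the word form with a variable such as X, Y, Z and so on.
If the sentence H shares a word with P1 or P2, that word must get the same variable name everywhere. For example, if ['italian:JJ'] in H becomes ['X:JJ'], then 'italian' must also become 'X' in P1 or P2 (if it occurs there).
So far I have only managed to change the form into a variable, and my variables are not X, Y, Z. I simply did this: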
if tag in ['NN', 'VB', 'JJ']:
    form = form.upper()+'-0'
This changes the form 'italian' to 'ITALIAN-0', but I would prefer the variables to be [X, Y, Z, ... etc.].
So the desired output looks like this:
P1 (output):
['X:JJ', ['an:DT'], ['Y:VB', ['Z:NN', ['the:DT'],
['s:PO', ['A:NN', ['B:JJ']]]]], ['.:.']]
H (output):
['was:VX', ['there:EX'], ['an:DT'], ['X:JJ'], ['Y:VB',
['who:WP'], ['Z:NN', ['the:DT'], ['s:PO', ['A:NN',
['B:JJ']]]]], ['.:.']]
Similarly, a three-sentence problem such as:
P1 (input):
['want:VB', ['men:NN', ['every:DT'], ['italian:JJ']], ['be:VX',
['to:TO'], ['a:DT'], ['great:JJ', ['tenor:VB']]]]
P2 (input):
['are:VX', ['men:NN', ['some:DT'], ['italian:JJ']], ['great:JJ'],
['tenor:VB']]
H (input):
['are:VX', ['there:EX'], ['italian:JJ'], ['men:NN', ['want:VB',
['who:WP'], ['be:VX', ['to:TO'], ['a:DT'], ['great:JJ', ['tenor:VB']]]]]]
becomes:
P1 (output):
['Z:VB', ['Y:NN', ['every:DT'], ['X:JJ']], ['be:VX',
['to:TO'], ['a:DT'], ['A:JJ', ['B:VB']]]]
P2 (output):
['are:VX', ['men:NN', ['some:DT'], ['X:JJ']], ['A:JJ'],
['B:VB']]
H (output):
['are:VX', ['there:EX'], ['X:JJ'], ['Y:NN', ['Z:VB',
['who:WP'], ['be:VX', ['to:TO'], ['a:DT'], ['A:JJ', ['B:VB']]]]]]
Answer 0 (score: 1)
To answer this question, I am going to rewrite your WORD instances as tuples. Your first example becomes:
p1 = [('italian', 'JJ'),
      [('an', 'DT')],
      [('became', 'VB'),
       [('world', 'NN'),
        [('the', 'DT')],
        [('s', 'PO'), [('tenor', 'NN'), [('greatest', 'JJ')]]]]],
      [('.', '.')]]
h = [('was', 'VX'),
     [('there', 'EX')],
     [('an', 'DT')],
     [('italian', 'JJ')],
     [('became', 'VB'),
      [('who', 'WP')],
      [('world', 'NN'),
       [('the', 'DT')],
       [('s', 'PO'), [('tenor', 'NN'), [('greatest', 'JJ')]]]]],
     [('.', '.')]]
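If the data arrives as 'form:tag' strings, as in the question, the conversion to tuples does not have to be done by hand. The following is a minimal sketch of one way to do it; `to_tuples` and `to_strings` are hypothetical helper names, and it assumes every leaf is a plain string containing at least one colon.

```python
def to_tuples(tree):
    """Convert a nested list of 'form:tag' strings into (form, tag) tuples."""
    result = []
    for node in tree:
        if isinstance(node, str):
            # rpartition splits on the LAST colon, so a form that itself
            # contains ':' keeps its colon and only the tag is cut off.
            form, _, tag = node.rpartition(':')
            result.append((form, tag))
        else:
            result.append(to_tuples(node))
    return result


def to_strings(tree):
    """Inverse of to_tuples: turn (form, tag) tuples back into 'form:tag'."""
    return [
        '%s:%s' % node if isinstance(node, tuple) else to_strings(node)
        for node in tree
    ]
```

Calling `to_strings` on the relabelled tree gives back the string layout used in the question.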
Let's extract the list of word forms that p1 and h have in common. We'll define a very simple recursive generator that walks the tree in order:
def flatten(l):
    for x in l:
        if isinstance(x, tuple):
            yield x
        else:
            for y in flatten(x):
                yield y
Note: change tuple to WORD in your own code.
We can use it to get the words that p1 and h have in common:
>>> common_words = set(x[0] for x in flatten(h)) & set(x[0] for x in flatten(p1))
>>> common_words
{'.', 'an', 'became', 'greatest', 'italian', 's', 'tenor', 'the', 'world'}
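Note: change x[0] here to x.form. This can be extended to get the word forms in, for example, (p1 | p2) & h. Filtering, for instance on POS tags, can be done inside the generator expression: set(x[0] for x in flatten(h) if x[1] in ['NN', 'VB', 'JJ']).
Label these words with some unique string value: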
>>> import itertools
>>> labels = dict((x, chr(y)) for x, y in
...               zip(common_words, itertools.count(ord('A'))))
>>> labels
{'.': 'H',
'an': 'A',
'became': 'C',
'greatest': 'D',
'italian': 'I',
's': 'B',
'tenor': 'E',
'the': 'G',
'world': 'F'}
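Because a set has no defined order, which word ends up with which letter in the dictionary above can differ between runs. If reproducible labels matter, one option is to hand out letters in order of first appearance in one of the trees. This is only a sketch under that assumption; `ordered_labels` and its `start` parameter are hypothetical names, and `flatten` is repeated so the block runs on its own.

```python
import itertools


def flatten(tree):
    """Yield every (form, tag) tuple in the nested list, left to right."""
    for node in tree:
        if isinstance(node, tuple):
            yield node
        else:
            for leaf in flatten(node):
                yield leaf


def ordered_labels(tree, common_words, start='A'):
    """Map each common word to a letter, in order of first appearance."""
    labels = {}
    letters = (chr(code) for code in itertools.count(ord(start)))
    for form, _tag in flatten(tree):
        # Only label shared words, and only the first time they are seen
        if form in common_words and form not in labels:
            labels[form] = next(letters)
    return labels
```

For example, with h = [('was', 'VX'), [('italian', 'JJ')], [('became', 'VB')]] and common words {'italian', 'became'}, 'italian' is labelled 'A' and 'became' is labelled 'B' on every run.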
Now we just need to replace the instances of these words in h and p1.
We'll build another simple recursive function:
def apply_labels(l, labels):
    rv = []
    for x in l:
        if isinstance(x, tuple):
            if x[0] in labels:
                rv.append((labels[x[0]], x[1]))
            else:
                rv.append(x)
        else:
            rv.append(apply_labels(x, labels))
    return rv
Then:
>>> apply_labels(h, labels)
[('was', 'VX'),
[('there', 'EX')],
[('A', 'DT')],
[('I', 'JJ')],
[('C', 'VB'),
[('who', 'WP')],
[('F', 'NN'), [('G', 'DT')], [('B', 'PO'), [('E', 'NN'), [('D', 'JJ')]]]]],
[('H', '.')]]
Rinse and repeat with p1:
>>> apply_labels(p1, labels)
[('I', 'JJ'),
[('A', 'DT')],
[('C', 'VB'),
[('F', 'NN'), [('G', 'DT')], [('B', 'PO'), [('E', 'NN'), [('D', 'JJ')]]]]],
[('H', '.')]]
Here is your second example, again expressed as tuples:
p1 = [('want', 'VB'),
      [('men', 'NN'), [('every', 'DT')], [('italian', 'JJ')]],
      [('be', 'VX'),
       [('to', 'TO')],
       [('a', 'DT')],
       [('great', 'JJ'), [('tenor', 'VB')]]]]
p2 = [('are', 'VX'),
      [('men', 'NN'), [('some', 'DT')], [('italian', 'JJ')]],
      [('great', 'JJ')],
      [('tenor', 'VB')]]
h = [('are', 'VX'),
     [('there', 'EX')],
     [('italian', 'JJ')],
     [('men', 'NN'),
      [('want', 'VB'),
       [('who', 'WP')],
       [('be', 'VX'),
        [('to', 'TO')],
        [('a', 'DT')],
        [('great', 'JJ'), [('tenor', 'VB')]]]]]]
We proceed like this:
>>> def wordset(l):
...     return set(x[0] for x in flatten(l) if x[1] in ['NN', 'VB', 'JJ'])
>>> common_words = wordset(h) & (wordset(p1) | wordset(p2))
>>> common_words
{'great', 'italian', 'men', 'tenor', 'want'}
>>> labels = dict(zip(common_words,
... (chr(x) for x in itertools.count(ord('Z'), -1))))
>>> labels
{'great': 'Y', 'italian': 'Z', 'men': 'X', 'tenor': 'V', 'want': 'W'}
>>> apply_labels(p1, labels)
[('W', 'VB'),
[('X', 'NN'), [('every', 'DT')], [('Z', 'JJ')]],
[('be', 'VX'), [('to', 'TO')], [('a', 'DT')], [('Y', 'JJ'), [('V', 'VB')]]]]
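The steps above can be folded into a single helper that takes the hypothesis and any number of premises. The sketch below is one possible arrangement, not the only one: `relabel` and `CONTENT_TAGS` are hypothetical names, letters are handed out in order of first appearance in the hypothesis, and a word is only replaced when its own tag is one of the content tags.

```python
import itertools

CONTENT_TAGS = ('NN', 'VB', 'JJ')


def flatten(tree):
    """Yield every (form, tag) tuple in the nested list, left to right."""
    for node in tree:
        if isinstance(node, tuple):
            yield node
        else:
            for leaf in flatten(node):
                yield leaf


def relabel(hypothesis, *premises):
    """Return (new_hypothesis, [new_premise, ...]) with shared words relabelled."""
    def content_words(tree):
        return set(form for form, tag in flatten(tree) if tag in CONTENT_TAGS)

    # Words of the hypothesis that also occur in at least one premise
    premise_words = set()
    for premise in premises:
        premise_words |= content_words(premise)
    common = content_words(hypothesis) & premise_words

    # Assign 'A', 'B', ... in order of first appearance in the hypothesis
    labels = {}
    letters = (chr(code) for code in itertools.count(ord('A')))
    for form, tag in flatten(hypothesis):
        if tag in CONTENT_TAGS and form in common and form not in labels:
            labels[form] = next(letters)

    def apply(tree):
        out = []
        for node in tree:
            if isinstance(node, tuple):
                form, tag = node
                if tag in CONTENT_TAGS and form in labels:
                    node = (labels[form], tag)
                out.append(node)
            else:
                out.append(apply(node))
        return out

    return apply(hypothesis), [apply(premise) for premise in premises]
```

It would be called as `new_h, (new_p1, new_p2) = relabel(h, p1, p2)`. The input trees are left unmodified, since `apply` builds new lists.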
Answer 1 (score: 0)
Assuming I understand your question, you can write a function
next_var = 'A'
var_dict = {}
def var_map(s):
    global next_var
    if s in var_dict:
        return var_dict[s]
    var_dict[s] = next_var
    # Advance from the current label, not from 'A', so each new string gets a fresh letter
    next_var = chr(ord(next_var) + 1)
    return var_dict[s]
that maps objects uniquely to strings. next_var advances each time var_map is called with a string it has not seen before.
You can then call it on each instance of a string. If you have more than 26 variables, I can change the way next_var is updated.
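For the case of more than 26 variables, one possible approach is to continue with two-letter names after 'Z', the way spreadsheet columns are named. This is only a sketch of that idea, not part of the answer's code; `label_names` and `VarMap` are hypothetical names.

```python
import itertools
import string


def label_names():
    """Yield 'A', ..., 'Z', then 'AA', 'AB', ... without end."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_uppercase, repeat=length):
            yield ''.join(letters)


class VarMap(object):
    """Hand out one label per distinct string and remember earlier answers."""

    def __init__(self):
        self._names = label_names()
        self._seen = {}

    def __call__(self, s):
        # A new string consumes the next unused label; a known one reuses its label
        if s not in self._seen:
            self._seen[s] = next(self._names)
        return self._seen[s]
```

Keeping the state inside an object avoids the module-level globals, so each problem can get its own fresh `VarMap()`.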