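This is closely related to a previous question, but I have realized that my goal is considerably more complicated:
I have a sentence: "Forbes Asia 200 Best Under 500 Billion 2011"
I have tokens like this: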
oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011']
An earlier parser works out the indices of the tokens where a location slot or a number slot should go:
numberTokenIDs = {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00}
locationTokenIDs = {(0, 1): u'Forbes Asia'}
The token IDs are the indices of the tokens that make up a location or a number. The goal is to get a new set of tokens such as:
newTokens = [u'Asia', u'200', u'Best', u'Under', u'500', u'2011']
perhaps together with new number and location token IDs (to avoid index-out-of-bounds exceptions):
numberTokenIDs = {(5,): 2011.0, (1,): 200.0, (4,): 500000000000.00}
locationTokenIDs = {(0,): u'Forbes Asia'}
Basically, I want to go through the new, reduced token set and end up with a new sentence:
"LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT"
by walking the new token set and replacing the token at each token ID with "LOCATION_SLOT" or "NUMBER_SLOT". If I do this with the current number and location token IDs, I get:
"LOCATION_SLOT LOCATION_SLOT NUMBER_SLOT Best Under NUMBER_SLOT NUMBER_SLOT NUMBER_SLOT".
How can I do this?
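For reference, the duplicated sentence above can be reproduced with a naive replacement that overwrites every index of every tuple. This is only a minimal sketch of the problem, not a fix; the function name `naive_replace` and the snake_case variable names are made up for illustration.

```python
old_tokens = ['Forbes', 'Asia', '200', 'Best', 'Under', '500', 'Billion', '2011']
number_token_ids = {(7,): 2011.0, (2,): 200.0, (5, 6): 500000000000.00}
location_token_ids = {(0, 1): 'Forbes Asia'}


def naive_replace(tokens, location_ids, number_ids):
    """Overwrite every index of every span with its slot name (hypothetical helper)."""
    result = list(tokens)  # work on a copy so the input is left untouched
    for span in location_ids:
        for i in span:
            result[i] = "LOCATION_SLOT"
    for span in number_ids:
        for i in span:
            result[i] = "NUMBER_SLOT"
    return " ".join(result)


naive_sentence = naive_replace(old_tokens, location_token_ids, number_token_ids)
print(naive_sentence)
```

Because a two-token span such as (0, 1) is written twice, each multi-token entity shows up as two slots instead of one.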
Another example:
Location token IDs are: (0, 1)
Number token IDs are: (3, 4)
Old sampleTokens [u'United', u'Kingdom', u'USD', u'1.240', u'billion']
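I want to delete the extra tokens and, at the same time, change the location and number token IDs so that I can do replacements in the sentence like this: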
sampleTokens[numberTokenID] = "NUMBER_SLOT"
sampleTokens[locationTokenID] = "LOCATION_SLOT"
so that the tokens after replacement are [u'LOCATION_SLOT', u'USD', u'NUMBER_SLOT']
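Once every ID is a single index, the replacement step itself is short. The sketch below uses the reduced tokens and remapped IDs from the first example; the helper name `fill_slots` is hypothetical, and it assumes every key is a one-element tuple.

```python
new_tokens = ['Asia', '200', 'Best', 'Under', '500', '2011']
new_number_ids = {(5,): 2011.0, (1,): 200.0, (4,): 500000000000.00}
new_location_ids = {(0,): 'Forbes Asia'}


def fill_slots(tokens, location_ids, number_ids):
    """Replace the token at each one-element ID with its slot name (hypothetical helper)."""
    result = list(tokens)
    for (i,) in location_ids:  # unpacking fails loudly if a key has more than one index
        result[i] = "LOCATION_SLOT"
    for (i,) in number_ids:
        result[i] = "NUMBER_SLOT"
    return result


slotted = fill_slots(new_tokens, new_location_ids, new_number_ids)
print(" ".join(slotted))
```

The remaining difficulty is therefore only the remapping from multi-index tuples to single indices.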
Answer 0: (score: 1)
Not a very elegant solution, but it works:
oldTokens = [u'Forbes', u'Asia', u'200', u'Best', u'Under', u'500', u'Billion', u'2011']
numberTokenIDs = {(7,): 2011.0, (2,): 200.0, (5,6): 500000000000.00}
locationTokenIDs = {(0, 1): u'Forbes Asia'}
newTokens = []
newnumberTokenIDs = {}
newlocationTokenIDs = {}
new_ind = 0
skip = False
for ind in range(len(oldTokens)):
    if skip:
        # Second token of a two-token span: already represented, drop it
        skip = False
        continue
    for loc_ind in locationTokenIDs.keys():
        if ind in loc_ind:
            # Keep the last token of the span (was oldTokens[ind+1],
            # which picks the wrong token for a one-token location)
            newTokens.append(oldTokens[loc_ind[-1]])
            newlocationTokenIDs[(new_ind,)] = locationTokenIDs[loc_ind]
            new_ind += 1
            if len(loc_ind) > 1:  # Skip next position if there are 2 elements in a tuple
                skip = True
            break
    else:
        # Not a location token: check the number spans
        for num_ind in numberTokenIDs.keys():
            if ind in num_ind:
                newTokens.append(oldTokens[ind])
                newnumberTokenIDs[(new_ind,)] = numberTokenIDs[num_ind]
                new_ind += 1
                if len(num_ind) > 1:
                    skip = True
                break
        else:
            # Ordinary token: copy it over unchanged
            newTokens.append(oldTokens[ind])
            new_ind += 1
newTokens
Out[37]: [u'Asia', u'200', u'Best', u'Under', u'500', u'2011']
newnumberTokenIDs
Out[38]: {(1,): 200.0, (4,): 500000000000.0, (5,): 2011.0}
newlocationTokenIDs
Out[39]: {(0,): u'Forbes Asia'}
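The loop above only skips one position, so it handles spans of at most two tokens. A possible generalisation is sketched below, assuming spans never overlap. The function name `collapse_spans` is hypothetical, and it keeps the first token of each span as a placeholder (so the first example yields u'Forbes' rather than u'Asia'), which does not matter once the slot names are written in. The values attached to the second example's IDs are illustrative, since the question only lists the indices.

```python
def collapse_spans(tokens, location_ids, number_ids):
    """Collapse each multi-token span to one token and remap IDs to single indices."""
    sources = {"location": location_ids, "number": number_ids}
    starts = {}      # first index of a span -> (kind, span)
    covered = set()  # every index that belongs to some span
    for kind, ids in sources.items():
        for span in ids:
            starts[min(span)] = (kind, span)
            covered.update(span)

    new_tokens = []
    new_ids = {"location": {}, "number": {}}
    for i, token in enumerate(tokens):
        if i in starts:
            kind, span = starts[i]
            # The span's new ID is the position its placeholder is about to take
            new_ids[kind][(len(new_tokens),)] = sources[kind][span]
            new_tokens.append(token)
        elif i not in covered:
            new_tokens.append(token)  # ordinary token, copied unchanged
        # otherwise: a later token of a span, dropped
    return new_tokens, new_ids["location"], new_ids["number"]


# Second example from the question; the dictionary values are illustrative
sample_tokens = ['United', 'Kingdom', 'USD', '1.240', 'billion']
new_sample, loc_ids, num_ids = collapse_spans(
    sample_tokens, {(0, 1): 'United Kingdom'}, {(3, 4): 1240000000.0})

slotted_sample = list(new_sample)
for (i,) in loc_ids:
    slotted_sample[i] = "LOCATION_SLOT"
for (i,) in num_ids:
    slotted_sample[i] = "NUMBER_SLOT"
print(slotted_sample)
```

Recording the new index at the moment the placeholder is appended avoids any separate offset bookkeeping, and it works for spans of any length.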