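I have a sentence, together with tuples that mark the token positions of a location (a country or region) and of numbers:
sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."
And then:
tokenIDs2number = {(22,): 592.00, (25,): 92630.00, (34,): 7734.00}
tokenIDs2location = {(8,9): 'Hong Kong'}
For the different combinations of these tuples, I need to create the corresponding sentences, which I call slot sentences: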
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.
However, my current code essentially takes combinations of the elements inside the tuples, so I end up with two sentences such as:
In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
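as an example.
How can I fix this so that, when I have a tuple key with len > 1, all the tokens covered by that key are filled by a single LOCATION or NUMBER slot, as I want?
Current code: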
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        for locationTokenID in locationTokenIDs:
            for numberTokenID in numberTokenIDs:
                finalTokens = cleanSample.split()
                finalTokens[numberTokenID] = "NUMBER_SLOT"
                finalTokens[locationTokenID] = "LOCATION_SLOT"
                slotSentence = (" ").join(finalTokens)
                sentenceDict["parsedSentence"] = slotSentence
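Note that I have to create a dictionary that also keeps track of the location-value pair and the original sentence for each slot-sentence combination. The key part is generating the correct slotSentence.
Note also that this is just an example: the number could even be 24000000 where the value in the sentence is written as 24 million, and the same goes for trillion, million, billion and thousand.
If this is not possible, another option is to fill in every slot in the combination: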
In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
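and then perhaps post-process the sentence to remove consecutive slots, but my preference is to do it all in one pass.
Answer 0 (score: 0)
The code treats each locationTokenID as a slot of its own, when the locationTokenIDs actually represent the endpoints of a slice of tokens that should be treated as one slot. So we need to remove the for locationTokenID in locationTokenIDs: loop (which iterates over each locationTokenID as if it were a slot) and instead replace the slice of words defined by that pair of locationTokenIDs with a single slot.
The following code fixes the problem raised in the OP, but other problems remain (for example, only the last generated slotSentence is kept; I'll let you fix that, since I don't know what data structure you want to store the slot sentences in):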
sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."
tokenIDs2number = {(21,): 592, (24,): 92630, (30,): 7734}
tokenIDs2location = {(7,8): 'Hong Kong'}
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        for numberTokenID in numberTokenIDs:
            finalTokens = sample.split()
            finalTokens[numberTokenID] = "NUMBER_SLOT"
            # Assign a one-element list: assigning a bare string to a slice
            # would insert one token per character ("L O C A T I O N ...").
            finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
            slotSentence = (" ").join(finalTokens)
            sentenceDict["parsedSentence"] = slotSentence
            print(slotSentence)
Output:
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.
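This can be extended to work with locations and numbers that contain any number of spaces. We do this by making both numberTokenIDs and locationTokenIDs tuples of length 2 that specify an inclusive range of tokens for each location/number: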
sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo."
tokenIDs2number = {(22,22): '592', (25,26): '92 630', (32,33): '7 734'}
tokenIDs2location = {(7,9): 'Hong Kong Central'}
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # Replace each slice with a one-element list so the slot stays a single token.
        # The number slice comes after the location slice, so replacing it first
        # leaves the location indices unchanged.
        finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
        finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
        slotSentence = (" ").join(finalTokens)
        print(slotSentence)
Output:
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.
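The slice replacement described above can also be wrapped in a small helper that keeps every slot sentence rather than only the last one. This is only a sketch under stated assumptions: the helper name fill_slots and the results list are hypothetical (they are not from the OP), and spans are assumed to be inclusive, 0-based, non-overlapping (start, end) token ranges. Spans are replaced from right to left so that shrinking one slice never shifts the indices of a slice further left.

```python
def fill_slots(tokens, spans):
    """Return a copy of tokens with each inclusive (start, end) span
    replaced by a single slot label.

    spans is a list of ((start, end), label) pairs (hypothetical format);
    the spans must not overlap.
    """
    filled = list(tokens)
    # Process the right-most span first so earlier indices stay valid.
    for (start, end), label in sorted(spans, reverse=True):
        filled[start:end + 1] = [label]  # one-element list, not a bare string
    return filled


sample = ("In the first 11 months of 2004 Hong Kong Central 's international "
          "airport at Chek Lap Kok handled daily an average of 592 flights , "
          "92 630 passengers , and more than 7 734 tons of cargo.")
tokenIDs2number = {(22, 22): '592', (25, 26): '92 630', (32, 33): '7 734'}
tokenIDs2location = {(7, 9): 'Hong Kong Central'}

results = []  # one dictionary per location/number combination
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        tokens = fill_slots(sample.split(),
                            [(locationTokenIDs, "LOCATION_SLOT"),
                             (numberTokenIDs, "NUMBER_SLOT")])
        results.append({
            "sentence": sample,
            "location-value-pair": {location: number},
            "parsedSentence": " ".join(tokens),
        })

for entry in results:
    print(entry["parsedSentence"])
```

Because the helper sorts the spans itself, the caller does not need to know whether the location appears before or after the number in the sentence.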
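Answer 1 (score: 0)
Consider using str.replace() instead of splitting and slicing the sentence string. To do so, you need to render the values in tokenIDs2number with thousands separators, which on Python 2.7+ can be done with format(int, ','), per @JonClements' comment: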
sample = "In the first 11 months of 2004 Hong Kong 's international airport " + \
         "at Chek Lap Kok handled daily an average of 592 flights " + \
         "92,630 passengers , and more than 7,734 tons of cargo."
tokenIDs2number = {(22,): 592, (25,): 92630,(34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}
sentenceList = []
# ITERATE ACROSS A LIST COMPREHENSION FOR ALL POSSIBLE COMBINATIONS
for item in [[s,i,j] for s in [sample] \
                     for i in tokenIDs2location.items() \
                     for j in tokenIDs2number.items()]:
    sentenceDict = {}
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    sentenceDict["parsedSentence"] = sample.replace(item[1][1], 'LOCATION_SLOT').\
                                     replace(format(item[2][1], ','), 'NUMBER_SLOT')
    sentenceList.append(sentenceDict)
Output (of sentenceList)
[{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than NUMBER_SLOT tons of cargo.", 'location-value-pair': {'Hong Kong': 7734}}
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights 92,630 passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 592}}
{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights NUMBER_SLOT passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 92630}}]
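One caveat with str.replace() is that it replaces every occurrence of the substring, including occurrences inside a longer token (for example 592 inside 7,592). A possible way around this, sketched below, is re.sub with lookarounds so that the text only matches when it is delimited by whitespace or the ends of the sentence. The helper name replace_token and the example sentence are made up for illustration; the sketch assumes whitespace-tokenized input like the OP's.

```python
import re


def replace_token(sentence, text, slot):
    """Replace text with slot only where text is delimited by whitespace
    or by the start/end of the sentence (hypothetical helper)."""
    pattern = r"(?<!\S)" + re.escape(text) + r"(?!\S)"
    # A function replacement avoids any escape processing of the slot string.
    return re.sub(pattern, lambda match: slot, sentence)


# Illustrative sentence, not taken from the OP's data.
sentence = "The airport handled 592 flights and 7,592 tons of cargo in Hong Kong ."

# Plain str.replace() also rewrites the tail of "7,592".
print(sentence.replace("592", "NUMBER_SLOT"))

# The token-aware version leaves "7,592" alone.
print(replace_token(sentence, "592", "NUMBER_SLOT"))

# Multi-word values and formatted numbers work the same way.
print(replace_token(sentence, "Hong Kong", "LOCATION_SLOT"))
print(replace_token(sentence, format(7592, ","), "NUMBER_SLOT"))
```

Like str.replace(), this still replaces all matching tokens, so a value that appears twice in one sentence would need the token-index approach instead.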
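Answer 2 (score: 0)
I have solved my use case, but in a roundabout way.
I first allow slot sentences that contain multiple LOCATION_SLOT or NUMBER_SLOT tokens: if one of the tuples in a combination covers 2 or more tokens, I fill in all of them: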
sentences2location2values = []
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        # Start from a fresh token list so slots from earlier combinations do not accumulate.
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"
        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"
        slotSentence = (" ").join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)
Then I post-process the parsed sentences to remove consecutive location and number slots:
for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i,token in enumerate(sampleTokens):
        # Skip a slot token that directly repeats the previous slot token.
        if i>0 and ((token == "LOCATION_SLOT" and sampleTokens[i-1]=="LOCATION_SLOT") or (token == "NUMBER_SLOT" and sampleTokens[i-1]=="NUMBER_SLOT")):
            continue
        else:
            newTokens.append(token)
    sentence['parsedSentence']=(' ').join(newTokens)
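The same collapsing step can be written more compactly with itertools.groupby, which groups consecutive equal tokens. This is just an alternative sketch; the names SLOTS and collapse_slots are hypothetical, and it assumes whitespace-separated tokens as in the code above.

```python
from itertools import groupby

SLOTS = {"LOCATION_SLOT", "NUMBER_SLOT"}


def collapse_slots(sentence):
    """Collapse each run of identical consecutive slot tokens into one token."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOTS:
            collapsed.append(token)   # keep a single copy of the slot
        else:
            collapsed.extend(run)     # keep repeated ordinary tokens untouched
    return " ".join(collapsed)


parsed = ("In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's "
          "international airport handled NUMBER_SLOT flights , 92,630 passengers")
print(collapse_slots(parsed))
```

As with the loop version, two different entities of the same type that happen to be adjacent would be merged into one slot, which is a limitation of filling first and collapsing afterwards.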