Question

I have a sentence, along with tuples that give the token positions of a location or a number in that sentence:

sample = In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo.

Then:

tokenIDs2number = {(22,): 592.00, (25,): 92630.00,(34,): 7734.00}
tokenIDs2location = {(8,9): Hong Kong}

For the different combinations of these tuples I need to create the corresponding sentence variants, which I call slot sentences:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.

However, my current code essentially takes combinations of the elements inside the tuples, so I end up with sentences like these two:

In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

(These are just two examples.)

How can I fix this so that, when a tuple key has len > 1, all the tokens covered by that key are filled with a single LOCATION_SLOT or NUMBER_SLOT, as I want?

Current code:

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        for locationTokenID in locationTokenIDs:
            for numberTokenID in numberTokenIDs:
                finalTokens = cleanSample.split()
                finalTokens[numberTokenID] = "NUMBER_SLOT"
                finalTokens[locationTokenID] = "LOCATION_SLOT"
                slotSentence = " ".join(finalTokens)
                sentenceDict["parsedSentence"] = slotSentence

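Note that I have to create a dictionary that also tracks the location-value pair and the original sentence for each slot-sentence combination. The key part is generating the correct slotSentence.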

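Note that this is only an example: the number could even be 24000000 while the value in the sentence reads 24 million, and the same goes for trillion, million, billion, and thousand.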

If this is not possible, another option is to fill all the slots in the combination:

In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

and then perhaps adjust the sentence to remove the consecutive slots, but my preference is to do it all in one go.

Answer 1

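The code treats each locationTokenID as a slot of its own, when the locationTokenIDs actually denote the endpoints of the slice of tokens that should be treated as a single slot. So we need to remove the for locationTokenID in locationTokenIDs: loop (which iterates over each locationTokenID as if it were a slot) and instead replace the slice of words defined by that pair of locationTokenIDs with a single slot.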
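To illustrate the slice replacement described above, here is a minimal sketch. The shortened token list and the indices are made up for illustration only and are not taken from the question's data. It also shows a pitfall worth keeping in mind: the right-hand side of a slice assignment should be a list, because a bare string is iterated character by character.

```python
# Hypothetical, shortened token list used only for illustration
tokens = "of 2004 Hong Kong 's airport".split()

# Inclusive endpoints of the location span, in the style of tokenIDs2location
start, end = 2, 3

# Assign a one-element list: the two tokens collapse into a single slot
tokens[start:end + 1] = ["LOCATION_SLOT"]
slot_sentence = " ".join(tokens)
print(slot_sentence)  # of 2004 LOCATION_SLOT 's airport

# Pitfall: assigning a bare string spreads its characters over the list
wrong = "of 2004 Hong Kong 's airport".split()
wrong[start:end + 1] = "LOCATION_SLOT"
print(wrong[2:5])  # ['L', 'O', 'C']
```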

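The following code fixes the problem raised in the OP, but other issues remain (for example, only the last generated slotSentence is kept; I'll leave that to you, since I don't know what data structure you want to store the slot sentences in):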

sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."

tokenIDs2number = {(21,): 592, (24,): 92630, (30,): 7734}
tokenIDs2location = {(7, 8): 'Hong Kong'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        for numberTokenID in numberTokenIDs:
            finalTokens = sample.split()
            finalTokens[numberTokenID] = "NUMBER_SLOT"
            # Assign a one-element list; a bare string would be split into characters
            finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
            slotSentence = " ".join(finalTokens)
            sentenceDict["parsedSentence"] = slotSentence
            print(slotSentence)

Output:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.

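This can be extended to handle locations and numbers that contain any number of spaces. We do that by making both numberTokenIDs and locationTokenIDs 2-tuples that specify the range of tokens for each location or number: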

sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo."

tokenIDs2number = {(22, 22): '592', (25, 26): '92 630', (32, 33): '7 734'}
tokenIDs2location = {(7, 9): 'Hong Kong Central'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # Replace the number span first: it lies to the right of the location
        # span here, so the location indices are still valid afterwards.
        # The right-hand sides are one-element lists, not bare strings.
        finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
        finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
        slotSentence = " ".join(finalTokens)
        print(slotSentence)

Output:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.
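One possible way to address the two points left open in this answer (keeping every slot sentence, and not depending on the number appearing to the right of the location) is sketched below. The helper name fill_slots and the list name sentences are hypothetical, and the number keys are written as inclusive (start, end) pairs on the assumption that the 2-tuple convention above is used throughout.

```python
def fill_slots(tokens, spans):
    """Return a copy of tokens with each inclusive (start, end) span
    replaced by its slot label. spans maps (start, end) -> label."""
    result = list(tokens)
    # Work from the rightmost span backwards so that shrinking the list
    # never shifts the indices of spans that are still to be replaced.
    for (start, end), label in sorted(spans.items(), reverse=True):
        result[start:end + 1] = [label]
    return result


sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights , "
          "92,630 passengers , and more than 7,734 tons of cargo.")
tokenIDs2number = {(21, 21): 592, (24, 24): 92630, (30, 30): 7734}
tokenIDs2location = {(7, 8): "Hong Kong"}

sentences = []  # one dictionary per location/number combination
tokens = sample.split()
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        spans = {locationTokenIDs: "LOCATION_SLOT",
                 numberTokenIDs: "NUMBER_SLOT"}
        sentences.append({
            "sentence": sample,
            "location-value-pair": {location: number},
            "parsedSentence": " ".join(fill_slots(tokens, spans)),
        })

for entry in sentences:
    print(entry["parsedSentence"])
```

Replacing spans from right to left is a design choice: it keeps the original token indices usable without any offset bookkeeping. The sketch assumes the spans do not overlap.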

Answer 2

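Consider using str.replace() instead of splitting and slicing the sentence string. To do that, you need to render the values in tokenIDs2number with a thousands separator, which in Python 2.7+ can be done with format(int, ','), as @JonClements pointed out in a comment: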

sample = "In the first 11 months of 2004 Hong Kong 's international airport " + \
         "at Chek Lap Kok handled daily an average of 592 flights " + \
         "92,630 passengers , and more than 7,734 tons of cargo."    
tokenIDs2number = {(22,): 592, (25,): 92630,(34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}

sentenceList = []
# ITERATE ACROSS A LIST COMPREHENSION FOR ALL POSSIBLE COMBINATIONS
for item in [[s,i,j] for s in [sample] \
                     for i in tokenIDs2location.items() \
                     for j in tokenIDs2number.items()]:
    sentenceDict = {}  
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    sentenceDict["parsedSentence"] = sample.replace(item[1][1], 'LOCATION_SLOT').\
                                            replace(format(item[2][1], ','), 'NUMBER_SLOT')
    sentenceList.append(sentenceDict)

Output (of sentenceList):

[{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than NUMBER_SLOT tons of cargo.", 'location-value-pair': {'Hong Kong': 7734}},
 {'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights 92,630 passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 592}},
 {'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights NUMBER_SLOT passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 92630}}]
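The question mentions that a value such as 24000000 may appear in the text as 24 million, which format(int, ',') alone does not cover. A rough sketch of one way to widen the match is below; the names SCALES, surface_forms, and replace_number are hypothetical, and the sketch assumes integer values that are whole multiples of the scale word (so 2.4 million would not be found).

```python
# Hypothetical scale words, largest first; extend or adjust for your data
SCALES = [(10**12, "trillion"), (10**9, "billion"),
          (10**6, "million"), (10**3, "thousand")]


def surface_forms(number):
    """Return candidate spellings of an integer: thousands separators,
    plain digits, and '<n> <scale word>' for the largest scale that
    divides the number evenly."""
    forms = [format(number, ","), str(number)]
    for factor, word in SCALES:
        if number >= factor and number % factor == 0:
            forms.append("{} {}".format(number // factor, word))
            break
    return forms


def replace_number(sentence, number, slot="NUMBER_SLOT"):
    """Replace the first spelling of number that occurs in sentence."""
    for form in surface_forms(number):
        if form in sentence:
            # The third argument limits the change to the first occurrence
            return sentence.replace(form, slot, 1)
    return sentence


print(surface_forms(24000000))
print(replace_number("The airport handled 24 million passengers .", 24000000))
```

Note that any str.replace() approach matches substrings, so a value like 592 would also match inside 1,592; the token-index approach from Answer 1 avoids that at the cost of needing the indices.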

Answer 3

I have solved my use case, though in a roundabout way.

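I first allow slot sentences that contain several LOCATION_SLOT or NUMBER_SLOT tokens: if one of the tuples in the combination covers two or more tokens, I fill all of them: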

sentences2location2values = []

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location: number}
        # Fresh token list per combination, so slots from earlier
        # pairs do not accumulate
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"

        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"

        # Build and store one slot sentence per location/number pair
        slotSentence = " ".join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)

Then I alter the parsed sentences to remove consecutive location and number slots:

for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i, token in enumerate(sampleTokens):
        # Skip a slot token that directly repeats the previous slot token
        if i > 0 and ((token == "LOCATION_SLOT" and sampleTokens[i-1] == "LOCATION_SLOT") or (token == "NUMBER_SLOT" and sampleTokens[i-1] == "NUMBER_SLOT")):
            continue
        else:
            newTokens.append(token)

    sentence['parsedSentence'] = ' '.join(newTokens)
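The collapsing step above could also be written with itertools.groupby, which groups consecutive equal items. This is only an alternative sketch, not the code from the answer; the names SLOTS and collapse_slots are hypothetical.

```python
from itertools import groupby

SLOTS = {"LOCATION_SLOT", "NUMBER_SLOT"}


def collapse_slots(sentence):
    """Collapse runs of the same slot token into a single token."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOTS:
            collapsed.append(token)   # keep one copy per run of slots
        else:
            collapsed.extend(run)     # keep ordinary repeated words intact
    return " ".join(collapsed)


print(collapse_slots(
    "of 2004 LOCATION_SLOT LOCATION_SLOT 's airport NUMBER_SLOT NUMBER_SLOT flights"))
```

Only slot tokens are collapsed, so a sentence that legitimately repeats a word (for example "very very") is left unchanged.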

Conditional list comprehension of tuples when length is more than one
