Conditional list comprehension over tuples when their length is greater than one

Asked: 2016-08-14 14:44:05

Tags: python list dictionary tuples

I have a sentence, together with tuples of token IDs that mark where a location or a number appears in it:

sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."

Then:

tokenIDs2number = {(22,): 592.00, (25,): 92630.00, (34,): 7734.00}
tokenIDs2location = {(8,9): 'Hong Kong'}

For the different combinations of these tuples I need to create the corresponding sentences, which I call slot sentences:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.

However, my current code essentially takes combinations of the elements inside the tuples, so I end up with two sentences such as:

In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

as one example.

How can I fix this so that, when I have a tuple key with len > 1, all the tokens covered by that key are filled with a single LOCATION_SLOT or NUMBER_SLOT, as I want?

Current code:

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        for locationTokenID in locationTokenIDs:
            for numberTokenID in numberTokenIDs:
                finalTokens = cleanSample.split()
                finalTokens[numberTokenID] = "NUMBER_SLOT"
                finalTokens[locationTokenID] = "LOCATION_SLOT"
                slotSentence = (" ").join(finalTokens)
                sentenceDict["parsedSentence"] = slotSentence

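Note that I have to create a dictionary that also keeps track of the location-value pair and the original sentence for each slot-sentence combination. The key part is generating the correct slotSentence.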

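Note also that this is just an example: the number may even be 24000000 where the sentence says 24 million, and the same goes for trillion, million, billion and thousand.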

If this is not possible, another option is to fill every slot in the combination:

In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

and then perhaps adjust the sentence to remove the consecutive slots, but my preference is to do it all in one go.

3 Answers:

Answer 0 (score: 0)

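The code treats each locationTokenID as a slot of its own, when the locationTokenIDs actually denote the endpoints of a slice of tokens that should be treated as one slot. So we need to remove the for locationTokenID in locationTokenIDs: loop (which iterates over every locationTokenID as if it were a slot) and instead replace the slice of words defined by that pair of locationTokenIDs with a single slot.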

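The following code fixes the problem raised in the OP, but other problems remain (for example, only the last generated slotSentence is kept; I'll let you fix that, since I don't know what data structure you want to store the slot sentences in):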

sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."

tokenIDs2number = {(21,): 592, (24,): 92630,(30,): 7734}
tokenIDs2location = {(7,8): 'Hong Kong'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        for numberTokenID in numberTokenIDs:
            finalTokens = sample.split()
            finalTokens[numberTokenID] = "NUMBER_SLOT"
            # Assign a one-element list: assigning a bare string to a slice
            # would insert one token per character ("L", "O", "C", ...)
            finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
            slotSentence = (" ").join(finalTokens)
            sentenceDict["parsedSentence"] = slotSentence
            print(slotSentence)

Output:

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.

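This can be extended to work with locations and numbers that contain any number of spaces. We do this by making both numberTokenIDs and locationTokenIDs 2-length tuples that specify a range of tokens for each location/number: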

sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo."

tokenIDs2number = {(22,22): '592', (25,26): '92 630',(32,33): '7 734'}
tokenIDs2location = {(7,9): 'Hong Kong Central'}

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # The number comes after the location here, so replace it first:
        # shrinking the list only shifts the indices to the right of the slice
        finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
        finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
        slotSentence = (" ").join(finalTokens)
        print(slotSentence)

Output (in dict insertion order, as on Python 3.7+):

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.

In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.
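The slice replacement in this answer can be folded into a small helper that accepts tuple keys of any length and keeps every generated sentence instead of only the last one. What follows is only a sketch: the names fill_slots and slot_sentences are made up for illustration, and it assumes each key lists consecutive, 0-based token indices. Spans are replaced from right to left so that shrinking the list never invalidates an index that is still to be used.

```python
from itertools import product


def fill_slots(tokens, spans):
    """Return a copy of tokens with each (token_ids, label) span collapsed to one label.

    Assumption: token_ids is a tuple of consecutive 0-based indices, e.g. (7, 8) or (21,).
    """
    out = list(tokens)
    # Work from the rightmost span to the leftmost so earlier indices stay valid.
    for token_ids, label in sorted(spans, key=lambda s: s[0][0], reverse=True):
        out[token_ids[0]:token_ids[-1] + 1] = [label]
    return out


def slot_sentences(sample, token_ids2location, token_ids2number):
    """Build one dictionary per (location, number) combination."""
    tokens = sample.split()
    results = []
    for (loc_ids, location), (num_ids, number) in product(
            token_ids2location.items(), token_ids2number.items()):
        filled = fill_slots(tokens, [(loc_ids, "LOCATION_SLOT"),
                                     (num_ids, "NUMBER_SLOT")])
        results.append({
            "sentence": sample,
            "location-value-pair": {location: number},
            "parsedSentence": " ".join(filled),
        })
    return results


sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights , "
          "92,630 passengers , and more than 7,734 tons of cargo.")
tokenIDs2number = {(21,): 592, (24,): 92630, (30,): 7734}
tokenIDs2location = {(7, 8): "Hong Kong"}

results = slot_sentences(sample, tokenIDs2location, tokenIDs2number)
for entry in results:
    print(entry["parsedSentence"])
```

Because fill_slots only reads the first and last index of each key, a one-element key such as (21,) and a longer key such as (7, 8) go through the same code path.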

Answer 1 (score: 0)

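Consider using str.replace() instead of splitting and slicing the sentence string. For this to work, the numbers in tokenIDs2number have to be converted to strings with thousands separators; as @JonClements pointed out in a comment, on Python 2.7+ this can be done with format(int, ','):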

sample = "In the first 11 months of 2004 Hong Kong 's international airport " + \
         "at Chek Lap Kok handled daily an average of 592 flights " + \
         "92,630 passengers , and more than 7,734 tons of cargo."    
tokenIDs2number = {(22,): 592, (25,): 92630,(34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}

sentenceList = []
# ITERATE ACROSS A LIST COMPREHENSION FOR ALL POSSIBLE COMBINATIONS
for item in [[s,i,j] for s in [sample] \
                     for i in tokenIDs2location.items() \
                     for j in tokenIDs2number.items()]:
    sentenceDict = {}  
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    sentenceDict["parsedSentence"] = sample.replace(item[1][1], 'LOCATION_SLOT').\
                                            replace(format(item[2][1], ','), 'NUMBER_SLOT')
    sentenceList.append(sentenceDict)

Output (of sentenceList)

[{'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than NUMBER_SLOT tons of cargo.", 'location-value-pair': {'Hong Kong': 7734}}
 {'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights 92,630 passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 592}}
 {'sentence': "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights 92,630 passengers , and more than 7,734 tons of cargo.", 'parsedSentence': "In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights NUMBER_SLOT passengers , and more than 7,734 tons of cargo.", 'location-value-pair': {'Hong Kong': 92630}}]
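One caveat with str.replace() is that it replaces every occurrence of the search string, including occurrences inside longer tokens. The sketch below shows one possible way to limit the replacement to the first whitespace-delimited match using re.sub; the helper name replace_token and the example sentence are hypothetical and are not part of the answer above.

```python
import re


def replace_token(sentence, text, slot):
    """Replace the first occurrence of text that is delimited by whitespace (or string ends)."""
    # (?<!\S) and (?!\S) require that no non-space character touches the match.
    pattern = r"(?<!\S)" + re.escape(text) + r"(?!\S)"
    return re.sub(pattern, slot, sentence, count=1)


# Hypothetical sentence in which "592" also appears inside the token "1592".
sentence = "In 1592 the port handled 592 flights , 92,630 passengers"

naive = sentence.replace("592", "NUMBER_SLOT")
careful = replace_token(sentence, "592", "NUMBER_SLOT")

print(naive)    # the "592" inside "1592" is replaced as well
print(careful)  # only the standalone token "592" is replaced
```

This still ignores the token IDs, so if the same value occurs twice as a standalone token, only the first one is replaced, whichever position the key actually referred to.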

Answer 2 (score: 0)

I have solved my use case, but in a roundabout way.

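I first allow slot sentences that contain several LOCATION_SLOT or NUMBER_SLOT tokens: if one of the tuples in a combination contains 2 or more token IDs, I fill all of them: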

sentences2location2values = []

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        # Start from a fresh token list so slots from earlier combinations do not carry over
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"

        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"

        # Build and store one slot sentence per (location, number) combination
        slotSentence = (" ").join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)

Then I alter the parsed sentences to remove consecutive location and number slots:

for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i,token in enumerate(sampleTokens):
        # Skip a slot token that merely repeats the slot immediately before it
        if i>0 and ((token == "LOCATION_SLOT" and sampleTokens[i-1]=="LOCATION_SLOT") or (token == "NUMBER_SLOT" and sampleTokens[i-1]=="NUMBER_SLOT")):
            continue
        else:
            newTokens.append(token)

    sentence['parsedSentence']=(' ').join(newTokens)
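The clean-up pass above can also be written with itertools.groupby, which groups runs of equal adjacent tokens. This is only a sketch of the same idea, with a made-up helper name (collapse_repeated_slots) and a shortened example sentence; runs of ordinary words are kept untouched, and only runs of slot markers are collapsed.

```python
from itertools import groupby

SLOTS = {"LOCATION_SLOT", "NUMBER_SLOT"}


def collapse_repeated_slots(sentence):
    """Collapse each run of identical adjacent slot markers into a single marker."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOTS:
            collapsed.append(token)   # keep one marker per run
        else:
            collapsed.extend(run)     # keep ordinary words as they are, repeats included
    return " ".join(collapsed)


parsed = "of 2004 LOCATION_SLOT LOCATION_SLOT 's airport handled NUMBER_SLOT flights"
cleaned = collapse_repeated_slots(parsed)
print(cleaned)
```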