Question

I have this dictionary and this dataframe:

In [40]: 
atemp

Out[40]:
{0: ['adc telecommunications inc'],
 1: ['aflac inc'],
 2: ['agco corporation'],
 3: ['agl resources inc'],
 4: ['invesco ltd'],
 5: ['ak steel holding corporation'],
 6: ['amn healthcare services inc'],
 7: ['amr corporation']}

In [42]:

cemptemp


Out[42]:
        Company name                   nstandar
   0    1-800-FLOWERS.COM              1800flowerscom
   1    1347 PROPERTY INS HLDGS INC    1347 property ins hldgs inc
   2    1ST CAPITAL BANK               1st capital bank
   3    1ST CENTURY BANCSHARES INC     1st century bancshares inc
   4    1ST CONSTITUTION BANCORP       1st constitution bancorp
   5    1ST ENTERPRISE BANK            1st enterprise bank
   6    1ST PACIFIC BANCORP            1st pacific bancorp
   7    1ST SOURCE CORP                1st source corporation

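With my code, I take each value of the dictionary and look for the elements of the nstandar column of the pandas dataframe whose Jaccard distance to that value is at least 0.1. I then want to build a new dictionary in which the keys are the values of the previous dictionary and the values are the dataframe entries selected by the Jaccard distance.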

I have tried the code below, but it gives only one value per key, and I know each key should have a list of values.

sd={ y : row['nstandar'] for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}

So sd is:

{'adc telecommunications inc': '1st century bancshares inc',
 'aflac inc': '1st century bancshares inc',
 'agco corporation': '1st source corporation',
 'agl resources inc': '1st century bancshares inc',
 'ak steel holding corporation': '1st source corporation',
 'amn healthcare services inc': '1st century bancshares inc',
 'amr corporation': '1st source corporation'}

However, the expected output for the first key is: 'adc telecommunications inc' :[ '1347 property ins hldgs inc' , '1st century bancshares inc']

So, how can I fix my code to get what I want?

EDIT: The code for the Jaccard distance is:

def jack(a,b):
    x=a.split()
    y=b.split()
    xy = set(x+y)              
    return float(len(x)+len(y)-len(xy))/float(len(xy))
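
As a quick sanity check of the threshold, here is a minimal sketch that applies the same function to two strings from the data above. Note that what it returns is the share of tokens the two strings have in common (intersection over union), which is usually called Jaccard similarity rather than distance, so a larger score means a closer match. The count of shared tokens is only exact when neither string repeats a word.

```python
def jack(a, b):
    # Score from the question: shared tokens divided by all distinct tokens.
    x = a.split()
    y = b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))

# One shared token ('inc') out of six distinct tokens.
score = jack('adc telecommunications inc', '1st century bancshares inc')
print(score)

# No shared tokens, so the score is zero and the pair is rejected.
no_overlap = jack('invesco ltd', '1st capital bank')
print(no_overlap)
```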

EDIT 2: I came up with a solution:

from collections import defaultdict

td=defaultdict(list)

for k,value in atemp.iteritems():
    for y in value:
        for index , row in cemptemp.iterrows():
            if jack(y,row['nstandar'])>=0.1:
                td[y].append(row['nstandar'])

However, if I try to write the same code as a dictionary comprehension, it does not work:

from collections import defaultdict

td=defaultdict(list)


td={y : td[y].append(row['nstandar']) for k,value in atemp.iteritems() for y in value for index , row in cemptemp.iterrows() if jack(y,row['nstandar'])>=0.1}

So, what is the difference between my solution and the code with the dictionary comprehension?

Answer 1

In the first version of your code:

sd={ y : row['nstandar'] ...... }

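row['nstandar'] is a string, so the result is a {str: str} dictionary, which cannot be what you expect. Each new match for the same key y overwrites the previous one, which is why only the last match survives.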
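
The overwriting can be seen in a minimal sketch. The list of pairs is made up for illustration and stands in for the (key, match) combinations that pass the threshold:

```python
# Two matches for the same key, in the order the loops would produce them.
pairs = [
    ('adc telecommunications inc', '1347 property ins hldgs inc'),
    ('adc telecommunications inc', '1st century bancshares inc'),
]

# A dict holds one value per key, so the second pair replaces the first.
sd = {key: match for key, match in pairs}
print(sd)
```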

In your second version:

`{y : td[y].append(row['nstandar']) ......}`

td[y].append(...) is a list append operation, and its return value is None. So the result is equivalent to {y: None}.
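
A small sketch of this behaviour, using made-up pairs in place of the dataframe rows. The comprehension is bound to a separate name here (result, an illustrative choice) so that both objects can be inspected. In the original code the comprehension is assigned back to td, so the filled defaultdict is thrown away.

```python
from collections import defaultdict

td = defaultdict(list)
pairs = [('a', 'x'), ('a', 'z')]  # hypothetical (key, match) pairs

# list.append mutates the list in place and returns None,
# so None is what ends up stored as the value of every key.
result = {y: td[y].append(name) for y, name in pairs}

print(result)    # the comprehension itself only collected None
print(dict(td))  # the lists were filled as a side effect
```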

If I understand your requirement correctly, the following code should work:

from itertools import chain
{y: [row['nstandar'] for index, row in cemptemp.iterrows() if jack(y, row['nstandar'])>=0.1]
 for y in chain(*atemp.values())}

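There is only one possible difference: it will also add 'invesco ltd': [] to the result dictionary. If you really want to filter that out in a single line of code, wrap it with {k: v for k, v in XXXX.items() if len(v) > 0}.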
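
Put together, the nested comprehension and the filter might look like the sketch below. To keep it runnable without pandas, a plain list (nstandar, an assumption for illustration) stands in for the cemptemp['nstandar'] column, and only part of the data is included.

```python
from itertools import chain

def jack(a, b):
    # Score from the question: shared tokens divided by all distinct tokens.
    x, y = a.split(), b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))

atemp = {0: ['adc telecommunications inc'], 1: ['aflac inc'],
         2: ['agco corporation'], 4: ['invesco ltd']}

# Assumption: a plain list stands in for cemptemp['nstandar'].
nstandar = ['1800flowerscom', '1347 property ins hldgs inc',
            '1st capital bank', '1st century bancshares inc',
            '1st source corporation']

# The inner list comprehension collects every match for one key.
matches = {y: [name for name in nstandar if jack(y, name) >= 0.1]
           for y in chain(*atemp.values())}

# Drop the keys whose list of matches is empty.
non_empty = {k: v for k, v in matches.items() if len(v) > 0}
print(non_empty)
```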

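However, I don't recommend using a comprehension for logic this complex. Dict comprehensions are meant for concise code that is easy to write and read. With complex logic they only make things worse. In my opinion, your for-loop solution is better.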
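
For reference, a sketch of the loop-based approach that also runs on Python 3, where dict.iteritems() no longer exists and dict.items() or dict.values() is used instead. As an assumption for illustration, a plain list (nstandar) stands in for the cemptemp['nstandar'] column so the example does not need pandas.

```python
from collections import defaultdict

def jack(a, b):
    # Score from the question: shared tokens divided by all distinct tokens.
    x, y = a.split(), b.split()
    xy = set(x + y)
    return float(len(x) + len(y) - len(xy)) / float(len(xy))

atemp = {0: ['adc telecommunications inc'],
         2: ['agco corporation'],
         4: ['invesco ltd']}

# Assumption: a plain list stands in for cemptemp['nstandar'].
nstandar = ['1347 property ins hldgs inc', '1st capital bank',
            '1st century bancshares inc', '1st source corporation']

td = defaultdict(list)
for values in atemp.values():
    for y in values:
        for name in nstandar:
            if jack(y, name) >= 0.1:
                # The key is only created when there is at least one match.
                td[y].append(name)

print(dict(td))
```

Because td[y] is only touched when a match is found, keys without matches such as 'invesco ltd' never appear, so no extra filtering step is needed.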

List of values for each key in a dictionary comprehension - Python
