Creating a multi-index DataFrame

Time: 2020-09-22 18:12:40

Tags: pandas dataframe multi-index

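I can't figure out how to create a multi-index DataFrame where each first-level key has a different number of second-level entries. Here is an example: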

data = [{'caterpillar': [('Сatérpillar',
    {'fuzz': 0.82,
     'levenshtein': 0.98,
     'jaro_winkler': 0.9192,
     'hamming': 0.98}),
   ('caterpiⅼⅼaʀ',
    {'fuzz': 0.73,
     'levenshtein': 0.97,
     'jaro_winkler': 0.9114,
     'hamming': 0.97}),
   ('cÂteԻpillÂr',
    {'fuzz': 0.73,
     'levenshtein': 0.97,
     'jaro_winkler': 0.881,
     'hamming': 0.97})]},
 {'elementis': [('elEmENtis',
    {'fuzz': 1.0, 'levenshtein': 1.0, 'jaro_winkler': 1.0, 'hamming': 1.0}),
   ('ÊlemĚntis',
    {'fuzz': 0.78,
     'levenshtein': 0.98,
     'jaro_winkler': 0.863,
     'hamming': 0.98}),
   ('еlÈmÈntis',
    {'fuzz': 0.67,
     'levenshtein': 0.97,
     'jaro_winkler': 0.8333,
     'hamming': 0.97})]},
 {'gibson': [('giBᏚon',
    {'fuzz': 0.83,
     'levenshtein': 0.99,
     'jaro_winkler': 0.9319,
     'hamming': 0.99}),
   ('ɡibsoN',
    {'fuzz': 0.83,
     'levenshtein': 0.99,
     'jaro_winkler': 0.9206,
     'hamming': 0.99}),
   ('giЬႽon',
    {'fuzz': 0.67,
     'levenshtein': 0.98,
     'jaro_winkler': 0.84,
     'hamming': 0.98}),
   ('glbsՕn',
    {'fuzz': 0.67,
     'levenshtein': 0.98,
     'jaro_winkler': 0.8333,
     'hamming': 0.98})]}]

I want a DataFrame like this (note: each "Orig Name" has a different number of "Other Name" values):

Orig Name   | Other Name  | fuzz | levenshtein | Jaro-Winkler | Hamming
------------------------------------------------------------------------
caterpillar   Сatérpillar   0.82   0.98          0.9192         0.98
              caterpiⅼⅼaʀ   0.73   0.97          0.9114         0.97
              cÂteԻpillÂr   0.73   0.97          0.881          0.97
gibson        giBᏚon        0.83   0.99          0.9319         0.99
              ɡibsoN        0.83   0.99          0.9206         0.99
              giЬႽon        0.67   0.98          0.84           0.98
              glbsՕn        0.67   0.98          0.8333         0.98
elementis     .........
------------------------------------------------------------------------

I tried:

orig_name_list = [x for d in data for x, v in d.items()]
value_list = [v for d in data for x, v in d.items()]
other_names = [tup[0] for tup_list in value_list for tup in tup_list]
algos = ['fuzz', 'levenshtein', 'jaro_winkler', 'hamming']

I'm not sure how to continue from there. Suggestions are appreciated.

2 answers:

Answer 0 (score: 2)

Let's try pd.concat:

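import pandas as pd
# One single-row frame per (original name, other name) pair, stacked with concat.
pd.concat([pd.DataFrame([x[1]]).assign(OrigName=k, OtherName=x[0])
               for entry in data for k, d in entry.items() for x in d])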

Output:

   fuzz  levenshtein  jaro_winkler  hamming     OrigName    OtherName
0  0.82         0.98        0.9192     0.98  caterpillar  Сatérpillar
0  0.73         0.97        0.9114     0.97  caterpillar  caterpiⅼⅼaʀ
0  0.73         0.97        0.8810     0.97  caterpillar  cÂteԻpillÂr
0  1.00         1.00        1.0000     1.00    elementis    elEmENtis
0  0.78         0.98        0.8630     0.98    elementis    ÊlemĚntis
0  0.67         0.97        0.8333     0.97    elementis    еlÈmÈntis
0  0.83         0.99        0.9319     0.99       gibson       giBᏚon
0  0.83         0.99        0.9206     0.99       gibson       ɡibsoN
0  0.67         0.98        0.8400     0.98       gibson       giЬႽon
0  0.67         0.98        0.8333     0.98       gibson       glbsՕn
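
The frame above has a repeated 0 index and keeps both names as ordinary columns. A minimal sketch of one way to get from there to the two-level index asked for in the question, assuming the same `data` structure (abbreviated here to three rows) and using `set_index` on the two name columns:

```python
import pandas as pd

# Abbreviated sample in the same shape as `data` from the question.
data = [
    {"caterpillar": [
        ("Сatérpillar", {"fuzz": 0.82, "levenshtein": 0.98,
                         "jaro_winkler": 0.9192, "hamming": 0.98}),
        ("caterpiⅼⅼaʀ", {"fuzz": 0.73, "levenshtein": 0.97,
                         "jaro_winkler": 0.9114, "hamming": 0.97}),
    ]},
    {"gibson": [
        ("giBᏚon", {"fuzz": 0.83, "levenshtein": 0.99,
                    "jaro_winkler": 0.9319, "hamming": 0.99}),
    ]},
]

# One single-row frame per (original name, other name) pair.
frames = [
    pd.DataFrame([scores]).assign(OrigName=orig, OtherName=other)
    for entry in data
    for orig, matches in entry.items()
    for other, scores in matches
]

# Stack the rows, then move the two name columns into a two-level index.
df = pd.concat(frames).set_index(["OrigName", "OtherName"])
print(df)
```

Selecting on the first level, for example `df.loc["caterpillar"]`, then returns only the rows for that original name, however many there are.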

Answer 1 (score: 1)

One approach is to reshape the data into JSON-style records and pass them to pd.json_normalize. In its current form, your data cannot easily be loaded into a DataFrame:

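import pandas as pd

# Reshape each entry into {"name": ..., "matches": [record, ...]}.
new_data = []
for entry in data:
    for name, matches in entry.items():
        new_entry = {"name": name, "matches": []}
        for other_name, scores in matches:
            # Copy the scores so the original `data` is not mutated.
            new_entry["matches"].append({**scores, "match": other_name})
        new_data.append(new_entry)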


df = pd.json_normalize(new_data, "matches", ["name"]).set_index(["name", "match"])

print(df)
                         fuzz  levenshtein  jaro_winkler  hamming
name        match                                                
caterpillar Сatérpillar  0.82         0.98        0.9192     0.98
            caterpiⅼⅼaʀ   0.73         0.97        0.9114     0.97
            cÂteԻpillÂr  0.73         0.97        0.8810     0.97
elementis   elEmENtis    1.00         1.00        1.0000     1.00
            ÊlemĚntis    0.78         0.98        0.8630     0.98
            еlÈmÈntis    0.67         0.97        0.8333     0.97
gibson      giBᏚon       0.83         0.99        0.9319     0.99
            ɡibsoN       0.83         0.99        0.9206     0.99
            giЬႽon       0.67         0.98        0.8400     0.98
            glbsՕn       0.67         0.98        0.8333     0.98
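
As a possible alternative that stays closer to the lists built in the question, the index can be constructed explicitly with `pd.MultiIndex.from_tuples`. This is only a sketch under the assumption that `data` has the shape shown in the question (abbreviated here); the level names `name` and `match` and the helper variables `index_tuples` and `records` are chosen for illustration:

```python
import pandas as pd

# Abbreviated sample in the same shape as `data` from the question.
data = [
    {"caterpillar": [
        ("Сatérpillar", {"fuzz": 0.82, "levenshtein": 0.98,
                         "jaro_winkler": 0.9192, "hamming": 0.98}),
    ]},
    {"gibson": [
        ("giBᏚon", {"fuzz": 0.83, "levenshtein": 0.99,
                    "jaro_winkler": 0.9319, "hamming": 0.99}),
        ("ɡibsoN", {"fuzz": 0.83, "levenshtein": 0.99,
                    "jaro_winkler": 0.9206, "hamming": 0.99}),
    ]},
]

# Collect one (original name, other name) tuple and one score dict per row.
index_tuples = []
records = []
for entry in data:
    for orig, matches in entry.items():
        for other, scores in matches:
            index_tuples.append((orig, other))
            records.append(scores)

# Build the two-level index, then attach it to the score rows.
index = pd.MultiIndex.from_tuples(index_tuples, names=["name", "match"])
df = pd.DataFrame(records, index=index)

print(df)
print(df.loc["gibson"])  # only the rows whose first level is "gibson"
```

Because the tuples are collected row by row, it does not matter that each original name has a different number of matches.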