Question

I have a pandas DataFrame with two columns, city and country. Both city and country contain missing values. Consider this DataFrame:

import numpy as np
import pandas as pd

temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B","country B","country B","country B", "country C", "country C", "country C", "country C"],
                     "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})

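I now want to fill the NaNs in the city column with the mode of city for the same country in the rest of the DataFrame. For example, for country A: city 1 is mentioned once and city 2 is mentioned twice, so the city column at index 2 should be filled with city 2, and so on.

What I have done: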

cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")
new_dict_locations = {}  # must exist before the loop assigns into it
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]

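Now that I have the country values and the corresponding city modes, I face two problems:

First: country C is bimodal, so its key holds two entries. I want this key to refer to each entry with equal probability. The real dataset has cases with more modes, so the entry can be a list with len > 2.

Second: I am stuck replacing the NaNs in city with the value that new_dict_locations holds for the country in the same row. In pseudocode, this would be: `iterate over the column 'city'; if the value at position temp[i, city] is missing, take the value of 'country' in that row (-> 'country_tmp'); use 'country_tmp' as the key into the dictionary 'new_dict_locations'; if the dictionary entry at key 'country_tmp' is a list, randomly pick one item from that list; take the returned value (-> 'city_tmp') and fill the missing cell (temp[i, city]) with 'city_tmp'`.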

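I have tried different combinations of .fillna() and .replace() (and read this and other questions) to no avail. Can someone give me a pointer?

Many thanks.

(Note: the referenced question replaces values in one cell based on a dictionary; my reference values, however, are in a different column.)

**EDIT** Executing temp["city"].fillna(temp['country'], inplace=True) and temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict' [For the original dataset, this error is TypeError: unhashable type: 'numpy.ndarray', but I cannot reproduce that with the example. If someone knows where the difference comes from, I would be happy to hear their thoughts.]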

Answer 1

Try map with the dictionary new_dict_locations to create a new Series s, then map over s again with np.random.choice to pick a value from each array. Finally, use s in fillna.

s = (temp.country.map(new_dict_locations)
                 .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))
temp['city'] = temp.city.fillna(s)

Out[247]:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 6
10  country C  city 5
11  country C  city 6

Note: I thought the two map calls could be merged into one using a dict comprehension. However, doing so would lose randomness, because each country would then get a single pick shared by all of its rows.
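For anyone who wants to run the per-row idea end to end, here is a minimal self-contained sketch. It rebuilds the example frame and derives the modes with a dict comprehension over the groups instead of the asker's new_dict_locations; the names modes, rng and filled, and the seed 0, are my own choices for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "country": ["country A"] * 4 + ["country B"] * 4 + ["country C"] * 4,
    "city": ["city 1", "city 2", np.nan, "city 2",
             "city 3", "city 3", np.nan, "city 4",
             "city 5", np.nan, np.nan, "city 6"],
})

# Map each country to the list of its most frequent cities (one or more).
# Series.mode() ignores NaN by default.
modes = {country: grp.mode().tolist()
         for country, grp in temp.groupby("country")["city"]}

# Seeded generator so the run is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Draw one mode per row, so tied modes are sampled independently for each NaN.
s = temp["country"].map(lambda c: rng.choice(modes[c]))

# fillna aligns on the index and only touches the missing cells.
filled = temp["city"].fillna(s)
print(pd.concat([temp["country"], filled], axis=1))
```

Because the draw happens per row, the two missing cities in country C may or may not end up equal, which is the equal-probability behaviour asked for in the question.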

Answer 2

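def get_mode(d):
    # Collapse each multi-modal entry to a single mode, chosen at random.
    for k, v in d.items():
        if isinstance(v, np.ndarray) and len(v) > 1:
            # np.random.choice draws uniformly by default, so this also
            # works when a country has more than two modes.
            d[k] = np.random.choice(v)
    return d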

The dictionary below is the one used for filling.

new_dict_locations=get_mode(new_dict_locations)
keys=list(new_dict_locations.keys())
values=list(new_dict_locations.values())

# Filling happens here
temp.city=temp.city.fillna(temp.country).replace(keys, values)

This gives the desired output:

      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 5
10  country C  city 5
11  country C  city 6
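One thing worth noting about this approach: the random choice is made once per country, before filling, so every missing city within a country receives the same value (rows 9 and 10 both get city 5 in the output shown). The sketch below illustrates that behaviour in a self-contained way. It uses map on the country column rather than replace, which is my own substitution to avoid touching existing city values; one_mode, rng and the seed are likewise assumptions for illustration.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "country": ["country A"] * 4 + ["country B"] * 4 + ["country C"] * 4,
    "city": ["city 1", "city 2", np.nan, "city 2",
             "city 3", "city 3", np.nan, "city 4",
             "city 5", np.nan, np.nan, "city 6"],
})

# Seeded generator so the run is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Pick exactly one mode per country; ties are broken uniformly at random.
one_mode = {country: rng.choice(grp.mode().tolist())
            for country, grp in temp.groupby("country")["city"]}

# Look up the chosen city for each row's country and fill only the NaN cells.
filled = temp["city"].fillna(temp["country"].map(one_mode))
print(pd.concat([temp["country"], filled], axis=1))
```

If the missing cells of a country should be allowed to differ from each other, the choice has to be drawn per row instead of per country.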

Pandas: replace NaN with the mode of another column, based on a second column
