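A column in my pandas DataFrame has a small number of unique values (say 4). These values occur across the rows in some initial proportion, and I need to change them so that they match a desired proportion supplied as input. Suppose I have 100 rows and the column city has values in the following proportions: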
Mumbai 30%
Kolkata 40%
Chennai 10%
Delhi 20%
Now I need to change the values in the column so that I get the desired proportion (or data structure):
Mumbai 20%
Kolkata 50%
Chennai 20%
Delhi 10%
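While doing this, I want to make sure that when the share of rows with city Mumbai goes, say, from 25% to 20%, the 20% that remain Mumbai are rows that were already Mumbai, and only the remaining 5% are changed. In other words, I do not want to wipe all the values and refill the column according to the new proportions. I am trying to do this in a pandas DataFrame. Any help is appreciated.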
Edit: say my column looks like this for 10 rows:
1 Mumbai
2 Mumbai
3 Mumbai
4 Kolkata
5 Kolkata
6 Kolkata
7 Kolkata
8 Chennai
9 Delhi
10 Delhi
Now I want to change it as described above, to something like this:
1 Mumbai
2 Mumbai
3 Kolkata
4 Kolkata
5 Kolkata
6 Kolkata
7 Kolkata
8 Chennai
9 Chennai
10 Delhi
The change should not be random: the new Mumbai rows are a subset of the previous Mumbai rows.
Answer 0 (score: 0)
from collections import Counter
import pandas as pd

def set_proportion(df, column, new_proportion):
    # current share of each value in the column
    proportion = (df[column].value_counts() / df.shape[0]).to_dict()
    # signed change in share per value; a value absent from the column counts as 0
    prop_diff = {key: new_proportion[key] - proportion.get(key, 0.0) for key in new_proportion}
    # the same change expressed as a whole number of rows
    prop_diff_cnt = {key: int(round(value * df.shape[0])) for key, value in prop_diff.items()}
    to_add = {key: diff for key, diff in prop_diff_cnt.items() if diff > 0}
    to_remove = {key: diff for key, diff in prop_diff_cnt.items() if diff < 0}
    # expand to flat lists with one entry per row that has to change
    to_add = sum(([key] * diff for key, diff in to_add.items()), [])
    to_remove = sum(([key] * -diff for key, diff in to_remove.items()), [])
    # group to counter to do updates to the dataframe in bulk, one update per each *unique* replacement pair
    counter = Counter(list(zip(to_remove, to_add)))
    for (remove, add), count in counter.items():
        # overwrite the last `count` rows holding `remove`; the earlier rows keep their value
        df.loc[df[df[column] == remove].iloc[-count:].index, column] = add

df = pd.DataFrame(["Mumbai"] * 3 + ["Kolkata"] * 4 + ["Chennai"] + ["Delhi"] * 2, columns=['city'])
print(df)
city
0 Mumbai
1 Mumbai
2 Mumbai
3 Kolkata
4 Kolkata
5 Kolkata
6 Kolkata
7 Chennai
8 Delhi
9 Delhi
set_proportion(df, 'city', {'Mumbai': 0.2, 'Kolkata': 0.5, 'Chennai': 0.2, 'Delhi': 0.1})
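# which city replaces which follows dict iteration order, so the changed rows
# can differ between Python versions; the resulting proportions do not
print(df)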
city
0 Mumbai
1 Mumbai
2 Chennai
3 Kolkata
4 Kolkata
5 Kolkata
6 Kolkata
7 Chennai
8 Delhi
9 Kolkata
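
As a quick sanity check (not part of the original answer), the result can be compared against the requested proportions with `value_counts(normalize=True)`. The sketch below rebuilds the column from the printed output above; the variable names `after`, `target`, `observed` and `matches` are only illustrative.

```python
import pandas as pd

# Column values copied from the printed output above.
after = pd.Series(
    ["Mumbai", "Mumbai", "Chennai", "Kolkata", "Kolkata",
     "Kolkata", "Kolkata", "Chennai", "Delhi", "Kolkata"],
    name="city",
)
target = {"Mumbai": 0.2, "Kolkata": 0.5, "Chennai": 0.2, "Delhi": 0.1}

# Share of each city, as a plain dict of floats.
observed = after.value_counts(normalize=True).to_dict()

# Compare with a small tolerance because the shares are floats.
matches = all(
    abs(observed.get(city, 0.0) - share) < 1e-9
    for city, share in target.items()
)
print(matches)
```
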
# set_proportion modifies the original dataframe, so we need to reinitialize it
df = pd.DataFrame(["Mumbai"] * 3 + ["Kolkata"] * 4 + ["Chennai"] + ["Delhi"] * 2, columns=['city'])
set_proportion(df, 'city', {'Mumbai': 0.2, 'Kolkata': 0.1, 'Chennai': 0.3, 'Delhi': 0.4})
print(df)
city
0 Mumbai
1 Mumbai
2 Delhi
3 Kolkata
4 Delhi
5 Chennai
6 Chennai
7 Chennai
8 Delhi
9 Delhi
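
Since `set_proportion` changes the DataFrame in place, a non-mutating variant may be more convenient in some workflows. The following is only a sketch under a few assumptions: `rebalance` is a hypothetical helper (not a pandas function), the index is assumed to be unique, and the target proportions are assumed to map to whole row counts that add up to the number of rows. It follows the same idea as the answer (overwrite the last rows of each shrinking value and leave the rest untouched), but which city replaces which may differ from the outputs shown above.

```python
import pandas as pd


def rebalance(df, column, target):
    """Return a copy of df whose `column` matches the `target` proportions.

    Hypothetical helper for illustration; assumes a unique index.
    """
    out = df.copy()
    n = len(out)
    current = out[column].value_counts()
    # Signed number of rows each value must gain (> 0) or lose (< 0).
    delta = {k: int(round(p * n)) - int(current.get(k, 0)) for k, p in target.items()}
    if sum(delta.values()) != 0:
        raise ValueError("target proportions do not match the number of rows")
    # One replacement label per row that has to change.
    to_add = [k for k, d in delta.items() for _ in range(max(d, 0))]
    # Index labels to overwrite: the last -d rows of every shrinking value.
    idx = []
    for k, d in delta.items():
        if d < 0:
            idx.extend(out.index[out[column] == k][d:])
    if idx:
        out.loc[idx, column] = to_add
    return out


df = pd.DataFrame({"city": ["Mumbai"] * 3 + ["Kolkata"] * 4 + ["Chennai"] + ["Delhi"] * 2})
new = rebalance(df, "city", {"Mumbai": 0.2, "Kolkata": 0.5, "Chennai": 0.2, "Delhi": 0.1})
print(new["city"].value_counts(normalize=True))
```

The original `df` is left as it was, so there is no need to rebuild it before trying a second set of proportions.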