I have a dataframe that looks like this:
Country Year Value
USA 1991 22
USA 1992 3
USA 1993 10
China 1991 1
China 1993 15
Argentina 1991 6
Argentina 1992 4
I need a function that finds the years missing for each country and adds them to the dataframe with a NaN value, like this:
Country Year Value
USA 1991 22
USA 1992 3
USA 1993 10
China 1991 1
China 1992 NaN
China 1993 15
Argentina 1991 6
Argentina 1992 4
Argentina 1993 NaN
I also need to create a dataframe that contains only the years that have a value for every country.
Answer 0 (score: 2)
Use DataFrame.set_index together with MultiIndex.from_product and DataFrame.reindex:
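import pandas as pd

# Keep the original df untouched; the filled result is stored in df1
df1 = df.set_index(['Country','Year'])
# Every Country/Year combination that could exist
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
df1 = df1.reindex(mux).reset_index()
print (df1)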
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
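Since the question asks for a function, the reindex steps above can be wrapped in one. This is only a minimal sketch: the name `fill_missing_years` and its `keys` parameter are made up for illustration, and it assumes every Country/Year pair occurs at most once.

```python
import pandas as pd


def fill_missing_years(frame, keys=("Country", "Year")):
    """Return a new frame with one row per key combination (NaN where absent).

    Hypothetical helper; assumes the key combinations in `frame` are unique.
    """
    keys = list(keys)
    indexed = frame.set_index(keys)
    # Cartesian product of the observed values of every key column
    full_index = pd.MultiIndex.from_product(indexed.index.levels, names=keys)
    return indexed.reindex(full_index).reset_index()


df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "China", "China", "Argentina", "Argentina"],
    "Year": [1991, 1992, 1993, 1991, 1993, 1991, 1992],
    "Value": [22, 3, 10, 1, 15, 6, 4],
})
filled = fill_missing_years(df)
print(filled)
```

The input frame is not modified, so the same `df` can be reused for the other approaches.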
To keep only the groups (countries) without missing values:
vals = df1.loc[df1['Value'].isna(), 'Country'].unique()
df2 = df1[~df1['Country'].isin(vals)]
print (df2)
Country Year Value
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
Alternatively, combine DataFrame.unstack with DataFrame.stack:
s = df.set_index(['Country','Year']).unstack()
df1 = s.stack(dropna=False).reset_index()
print (df1)
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
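The handling of the `dropna` argument of `stack` has changed between pandas releases, so it is worth checking the documentation of the installed version. If `stack(dropna=False)` is not usable there, a similar round trip can be sketched with `pivot` and `melt` instead. This is a swapped-in technique, not part of the answer above, and it also assumes unique Country/Year pairs.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "China", "China", "Argentina", "Argentina"],
    "Year": [1991, 1992, 1993, 1991, 1993, 1991, 1992],
    "Value": [22, 3, 10, 1, 15, 6, 4],
})

# Wide table: one row per country, one column per year, NaN where no data exists
wide = df.pivot(index="Country", columns="Year", values="Value")

# Back to long format; melt keeps the NaN cells as rows
long_df = (
    wide.reset_index()
        .melt(id_vars="Country", var_name="Year", value_name="Value")
        .astype({"Year": int})
        .sort_values(["Country", "Year"])
        .reset_index(drop=True)
)
print(long_df)
```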
To keep only the years that have a value for every country, use DataFrame.dropna:
df2 = s.dropna(axis=1).stack().reset_index()
print (df2)
Country Year Value
0 Argentina 1991 6.0
1 China 1991 1.0
2 USA 1991 22.0
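The same year filter can also be sketched directly on the long, already filled frame, mirroring the `isna` / `isin` pattern used above for countries. The sample `df1` below is rebuilt by hand so the snippet runs on its own.

```python
import pandas as pd

nan = float("nan")
# Filled frame, as produced by the reindex or unstack/stack step
df1 = pd.DataFrame({
    "Country": ["Argentina"] * 3 + ["China"] * 3 + ["USA"] * 3,
    "Year": [1991, 1992, 1993] * 3,
    "Value": [6, 4, nan, 1, nan, 15, 22, 3, 10],
})

# Years in which at least one country has no value
incomplete_years = df1.loc[df1["Value"].isna(), "Year"].unique()

# Keep only the rows whose year is complete for every country
df2 = df1[~df1["Year"].isin(incomplete_years)]
print(df2)
```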
EDIT:
If you get:
ValueError: cannot handle a non-unique multi-index!
it means the Country and Year columns contain duplicate combinations:
print (df)
Country Year Value
0 USA 1991 22 <-duplicate USA, 1991
1 USA 1991 3 <-duplicate USA, 1991
2 USA 1993 10
3 China 1991 1
4 China 1993 15
5 Argentina 1991 6
6 Argentina 1992 4
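To see which rows cause the error, `DataFrame.duplicated` with `keep=False` flags every row whose Country/Year pair occurs more than once. A short sketch using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "China", "China", "Argentina", "Argentina"],
    "Year": [1991, 1991, 1993, 1991, 1993, 1991, 1992],
    "Value": [22, 3, 10, 1, 15, 6, 4],
})

# keep=False marks all members of a duplicated group, not only the later ones
dupes = df[df.duplicated(["Country", "Year"], keep=False)]
print(dupes)
```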
The solution is to replace set_index with groupby and apply an aggregate function, e.g. mean or sum, so that each Country/Year combination becomes unique:
df = df.groupby(['Country','Year']).mean()
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux).reset_index()
print (df)
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 12.5
7 USA 1992 NaN
8 USA 1993 10.0