Filter and add rows with NaN values

Time: 2019-03-16 18:13:44

Tags: pandas dataframe filter nan

I have a dataframe that looks like this:

Country     Year    Value
USA         1991     22
USA         1992     3
USA         1993     10
China       1991     1
China       1993     15
Argentina   1991     6
Argentina   1992     4

I need a function that finds the years missing for each country and adds a row with a NaN value to the dataframe for each of them:

Country     Year    Value
USA         1991     22
USA         1992     3
USA         1993     10
China       1991     1
China       1992     NaN
China       1993     15
Argentina   1991     6
Argentina   1992     4
Argentina   1993     NaN

I also need to create a dataframe that contains only the values for years in which every country has a value.

1 answer:

Answer 0: (score: 2)

Use DataFrame.set_index together with MultiIndex.from_product to build the full index for DataFrame.reindex:

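import pandas as pd

# df is the dataframe from the question
df = df.set_index(['Country','Year'])
# every combination of the countries and years present in the index
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
# combinations missing from the data become rows filled with NaN
df = df.reindex(mux).reset_index()
print (df)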
     Country  Year  Value
0  Argentina  1991    6.0
1  Argentina  1992    4.0
2  Argentina  1993    NaN
3      China  1991    1.0
4      China  1992    NaN
5      China  1993   15.0
6        USA  1991   22.0
7        USA  1992    3.0
8        USA  1993   10.0
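
Since the question asks for a function, the steps above can be wrapped in a small helper. This is only a sketch: the name `add_missing_years` and the hand-typed sample data are assumptions for illustration, and it expects every (Country, Year) pair to appear at most once.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'China', 'China', 'Argentina', 'Argentina'],
    'Year': [1991, 1992, 1993, 1991, 1993, 1991, 1992],
    'Value': [22, 3, 10, 1, 15, 6, 4],
})


def add_missing_years(frame):
    """Hypothetical helper: one row per (Country, Year) pair, NaN where absent."""
    indexed = frame.set_index(['Country', 'Year'])
    # Cartesian product of all countries and all years seen in the data.
    full = pd.MultiIndex.from_product(indexed.index.levels,
                                      names=indexed.index.names)
    return indexed.reindex(full).reset_index()


filled = add_missing_years(df)
print(filled)
```

Note that only years that occur somewhere in the data are filled in; a year missing for every country would not be added.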

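For groups (countries) with no missing values:

# countries that have at least one NaN after the reindex
vals = df.loc[df['Value'].isna(), 'Country'].unique()
# keep only the rows of countries that are not in that list
df2 = df[~df['Country'].isin(vals)]
print (df2)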
  Country  Year  Value
6     USA  1991   22.0
7     USA  1992    3.0
8     USA  1993   10.0
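
The same filter can probably be written without the intermediate array by using a group-wise transform. The sketch below rebuilds the NaN-filled frame by hand so that it runs on its own; the names `filled`, `has_all_years` and `complete` are illustrative only.

```python
import numpy as np
import pandas as pd

# The reindexed frame from above, typed out so the example is self-contained.
filled = pd.DataFrame({
    'Country': ['Argentina'] * 3 + ['China'] * 3 + ['USA'] * 3,
    'Year': [1991, 1992, 1993] * 3,
    'Value': [6, 4, np.nan, 1, np.nan, 15, 22, 3, 10],
})

# True on every row whose country has a non-NaN value in all of its years.
has_all_years = filled['Value'].notna().groupby(filled['Country']).transform('all')
complete = filled[has_all_years]
print(complete)
```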

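Or use DataFrame.unstack together with DataFrame.stack:

# wide table: one row per country, one column per year
s = df.set_index(['Country','Year']).unstack()
# back to long format, keeping the NaN cells
df1 = s.stack(dropna=False).reset_index()
print (df1)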
     Country  Year  Value
0  Argentina  1991    6.0
1  Argentina  1992    4.0
2  Argentina  1993    NaN
3      China  1991    1.0
4      China  1992    NaN
5      China  1993   15.0
6        USA  1991   22.0
7        USA  1992    3.0
8        USA  1993   10.0
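
As far as I know, newer pandas releases (2.1 and later) warn that the old `stack` implementation, including its `dropna` argument, is deprecated. If that is a concern, one possible alternative is `pivot` followed by `melt`, which keeps the NaN cells. This is a swapped-in technique rather than the approach above, and the sample data and variable names are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'China', 'China', 'Argentina', 'Argentina'],
    'Year': [1991, 1992, 1993, 1991, 1993, 1991, 1992],
    'Value': [22, 3, 10, 1, 15, 6, 4],
})

# Wide table: one row per country, one column per year, NaN where absent.
wide = df.pivot(index='Country', columns='Year', values='Value')

# melt keeps NaN cells; ignore_index=False preserves the Country index.
long = (wide.melt(ignore_index=False, value_name='Value')
            .reset_index()
            .sort_values(['Country', 'Year'], ignore_index=True))
print(long)
```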

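To keep only the years that have a value for every country, use DataFrame.dropna:

# drop every year column that contains a NaN, then return to long format
df2 = s.dropna(axis=1).stack().reset_index()
print (df2)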
     Country  Year  Value
0  Argentina  1991    6.0
1      China  1991    1.0
2        USA  1991   22.0
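
If only this second result is needed, it can also be computed from the original long data without reshaping. The sketch below assumes that a missing year is simply an absent row (not a row holding NaN); the names `countries_per_year` and `common_years` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'China', 'China', 'Argentina', 'Argentina'],
    'Year': [1991, 1992, 1993, 1991, 1993, 1991, 1992],
    'Value': [22, 3, 10, 1, 15, 6, 4],
})

n_countries = df['Country'].nunique()
# For each row, the number of distinct countries that have a row in that year.
countries_per_year = df.groupby('Year')['Country'].transform('nunique')
# Keep only the years in which every country appears.
common_years = df[countries_per_year == n_countries]
print(common_years)
```

Here the original row order and integer dtype are preserved, because no NaN values are ever introduced.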

EDIT:

If you get:

ValueError: cannot handle a non-unique multi-index!

it means that the combinations of the Country and Year columns are not unique:

print (df)
     Country  Year  Value
0        USA  1991     22 <-duplicate USA, 1991
1        USA  1991      3 <-duplicate USA, 1991
2        USA  1993     10
3      China  1991      1
4      China  1993     15
5  Argentina  1991      6
6  Argentina  1992      4

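The solution is to replace set_index with groupby and an aggregate function, e.g. mean or sum, so that each combination is unique:

# average the duplicated (Country, Year) rows into a single row
df = df.groupby(['Country','Year']).mean()
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux).reset_index()
print (df)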
     Country  Year  Value
0  Argentina  1991    6.0
1  Argentina  1992    4.0
2  Argentina  1993    NaN
3      China  1991    1.0
4      China  1992    NaN
5      China  1993   15.0
6        USA  1991   12.5
7        USA  1992    NaN
8        USA  1993   10.0
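
Before choosing an aggregate, it may help to look at which rows are actually duplicated. The sketch below lists them with `DataFrame.duplicated` and then uses sum instead of mean; the sample data is the duplicated frame shown above, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Sample data with a duplicated (USA, 1991) pair, as in the edit above.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'China', 'China', 'Argentina', 'Argentina'],
    'Year': [1991, 1991, 1993, 1991, 1993, 1991, 1992],
    'Value': [22, 3, 10, 1, 15, 6, 4],
})

# keep=False marks every member of a duplicated (Country, Year) pair.
dupes = df[df.duplicated(['Country', 'Year'], keep=False)]
print(dupes)

# Combine duplicates by summing, then add the missing combinations as NaN.
agg = df.groupby(['Country', 'Year'])['Value'].sum()
mux = pd.MultiIndex.from_product(agg.index.levels, names=agg.index.names)
out = agg.reindex(mux).reset_index()
print(out)
```

Whether mean, sum, or another aggregate is appropriate depends on what the duplicated rows represent in the real data.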