I have two similar data frames. This is the input CSV data:
Document_ID OFFSET PredictedFeature
0 0 2000
0 8 2000
0 16 2200
0 23 2200
0 30 2200
1 0 2100
1 5 2100
1 7 2100
And this is the output data:
Document_ID OFFSET PredictedFeature
0 0 2000
0 8 2100
0 16 2100
0 23 2100
0 30 2200
1 0 2000
1 5 2000
1 7 2100
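What I want to do here is check whether each predicted result matches (that is, whether the expected value was obtained).
So I did this: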
df1_inputPredictedFeature_column['new'] = df1_inputPredictedFeature_column['PredictedFeature'] == df1_predictedFeature_column['PredictedFeature']
This adds a column that tells whether each row matches the predicted feature column.
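As a minimal sketch of that comparison on the sample data (the names df_in and df_out are hypothetical stand-ins for the two data frames, and the rows are assumed to be in the same order with the same index):

```python
import pandas as pd

# PredictedFeature values copied from the input and output tables above.
df_in = pd.DataFrame({"PredictedFeature": [2000, 2000, 2200, 2200, 2200, 2100, 2100, 2100]})
df_out = pd.DataFrame({"PredictedFeature": [2000, 2100, 2100, 2100, 2200, 2000, 2000, 2100]})

# Element-wise comparison; both Series share the default 0..7 index,
# so row i of the input is compared with row i of the output.
df_in["new"] = df_in["PredictedFeature"] == df_out["PredictedFeature"]

matches = df_in["new"].tolist()
print(matches)
```

Only rows 0, 4 and 7 compare equal here. If the two Series did not carry identical index labels, pandas would refuse the comparison instead of aligning the rows silently.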
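What I am trying to do now is this:
There are 2 rows in total where the predicted feature in the input CSV is 2000, but in the output CSV only the first one matches, not the second.
So I am trying to get data like this: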
predictedFeatureClass inputCsvOccured outputcsvmatched
2000 2 1
2200 3 1
So how can I get this data? Any help would be great.
Answer 0 (score: 0)
One idea is to convert the boolean new column to integers, then aggregate it with size and sum, passing a list of tuples to specify the new column names:
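# Cast the boolean comparison to integers (Series.view is deprecated in recent pandas, so astype is used).
df1['new'] = (df1['PredictedFeature'] == df2['PredictedFeature']).astype(int)
# size counts the rows per class, sum counts the matching rows.
df = (df1.groupby("PredictedFeature")['new']
         .agg([('inputCsvOccured', 'size'), ('outputcsvmatched', 'sum')])
         .reset_index())
print(df)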
   PredictedFeature  inputCsvOccured  outputcsvmatched
0              2000                2                 1
1              2100                3                 1
2              2200                3                 1
Solution for pandas 0.25+:
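# Cast the boolean comparison to integers (Series.view is deprecated in recent pandas, so astype is used).
df1['new'] = (df1['PredictedFeature'] == df2['PredictedFeature']).astype(int)
# Named aggregation: each keyword becomes an output column name.
df = (df1.groupby("PredictedFeature")
         .agg(inputCsvOccured=pd.NamedAgg(column='new', aggfunc='size'),
              outputcsvmatched=pd.NamedAgg(column='new', aggfunc='sum'))
         .reset_index())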
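For anyone who wants to run the named-aggregation variant end to end, here is a self-contained sketch. It assumes df1 and df2 hold the PredictedFeature values from the question's two tables, and it uses the plain (column, aggfunc) tuple shorthand, which pandas accepts in place of pd.NamedAgg:

```python
import pandas as pd

# Assumed sample data: PredictedFeature values from the question's tables.
df1 = pd.DataFrame({"PredictedFeature": [2000, 2000, 2200, 2200, 2200, 2100, 2100, 2100]})
df2 = pd.DataFrame({"PredictedFeature": [2000, 2100, 2100, 2100, 2200, 2000, 2000, 2100]})

# 1 where the input and output predictions agree, 0 otherwise.
df1["new"] = (df1["PredictedFeature"] == df2["PredictedFeature"]).astype(int)

# Tuple shorthand for named aggregation: output_name=(column, aggfunc).
summary = (
    df1.groupby("PredictedFeature")
       .agg(inputCsvOccured=("new", "size"),
            outputcsvmatched=("new", "sum"))
       .reset_index()
)
print(summary)
```

The groups come out sorted by PredictedFeature, so the rows follow the order 2000, 2100, 2200.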
Answer 1 (score: 0)
You can do this with groupby:
import pandas as pd

# Sample data from the question (values kept as strings).
df1_inputPredictedFeature_column = pd.DataFrame([['0', '0', '2000'], ['0', '8', '2000'], ['0', '16', '2200'], ['0', '23', '2200'], ['0', '30', '2200'], ['1', '0', '2100'], ['1', '5', '2100'], ['1', '7', '2100']], columns=('Document_ID', 'OFFSET', 'PredictedFeature'))
df1_predictedFeature_column = pd.DataFrame([['0', '0', '2000'], ['0', '8', '2100'], ['0', '16', '2100'], ['0', '23', '2100'], ['0', '30', '2200'], ['1', '0', '2000'], ['1', '5', '2000'], ['1', '7', '2100']], columns=('Document_ID', 'OFFSET', 'PredictedFeature'))
# np.int was removed from NumPy; the built-in int does the same cast.
df1_inputPredictedFeature_column['new'] = (df1_inputPredictedFeature_column['PredictedFeature'] == df1_predictedFeature_column['PredictedFeature']).astype(int)
# count gives the rows per class, sum gives the number of matches.
result = df1_inputPredictedFeature_column.groupby("PredictedFeature").agg({"PredictedFeature": "count", "new": "sum"})
result.columns = ["inputCsvOccured", "outputcsvmatched"]
result.index.name = "predictedFeatureClass"
result.reset_index(inplace=True)
print(result)
Result:
  predictedFeatureClass  inputCsvOccured  outputcsvmatched
0                  2000                2                 1
1                  2100                3                 1
2                  2200                3                 1
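Both approaches above compare the two frames row by row, so they rely on the rows being in the same order. If that is not guaranteed, one possible variant is to align the rows on Document_ID and OFFSET with a merge first. This is only a sketch under that assumption: the names df_in, df_out, merged and matched are hypothetical, and the keys are assumed to be unique in each frame.

```python
import pandas as pd

cols = ["Document_ID", "OFFSET", "PredictedFeature"]
df_in = pd.DataFrame(
    [[0, 0, 2000], [0, 8, 2000], [0, 16, 2200], [0, 23, 2200],
     [0, 30, 2200], [1, 0, 2100], [1, 5, 2100], [1, 7, 2100]], columns=cols)
df_out = pd.DataFrame(
    [[0, 0, 2000], [0, 8, 2100], [0, 16, 2100], [0, 23, 2100],
     [0, 30, 2200], [1, 0, 2000], [1, 5, 2000], [1, 7, 2100]], columns=cols)

# Reverse the output rows to show that row order no longer matters.
df_out = df_out.iloc[::-1].reset_index(drop=True)

# Pair each input row with the output row that has the same key.
merged = df_in.merge(df_out, on=["Document_ID", "OFFSET"], suffixes=("_in", "_out"))
merged["matched"] = (
    merged["PredictedFeature_in"] == merged["PredictedFeature_out"]
).astype(int)

result = (
    merged.groupby("PredictedFeature_in")
          .agg(inputCsvOccured=("matched", "size"),
               outputcsvmatched=("matched", "sum"))
          .rename_axis("predictedFeatureClass")
          .reset_index()
)
print(result)
```

The default inner merge drops rows whose key is missing from either frame, so how="left" may be preferable if every input row has to be counted.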