When I use pd.crosstab to build a confusion matrix, it keeps raising:
AssertionError: arrays and names must have the same length
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import random
df = pd.read_csv('C:\\Users\\liukevin\\Desktop\\winequality-red.csv',sep=';', usecols=['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality'])
Q=[]
for i in range(len(df)):
    if df['quality'][i]<=5:
        Q.append('Low')
    else:
        Q.append('High')
del df['quality']
test_number=sorted(random.sample(xrange(len(df)), int(len(df)*0.25)))
train_number=[]
temp=[]
for i in range(len(df)):
    temp.append(i)
train_number=list(set(temp)-set(test_number))
distance_all=[]
for i in range(len(test_number)):
    distance_sep=[]
    for j in range(len(train_number)):
        distance=pow(df['fixed acidity'][test_number[i]]-df['fixed acidity'][train_number[j]],2)+\
                 pow(df['volatile acidity'][test_number[i]]-df['volatile acidity'][train_number[j]],2)+\
                 pow(df['citric acid'][test_number[i]]-df['citric acid'][train_number[j]],2)+\
                 pow(df['residual sugar'][test_number[i]]-df['residual sugar'][train_number[j]],2)+\
                 pow(df['chlorides'][test_number[i]]-df['chlorides'][train_number[j]],2)+\
                 pow(df['free sulfur dioxide'][test_number[i]]-df['free sulfur dioxide'][train_number[j]],2)+\
                 pow(df['total sulfur dioxide'][test_number[i]]-df['total sulfur dioxide'][train_number[j]],2)+\
                 pow(df['density'][test_number[i]]-df['density'][train_number[j]],2)+\
                 pow(df['pH'][test_number[i]]-df['pH'][train_number[j]],2)+\
                 pow(df['sulphates'][test_number[i]]-df['sulphates'][train_number[j]],2)+\
                 pow(df['alcohol'][test_number[i]]-df['alcohol'][train_number[j]],2)
        distance_sep.append(distance)
    distance_all.append(distance_sep)
for round in range(5):
    K=2*round+1
    select_neighbor_all=[]
    for i in range(len(test_number)):
        select_neighbor_sep=np.argsort(distance_all[i])[:K]
        select_neighbor_all.append(select_neighbor_sep)
    prediction=[]
    Q_test=[]
    for i in range(len(test_number)):
        Q_test.append(Q[test_number[i]])
        #original data
        Low_count=0
        for j in range(K):
            if Q[train_number[select_neighbor_all[i][j]]]=='Low':
                Low_count+=1
        if Low_count>(K/2):
            prediction.append('Low')
        else:
            prediction.append('High')
    print pd.crosstab(Q_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True)
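But aren't Q_test and prediction the same length?
I suspect the problem lies in the "names" part of "arrays and names must have the same length", because I'm not sure what it means.
(Q_test and prediction contain only two possible values, 'Low' and 'High'.)
select_neighbor_all holds the K nearest neighbors I selected for the i-th test sample.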
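Answer 0 (score: 0)
You may not be giving pd.crosstab all the data it needs to perform the computation.
Take a look at this example. Here we provide an index and two column categories, AND rownames and colnames: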
>>> index = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
... "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> col_category_1 = np.array(["one", "one", "one", "two", "one", "one",
... "one", "two", "two", "two", "one"], dtype=object)
>>> col_category_2 = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
... "shiny", "dull", "shiny", "shiny", "shiny"],
... dtype=object)
# Notice the index AND the columns provided as a list
>>> pd.crosstab(index, [col_category_1, col_category_2],
...             rownames=['a'], colnames=['b', 'c'])
b    one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2
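For details, see the documentation for pd.crosstab:
index: array-like, Series, or list of arrays/Series. Values to group by in the rows.
columns: array-like, Series, or list of arrays/Series. Values to group by in the columns.
rownames: sequence, default None. If passed, must match the number of row arrays passed.
colnames: sequence, default None. If passed, must match the number of column arrays passed.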
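To make the rownames/colnames rule concrete, here is a minimal sketch using small made-up arrays (`region`, `year`, and `label` are hypothetical names, not taken from the question). It passes two row arrays with two row names, then repeats the call with only one row name to show the mismatch. Catching `ValueError` alongside `AssertionError` is only a precaution, in case the exception type differs between pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two row arrays and one column array of equal length.
region = np.array(["north", "north", "south", "south"], dtype=object)
year = np.array(["2019", "2020", "2019", "2020"], dtype=object)
label = np.array(["Low", "High", "Low", "Low"], dtype=object)

# Two row arrays -> two row names; one column array -> one column name.
table = pd.crosstab([region, year], label,
                    rownames=["region", "year"], colnames=["label"])
print(table)

# Two row arrays but only one row name: the counts no longer match.
mismatch_error = None
try:
    pd.crosstab([region, year], label,
                rownames=["region"], colnames=["label"])
except (AssertionError, ValueError) as exc:
    mismatch_error = exc
print(repr(mismatch_error))
```

The point of the sketch is that the names are counted against the number of arrays, not against the number of elements inside each array.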
If you edit the following line so that it receives the correct inputs, that should solve your problem...
# You will need to provide an index and columns...
# Here, 'Q_test' is being interpreted as your index
# 'prediction' is being used as a column...
pd.crosstab(Q_test, prediction,
            rownames=['Actual'],
            colnames=['Predicted'],
            margins=True)
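Answer 1 (score: 0)
I just spent some time working through this problem. In my case, pandas crosstab does not seem to work with plain lists.
If you convert the lists to numpy arrays, it should work fine.
So in your case it should be: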
pd.crosstab(np.array(Q_test), np.array(prediction), rownames=['Actual'],
colnames=['Predicted'], margins=True)
An example:
>>> import pandas as pd
>>> import numpy as np
>>> classifications = ['foo', 'bar', 'foo', 'bar']
>>> predictions = ['foo', 'foo', 'bar', 'bar']
>>> pd.crosstab(classifications, predictions, rownames=['Actual'], colnames=['Predicted'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 563, in crosstab
rownames = _get_names(index, rownames, prefix="row")
File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 703, in _get_names
raise AssertionError("arrays and names must have the same length")
AssertionError: arrays and names must have the same length
>>> pd.crosstab(np.array(classifications), np.array(predictions), rownames=['Actual'], colnames=['Predicted'])
Predicted  bar  foo
Actual
bar          1    1
foo          1    1
I think this happens because some operations, such as multiplication, behave differently on lists than on numpy arrays.
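One way to sidestep the ambiguity is to hand pd.crosstab objects that are unmistakably a single array each. The sketch below reuses the toy lists from the example above and wraps them with `np.array`. Whether plain lists are accepted may depend on the pandas version, so treat that as an assumption rather than a rule; the explicit length check is an addition for illustration.

```python
import numpy as np
import pandas as pd

classifications = ["foo", "bar", "foo", "bar"]
predictions = ["foo", "foo", "bar", "bar"]

# Each np.array is one 1-D array, so a single row name and a single
# column name match the number of arrays passed.
actual = np.array(classifications)
predicted = np.array(predictions)

# Sanity check: both label sequences must cover the same samples.
assert len(actual) == len(predicted)

confusion = pd.crosstab(actual, predicted,
                        rownames=["Actual"], colnames=["Predicted"],
                        margins=True)
print(confusion)
```

With `margins=True` the table gains an `All` row and column holding the totals, which is handy for checking that every test sample was counted.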