pd.crosstab keeps raising AssertionError

Time: 2018-02-18 20:01:57

Tags: python pandas knn

When I use pd.crosstab to build a confusion matrix, it keeps raising

AssertionError: arrays and names must have the same length

import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import random

df = pd.read_csv('C:\\Users\\liukevin\\Desktop\\winequality-red.csv',sep=';', usecols=['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality'])

Q=[]

for i in range(len(df)):
    if df['quality'][i]<=5:
        Q.append('Low')
    else:
        Q.append('High')

del df['quality']
test_number=sorted(random.sample(range(len(df)), int(len(df)*0.25)))
train_number=[]
temp=[]
for i in range(len(df)):
    temp.append(i)
train_number=list(set(temp)-set(test_number))

distance_all=[]
for i in range(len(test_number)):
    distance_sep=[]
    for j in range(len(train_number)):
        distance=pow(df['fixed acidity'][test_number[i]]-df['fixed acidity'][train_number[j]],2)+\
        pow(df['volatile acidity'][test_number[i]]-df['volatile acidity'][train_number[j]],2)+\
        pow(df['citric acid'][test_number[i]]-df['citric acid'][train_number[j]],2)+\
        pow(df['residual sugar'][test_number[i]]-df['residual sugar'][train_number[j]],2)+\
        pow(df['chlorides'][test_number[i]]-df['chlorides'][train_number[j]],2)+\
        pow(df['free sulfur dioxide'][test_number[i]]-df['free sulfur dioxide'][train_number[j]],2)+\
        pow(df['total sulfur dioxide'][test_number[i]]-df['total sulfur dioxide'][train_number[j]],2)+\
        pow(df['density'][test_number[i]]-df['density'][train_number[j]],2)+\
        pow(df['pH'][test_number[i]]-df['pH'][train_number[j]],2)+\
        pow(df['sulphates'][test_number[i]]-df['sulphates'][train_number[j]],2)+\
        pow(df['alcohol'][test_number[i]]-df['alcohol'][train_number[j]],2)
        distance_sep.append(distance)
    distance_all.append(distance_sep)

for round in range(5):
    K=2*round+1

    select_neighbor_all=[]
    for i in range(len(test_number)):
        select_neighbor_sep=np.argsort(distance_all[i])[:K]
        select_neighbor_all.append(select_neighbor_sep)

    prediction=[]
    Q_test=[]
    for i in range(len(test_number)):
        Q_test.append(Q[test_number[i]])
        #original data
        Low_count=0
        for j in range(K):
            if Q[train_number[select_neighbor_all[i][j]]]=='Low':
                Low_count+=1
        if Low_count>(K/2):
            prediction.append('Low')
        else:
            prediction.append('High')

    print(pd.crosstab(Q_test, prediction, rownames=['Actual'], colnames=['Predicted'], margins=True))

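Are Q_test and prediction not the same length? I suspect the problem is the "names must have the same length" part, because I am not sure what it means. (Q_test and prediction contain only the two values 'Low' and 'High'.) select_neighbor_all holds the K nearest neighbors I selected for the i-th test sample.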

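2 answers:

Answer 0: (score: 0)

You may not be giving pd.crosstab all the data it needs to perform the computation:

Look at this example. Here we pass an index and two column categories, as well as rownames and colnames: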

>>> index = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...                   "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> col_category_1 = np.array(["one", "one", "one", "two", "one", "one",
...                            "one", "two", "two", "two", "one"], dtype=object)
>>> col_category_2 = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...                            "shiny", "dull", "shiny", "shiny", "shiny"],
...                            dtype=object)


# Notice the index AND the columns provided as a list
>>> pd.crosstab(index, [col_category_1, col_category_2],
...             rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

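For details, see the documentation for pd.crosstab:

index : array-like, Series, or list of arrays/Series
    Values to group by in the rows

columns : array-like, Series, or list of arrays/Series
    Values to group by in the columns

rownames : sequence, default None
    If passed, must match number of row arrays passed

colnames : sequence, default None
    If passed, must match number of column arrays passed

If you edit the following line so that it passes the correct inputs, your problem should be solved...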

# You will need to provide an index and columns...
# Here, 'Q_test' is being interpreted as your index
# 'prediction' is being used as a column... 
pd.crosstab(Q_test, prediction, 
            rownames=['Actual'], 
            colnames=['Predicted'],
            margins=True)
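
The rule quoted from the documentation (one name per array passed) can be sketched with a small made-up data set. The labels and the `batch` array below are hypothetical and only stand in for the question's data; the inputs are NumPy arrays so the sketch does not depend on how bare lists are handled.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration only (not the wine data from the question).
actual = np.array(['Low', 'High', 'Low', 'High', 'Low'])
predicted = np.array(['Low', 'High', 'High', 'High', 'Low'])
batch = np.array(['a', 'a', 'b', 'b', 'b'])  # hypothetical second row key

# One row array and one column array: one name each.
one = pd.crosstab(actual, predicted,
                  rownames=['Actual'], colnames=['Predicted'])

# Two row arrays passed as a list: rownames needs two entries.
two = pd.crosstab([batch, actual], predicted,
                  rownames=['Batch', 'Actual'], colnames=['Predicted'])

print(one)
print(two)
```

If the number of entries in rownames or colnames does not match the number of arrays crosstab believes it received, the assertion from the question is the likely result.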

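Answer 1: (score: 0)

I just spent some time on this problem. In my case, pandas crosstab did not seem to work with plain lists.

If you convert the lists to numpy arrays, it should work fine.

So in your case it would be: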

pd.crosstab(np.array(Q_test), np.array(prediction), rownames=['Actual'],
            colnames=['Predicted'], margins=True)
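
A possible variant of the same idea, shown as a sketch rather than the answerer's method: wrap the lists in named pd.Series instead of NumPy arrays. The short lists below are made-up stand-ins for the question's Q_test and prediction, and the explicit length check addresses the question's worry about mismatched lengths.

```python
import pandas as pd

# Made-up stand-ins for the question's Q_test and prediction lists.
Q_test = ['Low', 'High', 'Low', 'High', 'Low']
prediction = ['Low', 'High', 'High', 'High', 'Low']

# Rule out the length mismatch the question asks about.
assert len(Q_test) == len(prediction)

# Naming each Series lets crosstab pick up the axis labels,
# so rownames/colnames are not needed here.
table = pd.crosstab(pd.Series(Q_test, name='Actual'),
                    pd.Series(prediction, name='Predicted'),
                    margins=True)
print(table)
```

With margins=True the table gains an 'All' row and column holding the totals.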

An example:

>>> import pandas as pd
>>> import numpy as np
>>> classifications = ['foo', 'bar', 'foo', 'bar']
>>> predictions = ['foo', 'foo', 'bar', 'bar']
>>> pd.crosstab(classifications, predictions, rownames=['Actual'], colnames=['Predicted'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 563, in crosstab
    rownames = _get_names(index, rownames, prefix="row")
  File "/home/bastian/miniconda3/envs/machine_learning/lib/python3.6/site-packages/pandas/core/reshape/pivot.py", line 703, in _get_names
    raise AssertionError("arrays and names must have the same length")
AssertionError: arrays and names must have the same length
>>> pd.crosstab(np.array(classifications), np.array(predictions), rownames=['Actual'], colnames=['Predicted'])
Predicted  bar  foo
Actual             
bar          1    1
foo          1    1

I think this happens because some operations, such as multiplication, behave differently on lists than on numpy arrays.
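
The general point about lists and NumPy arrays behaving differently can be illustrated with a short sketch. It only demonstrates the difference in multiplication; it does not confirm that this is the actual cause inside pd.crosstab.

```python
import numpy as np

labels = ['Low', 'High']

# Multiplying a list repeats its elements.
as_list = labels * 2

# Multiplying a NumPy array scales each element.
as_array = np.array([1, 2]) * 2

print(as_list)   # ['Low', 'High', 'Low', 'High']
print(as_array)  # [2 4]
```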