Why does setting an index cause a TypeError in a pandas DataFrame groupby with agg?

Asked: 2017-07-28 16:59:41

Tags: python pandas typeerror

I am building an aggregation with the following code:

import numpy
import pandas

orders = pandas.read_csv(
    "orders.csv",
    dtype={
        "order_id": numpy.int32,
        "user_id": numpy.int32,
        "eval_set": "category",
        "order_number": numpy.int8,
        "order_dow": numpy.int8,
        "order_hour_of_day": numpy.int8,
        "days_since_prior_order": numpy.float64
    }
)

orders.set_index('order_id', inplace=True, drop=False)

prior_order_products = pandas.read_csv(
    "order_products__prior.csv",
    dtype={
        "order_id": numpy.int32,
        "product_id": numpy.int32,
        "add_to_cart_order": numpy.int16,
        "reordered": numpy.int8
    }
)

prior_order_products.set_index(['order_id', 'product_id'], inplace=True, drop=False)

prior_order_products = prior_order_products.join(orders, how="inner", on='order_id', rsuffix='_')
prior_order_products.drop('order_id_', inplace=True, axis=1)

del orders

prior_order_products['user_product_id'] =\
    100000 * prior_order_products["user_id"].astype(numpy.int64) + prior_order_products["product_id"]

user_products = prior_order_products.\
    groupby('user_product_id', sort=False).\
    agg({'order_id': ['size', 'last'], 'add_to_cart_order': 'sum'})

It fails with the following error:

Traceback (most recent call last):
  File "C:/Users/Strategy/PycharmProjects/Test/Main.py", line 52, in <module>
    agg({'order_id': ['size', 'last'], 'add_to_cart_order': 'sum'})
  ...
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'

The error goes away if I comment out this line:

prior_order_products.set_index(['order_id', 'product_id'], inplace=True, drop=False)
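For context, here is a minimal sketch of what that line does to the frame. With drop=False, order_id and product_id end up as both index levels and ordinary columns. The tiny frame below is made-up illustration data, not the real file. Whether this overlap is what actually triggers the TypeError is an assumption I have not confirmed.

```python
import pandas as pd

# Hypothetical miniature stand-in for prior_order_products.
df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [100, 200, 100],
    "add_to_cart_order": [1, 2, 1],
})

# Same call as in the question, written without inplace=True.
df = df.set_index(["order_id", "product_id"], drop=False)

# Names that now exist both as an index level and as a column.
overlap = set(df.index.names) & set(df.columns)
print(sorted(overlap))
```

Any later lookup by the name 'order_id' (in join, groupby, or agg) could in principle resolve to either the index level or the column, which is why the overlap seems worth checking.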

The error also goes away if I limit the number of rows read into prior_order_products. The file is well formed: no data is missing or malformed.
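One way to limit the rows is the nrows parameter of pandas.read_csv. The sketch below uses an in-memory CSV with made-up values so that it runs on its own. The row count of 2 is arbitrary and only illustrates the call.

```python
import io

import numpy
import pandas

# Hypothetical in-memory stand-in for order_products__prior.csv.
csv_text = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n"
    "1,100,1,0\n"
    "1,200,2,1\n"
    "2,100,1,1\n"
)

# nrows caps how many data rows are parsed (the header is not counted).
limited = pandas.read_csv(
    csv_text,
    dtype={
        "order_id": numpy.int32,
        "product_id": numpy.int32,
        "add_to_cart_order": numpy.int16,
        "reordered": numpy.int8,
    },
    nrows=2,
)
print(len(limited))
```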

What exactly does the error mean? How is it related to the index on prior_order_products? And how is it related to the number of rows in prior_order_products?
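For comparison, here is a minimal sketch of the same aggregation with the keys kept as plain columns: no set_index, and merge in place of join. The data is made up and tiny. This only shows that the aggregation itself works when no name is shared between the index and the columns. It does not explain the TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the two files.
orders = pd.DataFrame({
    "order_id": np.array([1, 2, 3], dtype=np.int32),
    "user_id": np.array([10, 10, 20], dtype=np.int32),
})
prior = pd.DataFrame({
    "order_id": np.array([1, 1, 2, 3], dtype=np.int32),
    "product_id": np.array([100, 200, 100, 100], dtype=np.int32),
    "add_to_cart_order": np.array([1, 2, 1, 1], dtype=np.int16),
})

# Join on the column only; the default RangeIndex is left untouched.
merged = prior.merge(orders, on="order_id", how="inner")

# Same composite key as in the question, widened to int64 before multiplying.
merged["user_product_id"] = (
    100000 * merged["user_id"].astype(np.int64) + merged["product_id"]
)

user_products = merged.groupby("user_product_id", sort=False).agg(
    {"order_id": ["size", "last"], "add_to_cart_order": "sum"}
)
print(user_products)
```

The result has one row per user_product_id and MultiIndex columns ('order_id', 'size'), ('order_id', 'last') and ('add_to_cart_order', 'sum').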

0 answers:

No answers yet.