Question

I have a ProductDf that holds many versions of the same product. I want to keep only the last iteration of each product, so I do the following:

productIndexDf = ProductDf.groupby('productId').apply(lambda x: x['startDtTime'].reset_index()).reset_index()

productToPick = productIndexDf.groupby('productId')['index'].max()

I then convert the values of productToPick to a string:

productIndex = productToPick.to_string(header=False, index=False).replace('\n', ' ')
productIndex  = productIndex.split()

productIndex = list(map(int, productIndex))
productIndex.sort()

productIndexStr = ','.join(str(e) for e in productIndex)

Once I have the values, calling iloc by hand with a list of integers works:

filteredProductDf = ProductDf.iloc[[7,8],:]

If I pass the string to it instead, I get an error:

filteredProductDf = ProductDf.iloc[productIndexStr,:]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

I also tried this:

filteredProductDf = ProductDf[productIndexStr]

But then I run into this:

KeyError: '7,8'

Answer 1

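The pandas DataFrame iloc method only accepts integer positions (an integer, an integer slice, a list of integers, or a boolean array). A string such as '7,8' is a single value, not a list of integers, which is why iloc rejects it. If you want to access data by string labels, you have to use the DataFrame loc method instead.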

You can learn more about these methods from these links:

Use of Pandas Dataframe iloc method

Use of Pandas Dataframe loc method
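
To make the distinction concrete, here is a minimal sketch. It assumes a small stand-in frame with a default integer index (the data is made up for illustration and is not the asker's real ProductDf). It parses the '7,8' string back into a list of integers before calling iloc, and also shows that the string round trip can be skipped by calling tolist() on a Series of integers.

```python
import pandas as pd

# Hypothetical stand-in for ProductDf: 9 rows with a default RangeIndex 0..8
ProductDf = pd.DataFrame({
    "productId": ["A001"] * 8 + ["A002"],
    "totalSold": range(9),
})

productIndexStr = "7,8"

# iloc needs integers, so split the string and convert each piece
positions = [int(p) for p in productIndexStr.split(",")]
filteredProductDf = ProductDf.iloc[positions, :]

# If the positions already live in a Series, tolist() avoids the string step
productToPick = pd.Series([7, 8])
sameDf = ProductDf.iloc[productToPick.tolist(), :]
```

Note that this only lines up with the original rows when the index labels equal the row positions, as they do with a default integer index.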

Answer 2

OK, I think you are confused.

Given a dataframe that looks like this:

   avgPrice productId startDtTime  totalSold
0      42.5      A001  01/05/2018        100
1      55.5      A001  02/05/2018        150
2      48.5      A001  03/05/2018        300
3      42.5      A002  01/05/2018        220
4      53.5      A002  02/05/2018        250

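I assume you are interested in rows 2 and 4 (the last entry for each productId). In pandas, the simplest way to get them is drop_duplicates() with the parameter keep='last'. Consider this example: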

import pandas as pd

d = {
    'startDtTime': {0: '01/05/2018', 1: '02/05/2018',
                    2: '03/05/2018', 3: '01/05/2018', 4: '02/05/2018'},
    'totalSold': {0: 100, 1: 150, 2: 300, 3: 220, 4: 250},
    'productId': {0: 'A001', 1: 'A001', 2: 'A001', 3: 'A002', 4: 'A002'},
    'avgPrice': {0: 42.5, 1: 55.5, 2: 48.5, 3: 42.5, 4: 53.5},
}

# Recreate dataframe
ProductDf = pd.DataFrame(d)

# Convert column with dates to datetime objects
ProductDf['startDtTime'] = pd.to_datetime(ProductDf['startDtTime'])

# Sort values by productId and startDtTime to ensure correct order
ProductDf.sort_values(by=['productId','startDtTime'], inplace=True)

# Drop the duplicates
ProductDf.drop_duplicates(['productId'], keep='last', inplace=True)

print(ProductDf)

You get:

   avgPrice productId startDtTime  totalSold
2      48.5      A001  2018-03-05        300
4      53.5      A002  2018-02-05        250
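
If you would rather leave ProductDf untouched, one possible alternative is to look up the label of the latest startDtTime per product and select those rows with loc. This is only a sketch of a different technique (groupby with idxmax), not the approach used above, and the explicit month-first date format is an assumption chosen to match the output shown.

```python
import pandas as pd

# Same sample data as above; the month/day/year format is an assumption
ProductDf = pd.DataFrame({
    "productId": ["A001", "A001", "A001", "A002", "A002"],
    "startDtTime": pd.to_datetime(
        ["01/05/2018", "02/05/2018", "03/05/2018", "01/05/2018", "02/05/2018"],
        format="%m/%d/%Y",
    ),
    "totalSold": [100, 150, 300, 220, 250],
    "avgPrice": [42.5, 55.5, 48.5, 42.5, 53.5],
})

# Index label of the row with the latest startDtTime for each productId
latestLabels = ProductDf.groupby("productId")["startDtTime"].idxmax()

# loc selects by label and returns a new frame; ProductDf keeps all rows
latestDf = ProductDf.loc[latestLabels]
```

Here latestDf holds rows 2 and 4, while ProductDf still contains all five rows, which can be handy if the full history is needed later.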

Passing a string to DataFrame iloc
