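I have the data set up below, and I am trying to determine each customer's type by assigning a label. Excel crashed when I tried this because there is too much data, so I am trying to do it in Python instead.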
item customer qty
------------------
ProdA CustA 1
ProdA CustB 1
ProdA CustC 1
ProdA CustD 1
ProdB CustA 1
ProdB CustB 1
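For anyone who wants to reproduce this, the sample above can be loaded into pandas roughly as follows. This is only a sketch: the variable name `df` is an assumption, chosen because the answers refer to a frame by that name.

```python
import pandas as pd

# Sample rows copied from the table above; `df` is an assumed variable name.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB'],
    'qty': [1, 1, 1, 1, 1, 1],
})
print(df)
```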
In Excel, I would:
1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF Customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")
customer ProdA ProdB Type
--------------------------
CustA 1 1 Both
CustB 1 1 Both
CustC 1 0 One
CustD 1 0 One
Answer 0 (score: 2)
We can use pd.crosstab to count the rows per customer and product, then sum the ProdA and ProdB columns and use Series.map to translate 2 -> Both and 1 -> One:
import numpy as np
import pandas as pd
dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})
Or we can use np.where in the last line to conditionally assign Both or One:
dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')
item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
We can also make fuller use of pd.crosstab's parameters by passing margins=True, which adds the row totals for us:
dfn = pd.crosstab(df['customer'], df['item'],
                  margins=True,
                  margins_name='Type').iloc[:-1].reset_index()
dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})
item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
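Note that the variants above rely on each customer/item pair appearing at most once, since pd.crosstab counts rows. If a customer can have several rows for the same product, a count of 2 for ProdA alone would be mapped to Both. A minimal sketch of one way around this, assuming the goal is "how many distinct products did the customer buy", is to test the counts with gt(0) before summing. The sample frame and the names `counts` and `n_products` are made up for illustration.

```python
import pandas as pd

# Made-up sample: same data as the question, plus a second ProdA row for CustC.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB', 'ProdA'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB', 'CustC'],
    'qty': [1, 1, 1, 1, 1, 1, 1],
})

# Row counts per customer and product; CustC/ProdA is 2 here.
counts = pd.crosstab(df['customer'], df['item'])

# Number of distinct products per customer, regardless of how many rows each has.
n_products = counts.gt(0).sum(axis=1)

# Assign the label while the index is still `customer`, then flatten.
counts['Type'] = n_products.map({2: 'Both', 1: 'One'})
dfn = counts.reset_index()
print(dfn)
```

With this change CustC is still labelled One even though its ProdA count is 2.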
Answer 1 (score: 2)
Try using set_index, unstack, and np.select:
df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
# An explicit string default keeps every choice the same dtype and covers a sum of 0.
df_out['Type'] = np.select([SumProd == 2, SumProd == 1], ['Both', 'One'], default='None')
print(df_out)
Output:
item      ProdA  ProdB  Type
customer
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
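One caveat: unstack raises a ValueError when the same (customer, item) pair occurs more than once, because the index then contains duplicate entries. A minimal sketch of a workaround, assuming the quantities of repeated pairs should be added together, is to use pivot_table with aggfunc='sum' instead. The sample frame, including the repeated CustC row, and the name `n_products` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: the question's data plus a repeated (CustC, ProdA) row.
df = pd.DataFrame({
    'item': ['ProdA', 'ProdA', 'ProdA', 'ProdA', 'ProdB', 'ProdB', 'ProdA'],
    'customer': ['CustA', 'CustB', 'CustC', 'CustD', 'CustA', 'CustB', 'CustC'],
    'qty': [1, 1, 1, 1, 1, 1, 3],
})

# pivot_table aggregates duplicates instead of failing on them.
df_out = df.pivot_table(index='customer', columns='item', values='qty',
                        aggfunc='sum', fill_value=0)

# Count products with a positive total quantity, then label.
n_products = (df_out[['ProdA', 'ProdB']] > 0).sum(axis=1)
df_out['Type'] = np.select([n_products == 2, n_products == 1],
                           ['Both', 'One'], default='None')
print(df_out)
```

Counting positive totals rather than adding the quantities keeps the label correct when a quantity is larger than 1.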
Answer 2 (score: 0)
In addition to the other suggestions, you can also skip pandas entirely:
################################################################################
## Data ingestion
################################################################################
import csv
import io  # Python 3: io.StringIO replaces the Python 2 StringIO module.

# Formatted to make the example more straightforward.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

# Each record is a dict keyed by the CSV header: item, customer, qty.
records = list(csv.DictReader(input_data))

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer.
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}
for r in records:
    customer_id = r['customer']
    if customer_id not in customer_data:
        customer_data[customer_id] = {}
    customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type.
for c_data in customer_data.values():
    missing_product = products.difference(c_data.keys())
    if missing_product:
        # Record a zero quantity for every product the customer did not buy.
        for missing_p in missing_product:
            c_data[missing_p] = 0
        c_data['type'] = 'One'
    else:
        c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
# Use a fixed column order so the header and every row line up.
columns = sorted(products) + ['type']
print('\t'.join(['ID'] + columns))
for c, c_data in customer_data.items():
    print('\t'.join([c] + [str(c_data[col]) for col in columns]))
On Python 3.7 or later, where dicts preserve insertion order, this prints:
ID     ProdA  ProdB  type
CustA  1      1      Both
CustB  1      1      Both
CustC  1      0      One
CustD  1      0      One
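If only the label is needed, a more compact variant is to keep a set of purchased products per customer and compare it against the full product set. This is a sketch under assumptions: the helper `classify` and the names `items_by_customer` and `customer_types` are hypothetical, and the in-memory StringIO stands in for a real CSV file, which csv.DictReader would read row by row without loading everything at once.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a real file; an open file object could be passed to DictReader instead.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

products = {'ProdA', 'ProdB'}

# Map each customer to the set of products bought with a positive quantity.
items_by_customer = defaultdict(set)
for row in csv.DictReader(input_data):
    if int(row['qty']) > 0:
        items_by_customer[row['customer']].add(row['item'])


def classify(items):
    """Hypothetical helper: 'Both' if every tracked product was bought, else 'One'."""
    return 'Both' if products <= items else 'One'


customer_types = {c: classify(items) for c, items in items_by_customer.items()}
print(customer_types)
```

Because sets ignore repeats, a customer with several rows for the same product is still labelled One.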