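Reshape a DataFrame and count values based on conditions

Time: 2019-12-18 20:19:21

Tags: python pandas

I have the data set up below. I am trying to determine each customer's type by assigning a label. Excel crashed when I tried this because there is too much data, so I am trying to do it in Python.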

item  customer qty
------------------
ProdA CustA    1 
ProdA CustB    1
ProdA CustC    1
ProdA CustD    1
ProdB CustA    1
ProdB CustB    1

In Excel, I would:

1. Create new columns "ProdA", "ProdB", "Type"
2. Remove duplicates for column "customer"
3. COUNTIF customer = ProdA, COUNTIF customer = ProdB
4. IF(AND(ProdA = 1, ProdB = 1), "Both", "One")


customer ProdA ProdB Type
--------------------------
CustA    1     1     Both
CustB    1     1     Both
CustC    1     0     One
CustD    1     0     One

3 answers:

Answer 0 (score: 2)

Method 1:

We can use pd.crosstab, then sum the ProdA and ProdB columns and use Series.map to map 2 -> Both and 1 -> One:

# Assumes `import numpy as np`, `import pandas as pd`, and `df` holding the question's data.
dfn = pd.crosstab(df['customer'], df['item']).reset_index()
dfn['Type'] = dfn[['ProdA', 'ProdB']].sum(axis=1).map({2:'Both', 1:'One'})

Alternatively, we can use np.where for the last line to conditionally assign Both or One:

dfn['Type'] = np.where(dfn['ProdA'].eq(1) & dfn['ProdB'].eq(1), 'Both', 'One')

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
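
For reference, here is a minimal self-contained sketch of Method 1 that also builds the sample frame from the question. One deliberate difference is an assumption on my part, not something from the answer: it tests `ge(1)` instead of `eq(1)`, so a customer with repeated rows for the same product would still be labelled "Both".

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB"],
    "qty": [1, 1, 1, 1, 1, 1],
})

# Count rows per (customer, item) pair; pairs that never occur become 0.
dfn = pd.crosstab(df["customer"], df["item"]).reset_index()

# "Both" when each product appears at least once for the customer.
# ge(1) rather than eq(1) is an assumption to tolerate repeated rows.
has_both = dfn["ProdA"].ge(1) & dfn["ProdB"].ge(1)
dfn["Type"] = np.where(has_both, "Both", "One")

print(dfn)
```

Note that pd.crosstab counts rows and ignores the qty column; with the question's data (qty is always 1) the two coincide.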

Method 2:

We can also make fuller use of pd.crosstab by passing margins=True, which adds row and column totals:

dfn = pd.crosstab(df['customer'], df['item'], 
                  margins=True, 
                  margins_name='Type').iloc[:-1].reset_index()

dfn['Type'] = dfn['Type'].map({2:'Both', 1:'One'})

Output:

item customer  ProdA  ProdB  Type
0       CustA      1      1  Both
1       CustB      1      1  Both
2       CustC      1      0   One
3       CustD      1      0   One
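
To make the role of `.iloc[:-1]` clearer, the sketch below prints the intermediate table that `margins=True` produces before the totals row is dropped. The variable name `full` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB"],
    "qty": [1, 1, 1, 1, 1, 1],
})

# margins=True appends a totals column and a totals row,
# both labelled with margins_name ("Type" here).
full = pd.crosstab(df["customer"], df["item"],
                   margins=True, margins_name="Type")
print(full)

# Drop the trailing totals row, then turn the per-customer totals into labels.
dfn = full.iloc[:-1].reset_index()
dfn["Type"] = dfn["Type"].map({2: "Both", 1: "One"})
print(dfn)
```

One caveat: Series.map returns NaN for any total that is not a key of the dictionary, so a customer with, say, three rows would end up with a missing label under this approach.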

Answer 1 (score: 2)

Try using set_index, unstack, and np.select:

df_out = df.set_index(['customer', 'item'])['qty'].unstack(fill_value=0)
SumProd = df_out['ProdA'] + df_out['ProdB']
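# An explicit string default keeps all choices the same type; newer NumPy versions may reject the implicit integer default 0.
df_out['Type'] = np.select([SumProd==2, SumProd==1, SumProd==0], ['Both', 'One', 'None'], default='None')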
print(df_out)

Output:

item      ProdA  ProdB  Type
customer                    
CustA         1      1  Both
CustB         1      1  Both
CustC         1      0   One
CustD         1      0   One
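
The unstack approach raises a ValueError when the same (customer, item) pair occurs more than once, because the index then contains duplicates. As a hedged alternative, which is my own variation and not part of the answer, pivot_table can aggregate repeated rows first. The extra repeated row in the sample below is hypothetical and was added only to exercise that case.

```python
import numpy as np
import pandas as pd

# Question data plus one hypothetical repeated row (ProdA, CustA).
df = pd.DataFrame({
    "item": ["ProdA", "ProdA", "ProdA", "ProdA", "ProdB", "ProdB", "ProdA"],
    "customer": ["CustA", "CustB", "CustC", "CustD", "CustA", "CustB", "CustA"],
    "qty": [1, 1, 1, 1, 1, 1, 1],
})

# Sum qty per (customer, item); missing pairs are filled with 0.
df_out = df.pivot_table(index="customer", columns="item",
                        values="qty", aggfunc="sum", fill_value=0)

# Count how many distinct products each customer bought at least once.
n_products = df_out[["ProdA", "ProdB"]].gt(0).sum(axis=1)
df_out["Type"] = np.select([n_products == 2, n_products == 1],
                           ["Both", "One"], default="None")
print(df_out)
```

Counting products with a positive quantity, rather than summing the quantities, keeps the label correct even when qty is greater than 1.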

Answer 2 (score: 0)

In addition to the other suggestions, you could skip pandas entirely:

################################################################################
## Data ingestion
################################################################################
import csv
import io

# Formatted to make the example more straightforward.
# io.StringIO replaces the Python 2-only StringIO module.
input_data = io.StringIO('''item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
''')

records = []
reader = csv.DictReader(input_data)
for row in reader:
  records.append(row)

################################################################################
## Data transformation.
## Makes a Dict-of-Dicts. Each inner Dict contains all data for a single
## customer. 
################################################################################
products = {'ProdA', 'ProdB'}
customer_data = {}

for r in records:
  customer_id = r['customer']
  if customer_id not in customer_data:
    customer_data[customer_id] = {}
  customer_data[customer_id][r['item']] = int(r['qty'])

# Determines the customer type. 
for c in customer_data:
  c_data = customer_data[c]
  missing_product = products.difference(c_data.keys())
  matching_product = products.intersection(c_data.keys())
  if missing_product:
    for missing_p in missing_product:
      c_data[missing_p] = 0
    c_data['type'] = 'One'
  else:
    c_data['type'] = 'Both'

################################################################################
## Data display
################################################################################
for i, c in enumerate(customer_data):
  if i == 0:
    # list() is needed on Python 3, where keys() returns a view rather than a list.
    print('\t'.join(['ID'] + list(customer_data[c].keys())))
  print('\t'.join([c] + [str(x) for x in customer_data[c].values()]))

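Which prints this for me (row and column order follows dictionary ordering, so it may differ depending on your Python version):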

ID      ProdA   type    ProdB
CustC   1       One     0
CustB   1       Both    1
CustA   1       Both    1
CustD   1       One     0
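
For readers on Python 3, a more compact sketch of the same pandas-free idea follows. It is my own variation and not the answer's code: it uses collections.defaultdict, sums qty across repeated rows instead of overwriting it, and assumes every item in the data is listed in `products` (an unlisted item would raise a KeyError).

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question, in CSV form.
raw = io.StringIO("""item,customer,qty
ProdA,CustA,1
ProdA,CustB,1
ProdA,CustC,1
ProdA,CustD,1
ProdB,CustA,1
ProdB,CustB,1
""")

products = ["ProdA", "ProdB"]

# Each new customer starts with a zero count for every known product.
counts = defaultdict(lambda: dict.fromkeys(products, 0))
for row in csv.DictReader(raw):
    counts[row["customer"]][row["item"]] += int(row["qty"])

# Build one output row per customer with a fixed column order.
rows = []
for customer, per_product in counts.items():
    bought = sum(1 for p in products if per_product[p] > 0)
    kind = "Both" if bought == len(products) else "One"
    rows.append([customer] + [per_product[p] for p in products] + [kind])

print("\t".join(["ID"] + products + ["Type"]))
for r in rows:
    print("\t".join(str(x) for x in r))
```

Fixing the column order through the `products` list avoids relying on dictionary ordering when printing the header and the rows.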