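I have three fields: 1) Invoice #, 2) Invoice Sub Number, and 3) Amount. Each unique invoice number can have several invoice sub numbers. The requirement is this: for each unique invoice number, looking across all of its rows, if there are invoice sub numbers starting with both 1200 and 2100, a dummy column should show "Both 1200 and 2100 exist"; otherwise, if one of its rows has an invoice sub number starting with 1200, the dummy column should show "Only 1200"; otherwise it should show "Only 2100". An example is below: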
S.no  Invoice #  Invoice Sub Number  Amount  Dummy
----------------------------------------------------------------------
1.    1234       1230                $100    Both 2100 and 1200 exists
2.    1234       2100                $100    Both 2100 and 1200 exists
3.    1234       1200                $100    Both 2100 and 1200 exists
4.    1245       5430                $50     Only 1200 exists
5.    1245       1200                $80     Only 1200 exists
I tried the commands below in Python, but they do not run correctly, and I need help fixing them. Commands used:
df1= df
df1['Invoice #'] = df1['Invoice #'].astype(object)
df['Invoice sub Number'] = df['Invoice sub Number'].astype(str)
df1= df1.groupby(df['Invoice sub Number','Invoice #'].size().groupby(level=0).size())
df1['dummy']= np.where(df1['Invoice sub Number'].str.startswith ('1200'),'Contains 1200 only',
np.where(df1['Invoice sub Number'].str.startswith ('2100'),'Contains 2100 only',
np.where((df1['Invoice sub Number'].str.startswith ('1200'))&(df1['Invoice sub Number'].str.startswith ('2100')),
'Contains both 1200 and 2100','Contains neither 1200 nor 2100')))
The error I get is: KeyError: ('Invoice sub Number', 'Invoice #')
Answer 0 (score: 0)
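I suggest using GroupBy.any with transform to check whether each group has at least one True value, then building the column with numpy.select from those conditions.
Using this sample data: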
print (df)
Invoice # Invoice sub Number Amount
0 123 1234 100
1 123 2345 200
2 123 3456 300
3 123 1200 400
4 123 2100 500
5 1234 1245 600
6 1234 2344 700
7 1234 1200 800
8 2345 345 900
9 2345 2100 1000
10 2345 2458 1100
11 6789 2345 1200
12 6789 3421 1300
13 6789 1234 1400
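import numpy as np

# Row-level flags: does this sub number start with 1200 / 2100?
m1 = df['Invoice sub Number'].astype(str).str.startswith('1200')
m2 = df['Invoice sub Number'].astype(str).str.startswith('2100')
# Broadcast to every row of the invoice: True if any row in the group matches
m11 = m1.groupby(df['Invoice #']).transform('any')
m22 = m2.groupby(df['Invoice #']).transform('any')
# np.select uses the first matching condition, so "both" must come first
masks = [m11 & m22, m11, m22]
vals = ['Contains both 1200 and 2100', 'Contains 1200 only', 'Contains 2100 only']
default = 'Contains neither 1200 nor 2100'
df['dummy'] = np.select(masks, vals, default=default)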
print (df)
Invoice # Invoice sub Number Amount dummy
0 123 1234 100 Contains both 1200 and 2100
1 123 2345 200 Contains both 1200 and 2100
2 123 3456 300 Contains both 1200 and 2100
3 123 1200 400 Contains both 1200 and 2100
4 123 2100 500 Contains both 1200 and 2100
5 1234 1245 600 Contains 1200 only
6 1234 2344 700 Contains 1200 only
7 1234 1200 800 Contains 1200 only
8 2345 345 900 Contains 2100 only
9 2345 2100 1000 Contains 2100 only
10 2345 2458 1100 Contains 2100 only
11 6789 2345 1200 Contains neither 1200 nor 2100
12 6789 3421 1300 Contains neither 1200 nor 2100
13 6789 1234 1400 Contains neither 1200 nor 2100
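For anyone who wants to reproduce the result without an existing DataFrame, here is a minimal self-contained sketch of the same approach. The frame is rebuilt from the sample data above, and `label_invoices` is a hypothetical helper name introduced only for this illustration; it is not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    'Invoice #': [123] * 5 + [1234] * 3 + [2345] * 3 + [6789] * 3,
    'Invoice sub Number': [1234, 2345, 3456, 1200, 2100,
                           1245, 2344, 1200,
                           345, 2100, 2458,
                           2345, 3421, 1234],
    'Amount': list(range(100, 1500, 100)),
})


def label_invoices(frame, group_col='Invoice #', sub_col='Invoice sub Number'):
    """Hypothetical helper: return one label per row describing that row's invoice."""
    sub = frame[sub_col].astype(str)
    # True on every row of an invoice if any of its sub numbers has the prefix.
    has_1200 = sub.str.startswith('1200').groupby(frame[group_col]).transform('any')
    has_2100 = sub.str.startswith('2100').groupby(frame[group_col]).transform('any')
    # np.select takes the first matching condition, so "both" is listed first.
    conditions = [has_1200 & has_2100, has_1200, has_2100]
    labels = ['Contains both 1200 and 2100',
              'Contains 1200 only',
              'Contains 2100 only']
    return np.select(conditions, labels,
                     default='Contains neither 1200 nor 2100')


df['dummy'] = label_invoices(df)

# One label per invoice, for a quick check.
print(df.groupby('Invoice #')['dummy'].first())
```

As a side note, the KeyError in the question is consistent with how pandas reads `df['Invoice sub Number','Invoice #']`: the two names form a single tuple key rather than a list of columns. Selecting two columns needs a list, as in `df[['Invoice sub Number', 'Invoice #']]`. The nested np.where in the question also tests each row separately, and a single sub number cannot start with both 1200 and 2100, which is why the check has to be made per group.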