Question

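I have three fields: 1) invoice number, 2) invoice sub number, and 3) invoice amount. Each unique invoice number may have several invoice sub numbers. The requirement: for each unique invoice number, looking across all of its rows, if there are sub numbers starting with both 1200 and 2100, a dummy column should show "Both 1200 and 2100 exist"; otherwise, if any of its rows has a sub number starting with 1200, the dummy column should be "Only 1200"; otherwise it should be "Only 2100". An example is below.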

S.no  Invoice #  Invoice Sub Number  Amount  Dummy
--------------------------------------------------------------------
 1.   1234       1230                $100    Both 2100 and 1200 exists
 2.   1234       2100                $100    Both 2100 and 1200 exists
 3.   1234       1200                $100    Both 2100 and 1200 exists
 4.   1245       5430                $50     Only 1200 exists
 5.   1245       1200                $80     Only 1200 exists

I tried the following commands in Python, but they do not run correctly, and I need help with them. The commands I used:

df1= df
df1['Invoice #'] = df1['Invoice #'].astype(object)
df['Invoice sub Number'] = df['Invoice sub Number'].astype(str)
df1= df1.groupby(df['Invoice sub Number','Invoice #'].size().groupby(level=0).size())

df1['dummy']= np.where(df1['Invoice sub Number'].str.startswith ('1200'),'Contains 1200 only',
               np.where(df1['Invoice sub Number'].str.startswith ('2100'),'Contains 2100 only',
                        np.where((df1['Invoice sub Number'].str.startswith ('1200'))&(df1['Invoice sub Number'].str.startswith ('2100')),
                                 'Contains both 1200 and 2100','Contains neither 1200 nor 2100')))

The error I get is: KeyError: ('Invoice sub Number', 'Invoice #')

Answer 1

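I suggest using GroupBy.any with transform to check whether each group contains at least one True, then building the column with numpy.select from the conditions:

Use: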

print (df)
    Invoice #  Invoice sub Number  Amount
0         123                1234     100
1         123                2345     200
2         123                3456     300
3         123                1200     400
4         123                2100     500
5        1234                1245     600
6        1234                2344     700
7        1234                1200     800
8        2345                 345     900
9        2345                2100    1000
10       2345                2458    1100
11       6789                2345    1200
12       6789                3421    1300
13       6789                1234    1400

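import numpy as np

# Row-level flags: does this row's sub number start with the prefix?
m1 = df['Invoice sub Number'].astype(str).str.startswith('1200')
m2 = df['Invoice sub Number'].astype(str).str.startswith('2100')

# Broadcast to every row of the invoice: True if any row in the group matched
m11 = m1.groupby(df['Invoice #']).transform('any')
m22 = m2.groupby(df['Invoice #']).transform('any')

# np.select takes the first matching condition, so "both" must come first
masks = [m11 & m22, m11, m22]
vals = ['Contains both 1200 and 2100', 'Contains 1200 only', 'Contains 2100 only']
default = 'Contains neither 1200 nor 2100'

df['dummy'] = np.select(masks, vals, default=default)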

print (df)
    Invoice #  Invoice sub Number  Amount                           dummy
0         123                1234     100     Contains both 1200 and 2100
1         123                2345     200     Contains both 1200 and 2100
2         123                3456     300     Contains both 1200 and 2100
3         123                1200     400     Contains both 1200 and 2100
4         123                2100     500     Contains both 1200 and 2100
5        1234                1245     600              Contains 1200 only
6        1234                2344     700              Contains 1200 only
7        1234                1200     800              Contains 1200 only
8        2345                 345     900              Contains 2100 only
9        2345                2100    1000              Contains 2100 only
10       2345                2458    1100              Contains 2100 only
11       6789                2345    1200  Contains neither 1200 nor 2100
12       6789                3421    1300  Contains neither 1200 nor 2100
13       6789                1234    1400  Contains neither 1200 nor 2100
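
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The function name `label_invoices`, its `key` and `sub` parameters, and the small `sample` frame are made up for illustration and are not part of the original answer; the column names are assumed to match the frame printed above.

```python
import numpy as np
import pandas as pd


def label_invoices(df, key="Invoice #", sub="Invoice sub Number"):
    """Return a copy of df with a per-invoice 'dummy' label (hypothetical helper)."""
    sub_str = df[sub].astype(str)

    # Per-row prefix test, then spread "any row matched" over the whole invoice
    has_1200 = sub_str.str.startswith("1200").groupby(df[key]).transform("any")
    has_2100 = sub_str.str.startswith("2100").groupby(df[key]).transform("any")

    # Order matters: np.select uses the first condition that is True
    conditions = [has_1200 & has_2100, has_1200, has_2100]
    labels = [
        "Contains both 1200 and 2100",
        "Contains 1200 only",
        "Contains 2100 only",
    ]

    out = df.copy()  # leave the caller's frame untouched
    out["dummy"] = np.select(
        conditions, labels, default="Contains neither 1200 nor 2100"
    )
    return out


# Illustrative data only, not the asker's real invoices
sample = pd.DataFrame(
    {
        "Invoice #": [123, 123, 1234, 1234, 2345, 6789],
        "Invoice sub Number": [1200, 2100, 1245, 1200, 2100, 3421],
        "Amount": [400, 500, 600, 800, 1000, 1300],
    }
)
result = label_invoices(sample)
print(result)
```

Grouping by the key Series directly also sidesteps the KeyError from the question: `df['Invoice sub Number','Invoice #']` is read by pandas as a lookup of one column whose label is that tuple, whereas selecting two columns needs a list, as in `df[['Invoice sub Number', 'Invoice #']]`.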
