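I have two datasets: i) San Francisco salary data and ii) fake Amazon order data.
My question is why my conditional selection logic works on one dataset, while on the other I have to use different logic.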
import pandas as pd
sal=pd.read_csv('Salary.csv')
sal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id 148654 non-null int64
EmployeeName 148654 non-null object
JobTitle 148654 non-null object
BasePay 148045 non-null float64
OvertimePay 148650 non-null float64
OtherPay 148650 non-null float64
Benefits 112491 non-null float64
TotalPay 148654 non-null float64
TotalPayBenefits 148654 non-null float64
Year 148654 non-null int64
Notes 0 non-null float64
Agency 148654 non-null object
Status 0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
Question: How many job titles were held by only one person in 2013? (That is, how many job titles occur exactly once in 2013?)
len(sal[(sal['Year']==2013) & (sal['JobTitle'].value_counts()==1)])
This does not work. It outputs 0, but it should output 202.
sum(sal[sal['Year']==2013]['JobTitle'].value_counts()==1)
This works, but it is not intuitive.
import pandas as pd
ecom=pd.read_csv('EcommercePurchases.csv')
ecom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
Address 10000 non-null object
Lot 10000 non-null object
AM or PM 10000 non-null object
Browser Info 10000 non-null object
Company 10000 non-null object
Credit Card 10000 non-null int64
CC Exp Date 10000 non-null object
CC Security Code 10000 non-null int64
CC Provider 10000 non-null object
Email 10000 non-null object
Job 10000 non-null object
IP Address 10000 non-null object
Language 10000 non-null object
Purchase Price 10000 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB
Question: How many people have American Express as their credit card provider and made a purchase above $95?
len(ecom[(ecom['CC Provider']=='American Express') & (ecom['Purchase Price'] >95)])
This gives the correct output here. I want to know why the same pattern does not work in the case above.
Answer 0 (score: 0)
I think you need GroupBy.transform with size, which returns a Series the same size as the original DataFrame:
out = ((sal['Year']==2013) & (sal.groupby(['JobTitle','Year'])['JobTitle'].transform('size')==1)).sum()
# use 'count' instead to exclude NaN values in JobTitle, if any exist
out = ((sal['Year']==2013) & (sal.groupby(['JobTitle','Year'])['JobTitle'].transform('count')==1)).sum()
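A likely explanation for why the original attempt returns 0: sal['JobTitle'].value_counts() is indexed by job title, with one entry per distinct title, so it does not line up with the row-indexed Year mask. transform('size') keeps the row index, so the two masks can be combined element-wise. Here is a minimal sketch on made-up data (the toy frame is an assumption, not the real Salary.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for the salary file.
sal = pd.DataFrame({'JobTitle': list('abccbbd'),
                    'Year': [2012] + [2013] * 6})

year_mask = sal['Year'] == 2013
title_counts = sal['JobTitle'].value_counts()
per_row_size = sal.groupby(['JobTitle', 'Year'])['JobTitle'].transform('size')

# value_counts() is indexed by job title, one entry per distinct title.
print(sorted(title_counts.index))

# The year mask and the transform result are both indexed by row label.
print(year_mask.index.equals(sal.index))
print(per_row_size.index.equals(sal.index))

# Because the indexes match, the two row-level masks combine element-wise.
unique_2013 = sal[year_mask & (per_row_size == 1)]
print(unique_2013)
```

In the ecom example from the question, both conditions are built from columns of the same DataFrame, so they already share its row index.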
Alternatively, use duplicated with keep=False to mark all duplicates across the listed columns, and ~ to invert the boolean mask:
out = ((sal['Year']==2013) & ~(sal.duplicated(subset=['Year','JobTitle'], keep=False))).sum()
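To see what the duplicated mask looks like before it is inverted, here is a small sketch on a made-up frame (the data is hypothetical and only for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the real salary file.
sal = pd.DataFrame({'JobTitle': list('abccbbd'),
                    'Year': [2012] + [2013] * 6})

# keep=False marks every row whose (Year, JobTitle) pair occurs more than once.
dup_mask = sal.duplicated(subset=['Year', 'JobTitle'], keep=False)
print(dup_mask.tolist())

# ~ flips the mask, so True now means the pair occurs exactly once.
out = ((sal['Year'] == 2013) & ~dup_mask).sum()
print(out)
```

Only the row with title 'd' is both in 2013 and not repeated, so the count here is 1.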
Sample:
sal = pd.DataFrame({'JobTitle':list('abccbbd'),
'Year':[2012] + [2013] * 6})
print (sal)
JobTitle Year
0 a 2012
1 b 2013
2 c 2013
3 c 2013
4 b 2013
5 b 2013
6 d 2013
print (sal.groupby(['JobTitle','Year'])['JobTitle'].transform('size'))
0 1
1 3
2 2
3 2
4 3
5 3
6 1
Name: JobTitle, dtype: int64
out = ((sal['Year']==2013) & (sal.groupby(['JobTitle','Year'])['JobTitle'].transform('size')==1)).sum()
print (out)
1
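As a sanity check, the filter-first approach from the question and the two approaches in this answer should agree. The sketch below uses a different made-up frame in which one title appears in two different years. It also illustrates why Year needs to be part of the grouping: counting over all years would treat that title as repeated.

```python
import pandas as pd

# Hypothetical data: title 'a' appears once in 2012 and once in 2013.
sal = pd.DataFrame({'JobTitle': ['a', 'a', 'b', 'b', 'c'],
                    'Year': [2012, 2013, 2013, 2013, 2013]})
is_2013 = sal['Year'] == 2013

# Filter first, then count titles within that year only.
by_filter = (sal.loc[is_2013, 'JobTitle'].value_counts() == 1).sum()

# Row-aligned group sizes per (JobTitle, Year) pair.
sizes = sal.groupby(['JobTitle', 'Year'])['JobTitle'].transform('size')
by_transform = (is_2013 & (sizes == 1)).sum()

# Rows whose (Year, JobTitle) pair is not repeated anywhere.
not_dup = ~sal.duplicated(subset=['Year', 'JobTitle'], keep=False)
by_duplicated = (is_2013 & not_dup).sum()

print(by_filter, by_transform, by_duplicated)

# Counting over all years would miss 'a', which occurs twice overall.
all_years = sal['JobTitle'].value_counts()
print(all_years['a'])
```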