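I am new to pandas and am trying to compute the mean of BasePay from the Salaries.csv file of the "SF Salaries" dataset downloaded from kaggle.com. However, some JobTitle values contain an extra comma (for example, Id 5), and since the default field separator is ",", this seems to cause a problem.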
Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,
One approach I can see so far is to edit the file with sed, replacing the comma inside the quoted field with a space or " | ":
sed 's/\(\"[^",]\{1,\}\),\([^",]\{1,\}\"\)/\1 | \2/g'
and then use
import pandas as pd
sal = pd.read_csv('/Users/Downloads/Salaries.csv')
sal['BasePay'].mean()
Does pandas offer any other way to clean up this kind of data?
Answer 0: (score: 0)
Use a small function to strip the unwanted commas from the field:
import pandas as pd

data = pd.read_csv("Salaries.csv")
data.head()

def remove_comma(text):
    # Rebuild the string from every character that is not a comma.
    return "".join(ch for ch in text if ch != ",")

data["JobTitle"] = data["JobTitle"].apply(remove_comma)
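As a side note, pandas' read_csv honors double-quoted fields by default (its quotechar parameter defaults to '"'), so the comma inside the quoted JobTitle of Id 5 should not split the row, and the sed step may not be needed at all. A minimal sketch, assuming two rows copied from the question and read from an in-memory buffer rather than the real file (sample_csv and job_title are illustrative names):

```python
import io

import pandas as pd

# Header plus rows 4 and 5 from the question; row 5 has a quoted
# JobTitle that contains a comma.
sample_csv = (
    "Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,"
    "TotalPay,TotalPayBenefits,Year,Notes,Agency,Status\n"
    "4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,"
    "56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,\n"
    '5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",'
    "134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,\n"
)

# Default settings: sep="," and quotechar='"'.
sample = pd.read_csv(io.StringIO(sample_csv))

# The quoted comma stays inside the JobTitle value.
job_title = sample.loc[sample["Id"] == 5, "JobTitle"].iloc[0]
print(job_title)
print(sample.shape)
```

If the real file parses the same way, the comma only needs removing when it is unwanted in the text itself, not to make the file loadable.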
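Because the BasePay column in this dataset contains string values, replace the "Not Provided" entries with 0.00 and convert the column to float before taking the mean: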
data["BasePay"] = data["BasePay"].replace("Not Provided","0.00").astype("float64")
data["BasePay"].mean()
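Note that replacing "Not Provided" with 0.00 pulls the mean down, because those rows are then counted as zero pay. If they should be left out instead, one option is pd.to_numeric with errors="coerce", which turns unparseable strings into NaN; mean() skips NaN by default. A sketch on a made-up three-value series (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the BasePay column: numeric strings mixed
# with the "Not Provided" marker.
base_pay = pd.Series(["100.0", "Not Provided", "300.0"])

# Approach from the answer: missing entries are counted as 0.00.
mean_with_zero = (
    base_pay.replace("Not Provided", "0.00").astype("float64").mean()
)

# Alternative: coerce unparseable strings to NaN, which mean() ignores.
mean_skipping = pd.to_numeric(base_pay, errors="coerce").mean()

print(mean_with_zero)
print(mean_skipping)
```

Which of the two is appropriate depends on whether a missing BasePay should count as zero or as unknown.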