How can I optimize code that replaces columns and indexes?

Time: 2017-07-01 19:29:34

Tags: python regex pandas dataframe data-cleaning

census_subdivision_profile_merged is a DataFrame on which I perform many operations separately. Is there a way to do all of this in one step?

# Drop missing data
census_subdivision_profile_merged = census_subdivision_profile_merged.dropna()
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i]+' '+str(i) for i in range(len(census_subdivision_profile_merged.columns))]
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i].replace(" ", "_") for i in range(len(census_subdivision_profile_merged.columns))]
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i].replace(",", "_") for i in range(len(census_subdivision_profile_merged.columns))]
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i].replace("-", "_") for i in range(len(census_subdivision_profile_merged.columns))]
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i].replace("%", "_") for i in range(len(census_subdivision_profile_merged.columns))]
census_subdivision_profile_merged.columns = [census_subdivision_profile_merged.columns[i].replace("$", "_") for i in range(len(census_subdivision_profile_merged.columns))]

1 answer:

Answer 0 (score: 0)

You call the string replace method five times, but you can use a regular expression instead:

import re
import pandas as pd

# Test data frame
df = pd.DataFrame({"data1": ["E %,-$p,e", "E    $m$$-%ple"], "data2": ["E %,-$p,e", "E    $m$$-%ple"]})

# Replace every special character and whitespace character with "_", cell by cell
for j in df.columns:
    for idx in df.index:
        df.loc[idx, j] = re.sub(r'[-%,$\s]', "_", df.loc[idx, j])


print(df)
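
To see what the character class does in isolation, the pattern can be tried on the plain strings from the test frame. Inside `[...]`, a leading `-` and the `$` are literal characters, and `\s` matches any whitespace character. A small sketch (the names `pattern`, `samples`, and `cleaned` are only for illustration):

```python
import re

# Same character class as above: hyphen, percent, comma, dollar, any whitespace
pattern = re.compile(r"[-%,$\s]")

samples = ["E %,-$p,e", "E    $m$$-%ple"]

# Each matched character is replaced one-for-one, so string lengths are preserved
cleaned = [pattern.sub("_", s) for s in samples]
print(cleaned)
```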

For your example, something like this should work:

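# Iterate over index labels rather than positions: after dropna() the labels may have gaps.
# This assumes every cell holds a string; re.sub raises TypeError on other types.
for j in census_subdivision_profile_merged.columns:
    for idx in census_subdivision_profile_merged.index:
        census_subdivision_profile_merged.loc[idx, j] = re.sub(r'[-%,$\s]', "_", census_subdivision_profile_merged.loc[idx, j])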

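Alternatively, you can use the following to replace every special character and whitespace character in the whole DataFrame with an underscore: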

census_subdivision_profile_merged = census_subdivision_profile_merged.replace(r"[-%,$\s]","_",regex=True)
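
The code in the question renames the column labels rather than the cell values, so the same idea can be applied to the labels in a single pass. A minimal sketch, assuming all column labels are strings; the `clean_columns` helper and the sample frame are hypothetical and only for illustration. The character class here uses a literal space instead of `\s` to mirror the original `replace(" ", "_")` call:

```python
import re

import pandas as pd


def clean_columns(df):
    """Return a copy of df with NaN rows dropped and column labels sanitized.

    Each label gets its position appended, then every space, comma, hyphen,
    percent sign, or dollar sign is replaced with an underscore.
    """
    out = df.dropna()
    out.columns = [
        re.sub(r"[-%,$ ]", "_", f"{name} {i}")
        for i, name in enumerate(out.columns)
    ]
    return out


# Hypothetical sample data standing in for census_subdivision_profile_merged
sample = pd.DataFrame({"Total - pop, 2016": [1.0, None], "Income $ %": [2.0, 3.0]})
cleaned = clean_columns(sample)
print(list(cleaned.columns))
```

With a helper like this, `census_subdivision_profile_merged = clean_columns(census_subdivision_profile_merged)` would stand in for the six separate assignments in the question.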