I want to extract all of the statistics from each of the following entries:
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% #ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% #ChangeColumnPeerAssessmentScoreOutOf5.:4.3
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:82.0% #ChangeColumnAverageStartingSalaryAndBonus:$127,368 3.29 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:89.8% #ChangeColumnPeerAssessmentScoreOutOf5.:4.1
#ChangeColumnFullTimeGraduatesEmployedAtGraduation:80.7% #ChangeColumnAverageStartingSalaryAndBonus:$123,177 3.4 #ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:92.5% #ChangeColumnPeerAssessmentScoreOutOf5.:4.0
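I have been trying to do this with regular expressions (regex). Since the desired output should contain nothing but digits, percent signs, and $ signs, this is what I have cobbled together: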
    import re
    import csv
    with(open('sheet.csv','rU')) as f:
        for row in f:
            re.sub([^0-9\$\%],'',row)
This fails with a syntax error on this line:
re.sub([^0-9\$\%],'',row)
Answer 0 (score: 4)
The re module parses a regular expression from a string, so the pattern passed to re.sub has to be a string, i.e.
>>> re.sub(r'[^0-9\$\%]','',row)
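Applied to the question's approach, a minimal sketch might look like the following. The helper name `strip_labels` and the shortened sample row are illustrative assumptions, not part of the original post. Note that `re.sub` returns a new string rather than modifying `row`, so the result has to be captured or printed.

```python
import re

# The pattern is a (raw) string: delete everything except digits, '$' and '%'.
KEEP_ONLY = re.compile(r'[^0-9$%]')

def strip_labels(row):
    """Hypothetical helper: return row with all other characters removed."""
    return KEEP_ONLY.sub('', row)

# Shortened sample row, assumed to have the same shape as the question's data.
sample = ("#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% "
          "#ChangeColumnPeerAssessmentScoreOutOf5.:4.3")

print(strip_labels(sample))  # prints 743%543
```

The output shows why substitution alone is probably not enough: the decimal points are gone, the values run together, and the `5` from the `OutOf5.` label leaks into the result.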
Or perhaps you want to split instead:
>>> [c for c in re.split(r'[^0-9\$\%\.]',row) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '5.', '4.3']
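Actually, that is still not right, because your column labels contain digits (note the stray '5.' above). If your input looks exactly like your example, something like this may work better: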
>>> [c for c in re.split(r'#[^:]+:|[ ,]', row.strip()) if c]
['74.3%', '$134', '360', '3.4', '81.4%', '4.3']
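Splitting on the comma breaks the salary into '$134' and '360'. If the salary should stay in one piece, one alternative is to match the values directly with `re.findall` instead of splitting on the labels. This is only a sketch, under the assumption that every value is preceded by a colon or whitespace and looks like an optional `$`, digits with optional comma groups, an optional decimal part, and an optional `%`. The name `extract_values` is a hypothetical helper.

```python
import re

# Assumed value shape: optional '$', digits with optional ',ddd' groups,
# optional decimal part, optional '%'. The lookbehind requires a preceding
# ':' or whitespace, which skips digits inside labels such as 'OutOf5.'.
VALUE = re.compile(r'(?<=[:\s])\$?\d+(?:,\d{3})*(?:\.\d+)?%?')

def extract_values(row):
    """Hypothetical helper: return the statistics found in one row."""
    return VALUE.findall(row)

row = ("#ChangeColumnFullTimeGraduatesEmployedAtGraduation:74.3% "
       "#ChangeColumnAverageStartingSalaryAndBonus:$134,360 3.4 "
       "#ChangeColumnFullTimeGraduatesEmployedThreeMonthsAfterGraduation:81.4% "
       "#ChangeColumnPeerAssessmentScoreOutOf5.:4.3")

print(extract_values(row))
# ['74.3%', '$134,360', '3.4', '81.4%', '4.3']
```

Like the split-based version, this depends on the input looking exactly like the example rows; a value that is not preceded by a colon or whitespace would be missed.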