我想根据正则表达式生成的结果在pandas数据框中创建一个新列。
我期待的结果是:
In[1]: df
Out[1]:
valueProduct valueService totValue
0 $465580.99 $322532.34 $788113.33
我的数据帧dtypes是:
df.dtypes
Contracting Office Name object
Contracting Office Region object
PIID object
PIID Agency ID object
Major Program object
Description of Requirement object
Referenced IDV PIID object
Completion Date datetime64[ns]
Prepared By object
Funding Office Name object
Funding Agency ID object
Funding Agency Name object
Funding Office ID object
Effective Date datetime64[ns]
Fiscal Year int64
Ultimate Contract Value float64
Count int64
标题为"要求说明"第1行中的长字符串值具有以下内容(此列中的数据集中的类似字符串值):
管理员增加额外的体积和道路工作变更银色滑动管理项目 - ALLEGHENY国家森林产品价值= $ 465580.99服务价值= 322532.34合同总价值= $ 788113.33
我想成功编写一个正则表达式从该字符串中提取3个项目,但只在新列中生成美元值:
VALUE OF PRODUCT = $465580.99
VALUE OF SERVICE = $322532.34
TOTAL VALUE OF CONTRACT = $788113.33
这里是执行此操作的代码,假设数据框中的字符串是数据帧之外的简单字符串值:
text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33"
pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE)
getPattern = re.search(pattern, text)
print (getPattern.group())
哪会产生:
VALUE OF PRODUCT = $465580.99
我可以对其他两个项重复此操作。
现在,感觉我在数据框架中工作,我尝试做类似以下的事情:
def valProduct(row):
pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE)
findPattern = re.search(pattern, row['Description of Requirement'])
return findPatter
df['valueProduct'] = df.apply(lambda row: valProduct(row), axis=1)
In[2]: sf[['valueProduct']][:1]
Out[2]: None
这会生成一个新列,但它是空的,但至少应该显示:
VALUE OF PRODUCT = $465580.99
非常感谢任何帮助!
答案 0 :(得分:1)
import re
text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33"
re.findall(r'value.+?\d\b',text, re.I)
输出
['VALUE OF PRODUCT = $465580', 'VALUE OF SERVICE = $322532', 'VALUE OF CONTRACT = $788113']