pandas函数中的正则表达式

时间:2017-03-16 03:01:07

标签: python regex pandas

我想根据正则表达式生成的结果在pandas数据框中创建一个新列。

我期待的结果是:

In[1]: df
Out[1]: 

    valueProduct    valueService      totValue
0     $465580.99      $322532.34    $788113.33

我的数据帧dtypes是:

df.dtypes

Contracting Office Name               object
Contracting Office Region             object
PIID                                  object
PIID Agency ID                        object
Major Program                         object
Description of Requirement            object
Referenced  IDV PIID                  object
Completion Date               datetime64[ns]
Prepared By                           object
Funding Office Name                   object
Funding Agency ID                     object
Funding Agency Name                   object
Funding Office ID                     object
Effective Date                datetime64[ns]
Fiscal Year                            int64
Ultimate Contract Value              float64
Count                                  int64

标题为"要求说明"第1行中的长字符串值具有以下内容(此列中的数据集中的类似字符串值):

管理员增加额外的体积和道路工作变更银色滑动管理项目 - ALLEGHENY国家森林产品价值= $ 465580.99服务价值= 322532.34合同总价值= $ 788113.33

我想成功编写一个正则表达式从该字符串中提取3个项目,但只在新列中生成美元值:

VALUE OF PRODUCT = $465580.99
VALUE OF SERVICE = $322532.34
TOTAL VALUE OF CONTRACT = $788113.33

这里是执行此操作的代码,假设数据框中的字符串是数据帧之外的简单字符串值:

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33"


pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE)
getPattern = re.search(pattern, text)
print (getPattern.group())

哪会产生:

VALUE OF PRODUCT = $465580.99

我可以对其他两个项重复此操作。

现在,感觉我在数据框架中工作,我尝试做类似以下的事情:

def valProduct(row):
    pattern = re.compile('(VALUE OF PRODUCT).{1,3}\$\d*\.\d*', re.IGNORECASE)
    findPattern = re.search(pattern, row['Description of Requirement'])
    return findPatter

df['valueProduct'] = df.apply(lambda row: valProduct(row), axis=1)

In[2]: sf[['valueProduct']][:1]
Out[2]:  None

这会生成一个新列,但它是空的,但至少应该显示:

VALUE OF PRODUCT = $465580.99

非常感谢任何帮助!

1 个答案:

答案 0 :(得分:1)

import re    

text = "STEWARDSHIP ADD ADDITIONAL VOLUME AND ROAD WORK CHANGES SILVER SLIDE STEWARDSHIP PROJECT - ALLEGHENY NATIONAL FOREST VALUE OF PRODUCT = $465580.99 VALUE OF SERVICE = $322532.34 TOTAL VALUE OF CONTRACT = $788113.33"

re.findall(r'value.+?\d\b',text, re.I)

输出

['VALUE OF PRODUCT = $465580', 'VALUE OF SERVICE = $322532', 'VALUE OF CONTRACT = $788113']