Question

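I have a CSV file of bank transactions that I import as a pandas DataFrame. I want to create a new column containing a transaction category (e.g. income, expense, transfer), assigned by applying regular expressions stored in a dictionary to the transaction description.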

For example:

import pandas as pd
import re

data = pd.read_csv("data/transactions.csv", parse_dates=['Date'])

Here is the output of the data DataFrame:

      Date                  Description   Amount
2016-01-01          checkcard good food   -12.45
2016-01-02              visa peppy lube   -30.34
2016-01-05           deposit bank of me  5000.00
2016-01-05             transfer to bank  2500.00
2016-01-10         gift from aunt sally    25.00

Here are the regular expressions:

income = re.compile('.*deposit|gift.*')
expense = re.compile('good food|.*peppy lube.*')
transfer = re.compile('.*transfer.*')

Here is the dictionary:

catdict = {income: 'income',
           expense: 'expense',
           transfer: 'transfer'}

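I want code that creates a new column named Category, assigning the dictionary's value to each row whose Description matches one of the regular expressions, so the result would be: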

      Date                  Description   Amount  Category
2016-01-01          checkcard good food   -12.45   expense 
2016-01-02              visa peppy lube   -30.34   expense
2016-01-05           deposit bank of me  5000.00    income
2016-01-05             transfer to bank  2500.00  transfer
2016-01-10         gift from aunt sally    25.00    income

Ideally, the code would also insert 'RECONCILE' in the Category column for rows where none of the regular expressions match.

I am new to Python and suspect there is a Pythonic way to do this that I don't know about.

Thanks in advance.

Answer 1

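You can define a function that maps a string (the description) to a category based on your regexes. The function returns the name of the category for the first pattern that matches, and returns 'RECONCILE' if nothing matches.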

from collections import OrderedDict

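def category(s):
    # Ordered so the patterns are tried in a fixed sequence; the first match wins.
    catdict = OrderedDict([(income, 'income'),
                          (expense, 'expense'),
                          (transfer, 'transfer'),
                          ])
    # .items() works on Python 2 and 3; .iteritems() exists only on Python 2.
    for ptn, name in catdict.items():
        if ptn.search(s):
            return name
    # No pattern matched.
    return 'RECONCILE'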
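
Because the function returns on the first match, the order of the patterns decides the result whenever a description matches more than one of them. The following is a minimal sketch of that behaviour, assuming Python 3.7 or later, where a plain dict preserves insertion order and can stand in for OrderedDict. The two sample descriptions are made up for illustration and do not appear in the question's data.

```python
import re

# Patterns copied from the question.
income = re.compile('.*deposit|gift.*')
expense = re.compile('good food|.*peppy lube.*')
transfer = re.compile('.*transfer.*')

# Assumption: Python 3.7+, where a plain dict keeps insertion order.
catdict = {income: 'income', expense: 'expense', transfer: 'transfer'}

def category(s):
    """Return the category of the first pattern that matches s."""
    for ptn, name in catdict.items():
        if ptn.search(s):
            return name
    return 'RECONCILE'

# Hypothetical description that matches both the income and transfer patterns;
# income is tried first, so it wins.
print(category('transfer of deposit'))  # income

# Hypothetical description that matches no pattern.
print(category('atm withdrawal'))  # RECONCILE
```

If transfers should take priority over income, listing transfer first in the dictionary would be enough.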

You can then apply this function to the 'Description' column.

data['Category'] = data.Description.map(category)

print(data)

This should give you what you want.
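
For anyone who wants to try the approach without the CSV file, here is a self-contained sketch for Python 3. It builds the sample data from the question in memory instead of reading data/transactions.csv, and keeps the patterns in a list of (pattern, name) pairs rather than a dictionary, which is one possible way to make the matching order explicit. The last row ('atm withdrawal') and the name categorize are hypothetical additions, included only to exercise the 'RECONCILE' fallback.

```python
import re

import pandas as pd

# Sample rows from the question, plus one made-up row that matches no pattern.
data = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05',
                            '2016-01-05', '2016-01-10', '2016-01-11']),
    'Description': ['checkcard good food', 'visa peppy lube',
                    'deposit bank of me', 'transfer to bank',
                    'gift from aunt sally', 'atm withdrawal'],
    'Amount': [-12.45, -30.34, 5000.00, 2500.00, 25.00, -40.00],
})

# (pattern, category) pairs, tried in this order; the first match wins.
patterns = [
    (re.compile('.*deposit|gift.*'), 'income'),
    (re.compile('good food|.*peppy lube.*'), 'expense'),
    (re.compile('.*transfer.*'), 'transfer'),
]

def categorize(description):
    """Map one description string to a category name."""
    for ptn, name in patterns:
        if ptn.search(description):
            return name
    return 'RECONCILE'

# Apply the function to every value in the Description column.
data['Category'] = data['Description'].map(categorize)

print(data)
```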
