Edit column values based on other column values

Time: 2016-01-20 21:26:55

Tags: python pandas

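I have a pandas.DataFrame that I need to update based on the values in a couple of columns, which contain the values for the columns I actually need. The NAME column is really named something else, since I know that is bad practice. This is just an example.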

Here is the sample I am working with:

import re
import pandas as pd

def anydigit(text):
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'ADDR_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'ADDR_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['ADDR_1'].apply(anydigit)
df['addr_ad2'] = df['ADDR_2'].apply(anydigit)
df['AUX_ADDR_LINE'] = ''

This is what needs to happen:

if addr_ad1 == 0 and addr_ad2 > 0:
    aux_addr_line = addr_1
    addr_1 = addr_2
    addr_2 = ''
elif addr_ad1 > 0 and re.sub(r'\s+', '', addr_2)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_1
    addr_2 = ''
elif addr_ad2 > 0 and re.sub(r'\s+', '', addr_1)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_2
    addr_2 = ''

I thought .apply() would work, but I am not sure how I would write it.

1 answer:

Answer 0 (score: 0)

I adjusted some of the variable names:

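import re
import pandas as pd

def anydigit(text):
    # Position of the first run of digits. Note that 0 is returned both when
    # no digit is found and when the digits start at index 0.
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0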

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'addr_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['addr_1'].apply(anydigit)
df['addr_ad2'] = df['addr_2'].apply(anydigit)
df['aux_addr_line'] = ''

Starting with:

  DPID      NAME          addr_1          addr_2  addr_ad1  addr_ad2  \
0   A1  John Doe     123 MAIN ST                         0         0   
1   A2  Jane Doe  ATTN: JOHN DOE  P O BOX 123456         0         8   

  aux_addr_line  
0                
1               
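
One thing worth noting in the table above: addr_ad1 is 0 for '123 MAIN ST', because anydigit returns 0 both when there is no digit and when the first digit sits at index 0. The two sample rows come out the same either way, but a row such as '123 MAIN ST' with 'P O BOX 9' in addr_2 would be treated as having no digits in addr_1. A minimal sketch of a variant that keeps the two cases apart, assuming a hypothetical helper name first_digit_pos and -1 as the "no digit" marker (neither appears in the original question or answer):

```python
import re

def first_digit_pos(text):
    """Return the index of the first digit in text, or -1 if there is none."""
    match = re.search(r'\d', text)
    return match.start() if match else -1

print(first_digit_pos('123 MAIN ST'))     # 0
print(first_digit_pos('ATTN: JOHN DOE'))  # -1
print(first_digit_pos('P O BOX 123456'))  # 8
```

If this variant were used, the conditions would need to test for -1 ("no digit") and >= 0 ("has a digit") instead of == 0 and > 0.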

Define a function to apply to all rows:

def change_address(row):
    if row.addr_ad1 == 0 and row.addr_ad2 > 0:
        # Only addr_2 has digits: move addr_1 to the aux line, promote addr_2.
        row.aux_addr_line = row.addr_1
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    # 'POBOX' has five characters, so compare against the first five.
    elif row.addr_ad1 > 0 and re.sub(r'\s+', '', row.addr_2)[:5] == 'POBOX':
        row.aux_addr_line = ''
        row.addr_1 = row.addr_1
        row.addr_2 = ''
    elif row.addr_ad2 > 0 and re.sub(r'\s+', '', row.addr_1)[:5] == 'POBOX':
        row.aux_addr_line = ''
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    return row

df = df.apply(change_address, axis=1)

Which gives:

  DPID      NAME          addr_1 addr_2  addr_ad1  addr_ad2   aux_addr_line
0   A1  John Doe     123 MAIN ST                0         0                
1   A2  Jane Doe  P O BOX 123456                0         8  ATTN: JOHN DOE
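
For larger frames, the same three rules could probably be expressed with boolean masks and .loc instead of a row-wise apply. The following is only a sketch under stated assumptions, not part of the original answer: it tests "contains a digit" directly rather than using the addr_ad1/addr_ad2 positions (so a leading digit counts as a digit), it checks for the full 'POBOX' prefix, and the helper is_po_box and the mask names move, keep and swap are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'],
                   'addr_2': ['', 'P O BOX 123456']})
df['aux_addr_line'] = ''

def is_po_box(col):
    # Hypothetical helper: strip all whitespace, then look for a 'POBOX' prefix.
    return col.str.replace(r'\s+', '', regex=True).str.startswith('POBOX')

# True where the column contains at least one digit.
has_digit_1 = df['addr_1'].str.contains(r'\d')
has_digit_2 = df['addr_2'].str.contains(r'\d')

# Build all masks from the original values, mirroring the if/elif/elif order.
move = ~has_digit_1 & has_digit_2
keep = ~move & has_digit_1 & is_po_box(df['addr_2'])
swap = ~move & ~keep & has_digit_2 & is_po_box(df['addr_1'])

# Copy addr_1 to the aux line before addr_1 is overwritten.
df.loc[move, 'aux_addr_line'] = df.loc[move, 'addr_1']
df.loc[keep | swap, 'aux_addr_line'] = ''
df.loc[move | swap, 'addr_1'] = df.loc[move | swap, 'addr_2']
df.loc[move | keep | swap, 'addr_2'] = ''

print(df)
```

On the two sample rows this should leave row 0 unchanged and move 'ATTN: JOHN DOE' into aux_addr_line for row 1, matching the apply-based result above.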