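I have a pandas.DataFrame that I need to update based on the values in several of its columns. The NAME column is actually called something else in my real data, since I know that name is bad practice; this is just an example.
Here is the sample I am working with: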
import re
import pandas as pd

def anydigit(text):
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'ADDR_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'ADDR_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['ADDR_1'].apply(anydigit)
df['addr_ad2'] = df['ADDR_2'].apply(anydigit)
df['AUX_ADDR_LINE'] = ''
This is what needs to happen for each row:
if addr_ad1 == 0 and addr_ad2 > 0:
    aux_addr_line = addr_1
    addr_1 = addr_2
    addr_2 = ''
elif addr_ad1 > 0 and re.sub(r'\s+', '', addr_2)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_1
    addr_2 = ''
elif addr_ad2 > 0 and re.sub(r'\s+', '', addr_1)[:5] == 'POBOX':
    aux_addr_line = ''
    addr_1 = addr_2
    addr_2 = ''
I think .apply() would work, but I am not sure how to write it.
Answer 0: (score: 0)
I adjusted some of the variable names:
import re
import pandas as pd

def anydigit(text):
    # Index of the first digit in text; note that 0 is also returned when no digit is found
    find_digit = re.search(r'\d+', text)
    if find_digit:
        return find_digit.start()
    else:
        return 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'], 'addr_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['addr_1'].apply(anydigit)
df['addr_ad2'] = df['addr_2'].apply(anydigit)
df['aux_addr_line'] = ''
Starting with:
  DPID      NAME          addr_1          addr_2  addr_ad1  addr_ad2  aux_addr_line
0   A1  John Doe     123 MAIN ST                         0         0
1   A2  Jane Doe  ATTN: JOHN DOE  P O BOX 123456         0         8
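One thing the table makes visible: row 0 has addr_ad1 equal to 0 even though '123 MAIN ST' starts with a digit, because a match at position 0 and "no digit found" both come back as 0 from anydigit. If that distinction matters for your data, a boolean flag may be safer than a position. The following is only a sketch of that idea; the column names has_digit_1 and has_digit_2 are hypothetical and are not used anywhere else in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'],
    'addr_2': ['', 'P O BOX 123456'],
})

# True where the string contains at least one digit, wherever it appears.
# has_digit_1 / has_digit_2 are hypothetical column names for illustration.
df['has_digit_1'] = df['addr_1'].str.contains(r'\d', regex=True)
df['has_digit_2'] = df['addr_2'].str.contains(r'\d', regex=True)
```

Switching to flags like these would change what "addr_ad1 == 0" means in the rules, so the conditions would need to be rewritten to match.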
Define a function to apply to every row:
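def change_address(row):
    if row.addr_ad1 == 0 and row.addr_ad2 > 0:
        # No digit position recorded for addr_1: keep it as the auxiliary line, promote addr_2
        row.aux_addr_line = row.addr_1
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    elif row.addr_ad1 > 0 and re.sub(r'\s+', '', row.addr_2)[:5] == 'POBOX':
        # addr_2 is a PO box: keep addr_1 and drop addr_2
        row.aux_addr_line = ''
        row.addr_1 = row.addr_1
        row.addr_2 = ''
    elif row.addr_ad2 > 0 and re.sub(r'\s+', '', row.addr_1)[:5] == 'POBOX':
        # addr_1 is a PO box: replace it with addr_2
        row.aux_addr_line = ''
        row.addr_1 = row.addr_2
        row.addr_2 = ''
    return row

# axis=1 passes each row to change_address as a Series
df = df.apply(change_address, axis=1)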
This gives:
  DPID      NAME          addr_1  addr_2  addr_ad1  addr_ad2   aux_addr_line
0   A1  John Doe     123 MAIN ST                 0         0
1   A2  Jane Doe  P O BOX 123456                 0         8  ATTN: JOHN DOE
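For larger frames, a row-wise apply can be slow. As a possible alternative, here is a minimal sketch that expresses the same three rules as boolean masks and assigns with .loc. It is an illustration rather than a drop-in replacement: the helper is_pobox and the names rule1, rule2, rule3 are made up for this example, and the PO box test uses str.startswith('POBOX') on the whitespace-stripped text.

```python
import re
import pandas as pd

def anydigit(text):
    """Return the index of the first digit in text, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return match.start() if match else 0

df = pd.DataFrame({'DPID': ['A1', 'A2'], 'NAME': ['John Doe', 'Jane Doe'],
                   'addr_1': ['123 MAIN ST', 'ATTN: JOHN DOE'],
                   'addr_2': ['', 'P O BOX 123456']})
df['addr_ad1'] = df['addr_1'].apply(anydigit)
df['addr_ad2'] = df['addr_2'].apply(anydigit)
df['aux_addr_line'] = ''

def is_pobox(col):
    # Hypothetical helper: strip all whitespace, then test for a 'POBOX' prefix.
    return col.str.replace(r'\s+', '', regex=True).str.startswith('POBOX')

# Masks mirror the if/elif chain, so each later rule excludes the earlier ones.
rule1 = (df['addr_ad1'] == 0) & (df['addr_ad2'] > 0)
rule2 = ~rule1 & (df['addr_ad1'] > 0) & is_pobox(df['addr_2'])
rule3 = ~rule1 & ~rule2 & (df['addr_ad2'] > 0) & is_pobox(df['addr_1'])

# Set aux_addr_line first, while addr_1 still holds its original value.
df.loc[rule1, 'aux_addr_line'] = df.loc[rule1, 'addr_1']
df.loc[rule2 | rule3, 'aux_addr_line'] = ''

# Rules 1 and 3 move addr_2 into addr_1; every matched rule clears addr_2.
move_up = rule1 | rule3
df.loc[move_up, 'addr_1'] = df.loc[move_up, 'addr_2']
df.loc[rule1 | rule2 | rule3, 'addr_2'] = ''
```

On the two sample rows this should produce the same addr_1, addr_2 and aux_addr_line values as the apply version; rows that match none of the rules are left untouched.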