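I am trying to sort the geographic information held in 4 columns of a Pandas DataFrame so that the same type of administrative division is always stored in the same column.
I built 5 lists of strings containing the information for the 5 geographic levels I want to store.
I tried to populate the consistent columns by comparing the original 4 inconsistent columns against my 5 consistent lists, but the nan values in the original columns either trigger errors in the code or return too many nan values in the result columns. Below is a minimal code example.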
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([['nan', 'Rome', 'Civitavecchia'],
                            ['Asti', 'nan', 'Piedmont'],
                            ['Bozen', 'Sudtirol', 'nan']]),
                  columns=['a','b','c'])
town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']
#first attempt returns a ValueError: pattern contains no capture groups:
df['a'].str.extractall('|'.join(town))
#second attempt:
#this only yields two out of six not-nan results expected
df['geo1'] = np.where(df.a.isin(town), df.a, np.nan)
df['geo1'] = np.where(df.b.isin(town), df.b, np.nan)
df['geo1'] = np.where(df.c.isin(town), df.c, np.nan)
df['geo2'] = np.where(df.a.isin(province), df.a, np.nan)
df['geo2'] = np.where(df.b.isin(province), df.b, np.nan)
df['geo2'] = np.where(df.c.isin(province), df.c, np.nan)
df['geo3'] = np.where(df.a.isin(region), df.a, np.nan)
df['geo3'] = np.where(df.b.isin(region), df.b, np.nan)
df['geo3'] = np.where(df.c.isin(region), df.c, np.nan)
dftarget = pd.DataFrame (np.array([['Civitavecchia', 'Rome', 'nan'],
['nan', 'Asti', 'Piedmont'],
['nan', 'Bozen', 'Sudtirol']]),
columns=['geo1','geo2','geo3'])
The output I want is described in dftarget.
Answer 0 (score: 2)
IIUC, you can stack the data, map it, and pivot:
# create a common mapping
d = {}
for t in town: d[t] = 'geo1'
for p in province: d[p] = 'geo2'
for r in region: d[r] = 'geo3'
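# stack data for one-go map
a = (df.stack().to_frame(name='data')
       .reset_index(level=1, drop=True)
    )
# map each value to its target column; unlisted values (including 'nan') become NaN
a['col'] = a['data'].map(d)
# return data
a.dropna().pivot(values='data', columns='col')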
Output:
col           geo1   geo2      geo3
0    Civitavecchia   Rome       NaN
1              NaN   Asti  Piedmont
2              NaN  Bozen  Sudtirol
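For anyone who wants to run the stack/map/pivot idea end to end, here is a minimal self-contained sketch. It is a variation on the snippet above, not the answer's exact code: the names `levels`, `row`, `source` and `result` are my own, and it passes an explicit `index` to `pivot` instead of relying on the existing index. It also assumes each row holds at most one name per level; otherwise `pivot` raises on duplicate entries.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['nan', 'Rome', 'Civitavecchia'],
              ['Asti', 'nan', 'Piedmont'],
              ['Bozen', 'Sudtirol', 'nan']]),
    columns=['a', 'b', 'c'],
)
town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']

# Hypothetical helper dict: target column -> list of names that belong there.
levels = {'geo1': town, 'geo2': province, 'geo3': region}
# Invert it into name -> target column.
d = {name: col for col, names in levels.items() for name in names}

# One line per cell: original row label, source column, cell value.
a = df.stack().rename('data').reset_index()
a.columns = ['row', 'source', 'data']

# Values that are in no list (including the 'nan' strings) map to NaN.
a['col'] = a['data'].map(d)

# Drop the unmatched cells and spread the rest back into one column per level.
result = (a.dropna(subset=['col'])
            .pivot(index='row', columns='col', values='data'))
print(result)
```

Because the lookup is by value, it does not matter which of the original columns a name came from, which is what sidesteps the overwriting problem in the `np.where` attempt.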
Answer 1 (score: 1)
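Try this approach using f-string formatting. You need to define a capture group by wrapping the pattern in parentheses. Without the inner parentheses, you get the error saying the pattern contains no capture groups.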
df['c'].str.extract(f'({"|".join(town)})')
Output:
               0
0  Civitavecchia
1            NaN
2            NaN
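The line above only looks at column `c` and only at towns. A possible way to extend the same capture-group idea to every column and every list is sketched below. The helper `pick` and the names `pattern`, `hits` and `out` are hypothetical, and the `re.escape` call and `^...$` anchors are my additions so that names are matched whole and literally.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['nan', 'Rome', 'Civitavecchia'],
              ['Asti', 'nan', 'Piedmont'],
              ['Bozen', 'Sudtirol', 'nan']]),
    columns=['a', 'b', 'c'],
)
town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']


def pick(frame, names):
    """Return, per row, the first cell that matches one of `names` (else NaN)."""
    # The outer parentheses form the capture group that str.extract requires.
    pattern = f"^({'|'.join(map(re.escape, names))})$"
    # Run the extraction on every column; non-matching cells become NaN.
    hits = frame.apply(lambda s: s.str.extract(pattern, expand=False))
    # Pull the first non-missing hit of each row into the leftmost column.
    return hits.bfill(axis=1).iloc[:, 0]


out = pd.DataFrame({
    'geo1': pick(df, town),
    'geo2': pick(df, province),
    'geo3': pick(df, region),
})
print(out)
```

If a row contained two names from the same list, this sketch would keep only the leftmost one.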