Extracting data from strings when they are contained in a list

Time: 2019-10-28 16:11:01

Tags: pandas list iteration nan string-parsing

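I am trying to sort the geographic information held in 4 columns of a Pandas DataFrame so that administrative divisions of the same type are always stored in the same column.

I have built 5 lists of strings containing the information for the 5 geographic levels I want to store.

I tried to populate the consistent columns by comparing the original 4 inconsistent columns against my 5 consistent lists, but the nan values in the original columns either trigger errors in my code or produce too many nan values in the resulting columns. A minimal code example is below.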

import numpy as np
import pandas as pd
df = pd.DataFrame (np.array([['nan', 'Rome', 'Civitavecchia'],
                             ['Asti', 'nan', 'Piedmont'],
                             ['Bozen', 'Sudtirol', 'nan']]),
 columns=['a','b','c'])


town = ['Civitavecchia']
province = ['Rome', 'Asti', 'Bozen']
region = ['Piedmont', 'Sudtirol']

#first attempt returns a ValueError: pattern contains no capture groups:
df['a'].str.extractall('|'.join(town))

#second attempt:
#this only yields two out of six not-nan results expected

df['geo1'] = np.where(df.a.isin(town), df.a, np.nan)
df['geo1'] = np.where(df.b.isin(town), df.b, np.nan)
df['geo1'] = np.where(df.c.isin(town), df.c, np.nan)

df['geo2'] = np.where(df.a.isin(province), df.a, np.nan)
df['geo2'] = np.where(df.b.isin(province), df.b, np.nan)
df['geo2'] = np.where(df.c.isin(province), df.c, np.nan)

df['geo3'] = np.where(df.a.isin(region), df.a, np.nan)
df['geo3'] = np.where(df.b.isin(region), df.b, np.nan)
df['geo3'] = np.where(df.c.isin(region), df.c, np.nan)


dftarget = pd.DataFrame (np.array([['Civitavecchia', 'Rome', 'nan'],
                             ['nan', 'Asti', 'Piedmont'],
                             ['nan', 'Bozen', 'Sudtirol']]),
 columns=['geo1','geo2','geo3'])

My desired output is described in dftarget.

2 answers:

Answer 0: (score: 2)

IIUC, you can stack the data, map it, and pivot:

# create a common mapping
d = {}
for t in town: d[t] = 'geo1'
for p in province: d[p] = 'geo2'
for r in region: d[r] = 'geo3'    

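# stack data for one-go map
a = (df.stack().to_frame(name='data')
         .reset_index(level=1, drop=True)
    )

# map each value to its target column; values not in any list (e.g. 'nan') become NaN
a['col'] = a['data'].map(d)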

# return data
a.dropna().pivot(values='data', columns='col')

Output:

col           geo1   geo2      geo3
0    Civitavecchia   Rome       NaN
1              NaN   Asti  Piedmont
2              NaN  Bozen  Sudtirol
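
For reference, the stack, map, and pivot steps can be put together as one self-contained sketch. It assumes the missing entries are real NaN values rather than the string 'nan' from the question (the same steps should also work with the string, since 'nan' is not in any of the lists). The names level_of, long_df, and result are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Assumption: missing values are real NaN, not the string 'nan'.
df = pd.DataFrame(
    [[np.nan, "Rome", "Civitavecchia"],
     ["Asti", np.nan, "Piedmont"],
     ["Bozen", "Sudtirol", np.nan]],
    columns=["a", "b", "c"],
)

town = ["Civitavecchia"]
province = ["Rome", "Asti", "Bozen"]
region = ["Piedmont", "Sudtirol"]

# One lookup from place name to target column (hypothetical helper name).
level_of = {name: "geo1" for name in town}
level_of.update({name: "geo2" for name in province})
level_of.update({name: "geo3" for name in region})

# Long format: one row per cell, indexed by the original row label.
long_df = df.stack().to_frame(name="data").reset_index(level=1, drop=True)
long_df["col"] = long_df["data"].map(level_of)

# Drop cells that matched no list, then spread the values back out by level.
result = (
    long_df.dropna(subset=["col"])
           .pivot(columns="col", values="data")
           .reindex(index=df.index, columns=["geo1", "geo2", "geo3"])
)
print(result)
```

The final reindex is a design choice: it keeps rows that matched nothing and guarantees all three geo columns exist even when one list has no hits. Note that pivot raises an error if a single row contains two values of the same level.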

Answer 1: (score: 1)

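Try this approach, which uses f-string formatting. You need to define a capture group by wrapping the pattern in parentheses. Without the inner parentheses, you get the error saying the pattern contains no capture groups.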

df['c'].str.extract(f'({"|".join(town)})')

Output:

               0
0  Civitavecchia
1            NaN
2            NaN
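
The snippet above covers a single column and a single list. One possible way to extend the same capture-group idea to every column and every level is sketched below. It assumes real NaN values instead of the string 'nan', assumes each row holds at most one name per level, and uses re.escape plus ^...$ anchors so that only whole-cell matches count. The names levels, hits, and result are illustrative.

```python
import re

import numpy as np
import pandas as pd

# Assumption: missing values are real NaN, not the string 'nan'.
df = pd.DataFrame(
    [[np.nan, "Rome", "Civitavecchia"],
     ["Asti", np.nan, "Piedmont"],
     ["Bozen", "Sudtirol", np.nan]],
    columns=["a", "b", "c"],
)

# Hypothetical mapping from target column to the list of names for that level.
levels = {
    "geo1": ["Civitavecchia"],
    "geo2": ["Rome", "Asti", "Bozen"],
    "geo3": ["Piedmont", "Sudtirol"],
}

result = pd.DataFrame(index=df.index)
for col, names in levels.items():
    # The parentheses form the capture group that str.extract requires.
    pattern = "^(" + "|".join(re.escape(n) for n in names) + ")$"
    # Run the same pattern over every source column.
    hits = df.apply(lambda s: s.str.extract(pattern, expand=False))
    # Take the first non-missing match in each row, wherever it was found.
    result[col] = hits.bfill(axis=1).iloc[:, 0]

print(result)
```

Unlike the chained np.where calls in the question, each geo column is assigned once from the combined matches, so a later source column cannot overwrite a hit found in an earlier one.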