Splitting values between names in Python

Time: 2020-11-01 20:32:02

Tags: python dataframe python-regex

I have dataframe 1:

          Place     
  0       New York
  1       Los Angeles 1 
  2       Los Angeles- 2 
  3       Dallas -1
  4       Dallas - 2
  5       Dallas3

Dataframe 2:

Place          target    value1     value2
New York        1000       a          b
Los Angeles     1500       c          d
Dallas 1        2000       e          f

Desired dataframe:

Place          target       value1     value2
New York        1000           a           b
Los Angeles 1   750            c           d
Los Angeles- 2  750            c           d
Dallas -1       666.6          e           f
Dallas - 2      666.6          e           f
Dallas3         666.6          e           f    

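Explanation: we need to merge dataframe1 and dataframe2 on "Place". dataframe1 contains one New York, two Los Angeles and three Dallas entries, but dataframe2 has only one row for each city. So we divide target by the number of times each place occurs in df1 (counting by name only, ignoring the numbers) and assign value1 and value2 to the corresponding places.

Is it possible to use regular expressions to account for all the spelling variations, spaces and special characters and get the desired dataframe?

2 answers:

Answer 0: (score: 0)

Here is a solution that produces the desired dataframe: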

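import pandas as pd

def extract_city(col):
    # Keep the leading run of letters and inner spaces, e.g. 'Dallas -1' -> 'Dallas'
    return col.str.extract(r'([a-zA-Z]+(?:\s+[a-zA-Z]+)*)')[0]

# Merge on the extracted city name; the array-like key ends up in a 'key_0' column
df = pd.merge(df1, df2, left_on=extract_city(df1['Place']), right_on=extract_city(df2['Place']))

df = df.drop(['key_0', 'Place_y'], axis=1).rename({'Place_x' : 'Place'}, axis=1)

# Split each city's target evenly across its rows (the column in df2 is 'target', lowercase)
df['target'] = df['target'] / df.groupby(extract_city(df['Place']))['Place'].transform('count')

df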
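
The snippet above assumes df1 and df2 already exist. As a minimal self-contained sketch of the same idea, the version below builds both frames from the question's sample data and merges on an explicit helper column. The names city_key, city and result are made up for illustration, and the left merge is a choice made here so that the rows keep the order of df1; it is not necessarily how the answer's author would write it.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Place": ["New York", "Los Angeles 1", "Los Angeles- 2",
                              "Dallas -1", "Dallas - 2", "Dallas3"]})
df2 = pd.DataFrame({"Place": ["New York", "Los Angeles", "Dallas 1"],
                    "target": [1000, 1500, 2000],
                    "value1": ["a", "c", "e"],
                    "value2": ["b", "d", "f"]})

def city_key(col):
    # Hypothetical helper: keep the leading letters-and-spaces part of each label.
    return col.str.extract(r"([A-Za-z]+(?:\s+[A-Za-z]+)*)")[0]

# Attach the normalized city name to both frames as an explicit 'city' column.
left = df1.assign(city=city_key(df1["Place"]))
right = df2.drop(columns="Place").assign(city=city_key(df2["Place"]))

# A left merge keeps every row of df1 in its original order.
result = left.merge(right, on="city", how="left")

# Divide each city's target by the number of df1 rows that share that city.
result["target"] = result["target"] / result.groupby("city")["target"].transform("count")
result = result.drop(columns="city")

print(result)
```

Using a named key column avoids relying on the automatically generated key_0 column and the _x/_y suffixes, at the cost of one extra drop at the end.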

Answer 1: (score: 0)

Another way to do this is as follows:

import pandas as pd
df1 = pd.DataFrame({'Place':['New York','Los Angeles 1','Los Angeles- 2','Dallas -1','Dallas - 2','Dallas3']})

print (df1)

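#create a column to compare both dataframes. Remove digits, hyphens and spaces
#(regex=True is needed explicitly in pandas 2.0+, where str.replace defaults to literal matching)
df1['Place_compare'] = df1.Place.str.replace(r'\d+|-| ', '', regex=True)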


df2 = pd.DataFrame({'Place':['New York','Los Angeles','Dallas 1'],
                    'target':[1000,1500,2000],
                    'value1':['a','c','e'],
                    'value2':['b','d','f']})

print (df2)

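#create a column to compare both dataframes. Remove digits, hyphens and spaces
#(regex=True is needed explicitly in pandas 2.0+, where str.replace defaults to literal matching)
df2['Place_compare'] = df2.Place.str.replace(r'\d+|-| ', '', regex=True)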

#count number of times the unique values of Place occurs in df1. assign to df2
df2['counts'] = df2['Place_compare'].map(df1['Place_compare'].value_counts())

#calculate new target based on number of occurrences of Place in df1
df2['new_target'] = (df2['target'] / df2['counts']).round(2)

#repeat each row by the number of times given in counts
df2 = df2.reindex(df2.index.repeat(df2['counts']))

#drop temp columns
df2.drop(['counts','Place_compare','target'], axis=1, inplace=True)

#rename new_target as target
df2 = df2.rename({'new_target': 'target'}, axis=1)
print (df2)

The output will be:

Dataframe1:

            Place
0        New York
1   Los Angeles 1
2  Los Angeles- 2
3       Dallas -1
4      Dallas - 2
5         Dallas3

Dataframe2:

         Place  target value1 value2
0     New York    1000      a      b
1  Los Angeles    1500      c      d
2     Dallas 1    2000      e      f

DataFrame updated with the repeated values:

         Place value1 value2   target
0     New York      a      b  1000.00
1  Los Angeles      c      d   750.00
1  Los Angeles      c      d   750.00
2     Dallas 1      e      f   666.67
2     Dallas 1      e      f   666.67
2     Dallas 1      e      f   666.67
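
Note that the output above carries the Place labels from df2 ("Los Angeles", "Dallas 1"), whereas the desired dataframe in the question keeps the labels from df1. One possible adjustment, sketched here under the same normalization rule as this answer, is to merge df1 with df2 on the Place_compare column instead of repeating rows. The names pattern, counts and out are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Place": ["New York", "Los Angeles 1", "Los Angeles- 2",
                              "Dallas -1", "Dallas - 2", "Dallas3"]})
df2 = pd.DataFrame({"Place": ["New York", "Los Angeles", "Dallas 1"],
                    "target": [1000, 1500, 2000],
                    "value1": ["a", "c", "e"],
                    "value2": ["b", "d", "f"]})

# Same normalization as the answer: drop digits, hyphens and whitespace.
pattern = r"\d+|-|\s"
df1["Place_compare"] = df1["Place"].str.replace(pattern, "", regex=True)
df2["Place_compare"] = df2["Place"].str.replace(pattern, "", regex=True)

# Number of df1 rows per normalized place, used to split the target.
counts = df1["Place_compare"].value_counts()
df2["target"] = (df2["target"] / df2["Place_compare"].map(counts)).round(2)

# Merge onto df1 so the original df1 labels and row order are kept.
out = (df1.merge(df2.drop(columns="Place"), on="Place_compare", how="left")
          .drop(columns="Place_compare"))

print(out)
```

Neither this sketch nor the answer handles misspelled city names; stripping digits, hyphens and whitespace only covers the formatting differences shown in the sample data.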