I'm looking for the best way to merge 2 dataframes on key1.str.endswith(key2). An example is sometimes better than words:
I want to merge df1 and df2 on product.str.endswith(color)
df1:
index product
1 a208-BLACK
2 a2008-WHITE
3 x307-PEARL-WHITE
4 aa-b307-WHITE
df2:
index color code
1 BLACK X1001
2 WHITE X7005
3 PEARL-WHITE X7055
to get:
df:
index product code
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
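Any ideas?
Answer 0 (score: 2)
I'm no regex expert, and the last row was the trickiest one, but the following works (here df is the product frame and df1 is the color lookup frame):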
In [402]:
df['code'] = df['product'].str.split('-').str[1:].str.join('-').str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[402]:
product code
index
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
Basically I split the product code on '-' and keep all the elements to the right of the first dash.
This leaves:
In [403]:
df['product'].str.split('-').str[1:]
Out[403]:
index
1 [BLACK]
2 [WHITE]
3 [PEARL, WHITE]
4 [b307, WHITE]
Name: product, dtype: object
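I then join the pieces back together with dashes and use a regex to keep only the runs of uppercase letters, which handles the last row (it drops b307), and join the matches again.
The last bit is to call map with the other df after setting its index to the color column; this looks up each color value and returns the corresponding code.
The regex isn't foolproof, but it works on your dataset.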
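To make the lookup step and the "not foolproof" caveat concrete, here is a minimal sketch, assuming pandas is installed. The names products, colors, lookup, keys and codes are chosen for illustration, and the row A208-BLACK is a hypothetical product that is not in the question's data: its prefix contains a capital letter, so the extracted key no longer matches any color.

```python
import pandas as pd

# Two rows from the question plus one hypothetical row with a capital in the prefix.
products = pd.Series(["a208-BLACK", "aa-b307-WHITE", "A208-BLACK"], name="product")
colors = pd.DataFrame({
    "color": ["BLACK", "WHITE", "PEARL-WHITE"],
    "code": ["X1001", "X7005", "X7055"],
})

# Series indexed by color; map() uses it like a dictionary.
lookup = colors.set_index("color")["code"]

# Keep only the runs of capital letters and re-join them with dashes.
keys = products.str.findall(r"[A-Z]+").str.join("-")

# Keys that are not in the lookup index come back as NaN.
codes = keys.map(lookup)
print(pd.DataFrame({"product": products, "key": keys, "code": codes}))
```

The first two rows get their codes, while the hypothetical row produces the key A-BLACK and therefore no code.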
Edit
I now realise we don't need that many joins:
In [409]:
df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[409]:
product code
index
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
Timings
In [414]:
%%timeit
import re
df['color'] = df['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))
pd.merge(df, df1, on='color')
1 loops, best of 3: 4.09 ms per loop
In [416]:
%%timeit
df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
100 loops, best of 3: 1.63 ms per loop
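The str method is more than 2x faster than the lambda version (1.63 ms vs 4.09 ms), which is perhaps not surprising, since the str methods are optimised compared with calling a Python lambda on every row.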
Updated timings
In [7]:
%%timeit
df1['color'] = df1['product'].str.extract(r'-([A-Z-]+)$')
pd.merge(df1, df2)
100 loops, best of 3: 4.51 ms per loop
In [9]:
%%timeit
df1['code'] = df1['product'].str.findall(r'[A-Z]+').str.join('-').map(df2.set_index('color')['code'])
100 loops, best of 3: 3.87 ms per loop
In [10]:
%%timeit
import re
df1['color'] = df1['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))
pd.merge(df1, df2, on='color')
100 loops, best of 3: 4.79 ms per loop
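So @unutbu's answer is slightly faster than @colonel beaveau's, but using map here is faster still.
In fact, if we combine @unutbu's regex str method with map, we get something faster than my original method: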
In [12]:
%%timeit
df1['product'].str.extract(r'-([A-Z-]+)$', expand=False).map(df2.set_index('color')['code'])
100 loops, best of 3: 2.17 ms per loop
So using map here is faster than merging.
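Both regex variants depend on the shape of the product strings. If that cannot be relied on, the match can be written literally in terms of endswith, as the question phrases it. The following is only a sketch under stated assumptions: pandas is installed, the helper code_for and the list pairs are names invented for illustration, and colors are tried longest first so that PEARL-WHITE wins over WHITE. It runs a Python loop per row, so it is likely slower than the str and map approaches timed above.

```python
import pandas as pd

df1 = pd.DataFrame({"product": ["a208-BLACK", "a2008-WHITE",
                                "x307-PEARL-WHITE", "aa-b307-WHITE"]})
df2 = pd.DataFrame({"color": ["BLACK", "WHITE", "PEARL-WHITE"],
                    "code": ["X1001", "X7005", "X7055"]})

# (color, code) pairs, longest color first, so "PEARL-WHITE" is tried before "WHITE".
pairs = sorted(zip(df2["color"], df2["code"]),
               key=lambda pair: len(pair[0]), reverse=True)

def code_for(product):
    """Return the code of the longest color that `product` ends with, else None."""
    for color, code in pairs:
        if product.endswith(color):
            return code
    return None

# Passing a function to map() applies it to each product string.
df1["code"] = df1["product"].map(code_for)
print(df1)
```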
Answer 1 (score: 1)
A fairly concise solution:
import re
import pandas as pd
df1['color'] = df1['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))
pd.merge(df1, df2, on='color')
# product color code
#0 a208-BLACK BLACK X1001
#1 a2008-WHITE WHITE X7005
#2 x307-PEARL-WHITE PEARL-WHITE X7055
#3 aa-b307-WHITE WHITE X7005
Answer 2 (score: 1)
You can use the vectorized string method str.extract with the regex r'-([A-Z-]+)$' to find the color.
df1['color'] = df1['product'].str.extract(r'-([A-Z-]+)$', expand=False)
Then pd.merge(df1, df2) will merge on the shared column (in this case, the color column):
result = pd.merge(df1, df2)
For example,
import io
import pandas as pd
df1 = '''\
index product
1 a208-BLACK
2 a2008-WHITE
3 x307-PEARL-WHITE
4 aa-b307-WHITE'''
df1 = pd.read_table(io.StringIO(df1), sep=r'\s+', index_col=0)
df2 = '''\
index color code
1 BLACK X1001
2 WHITE X7005
3 PEARL-WHITE X7055'''
df2 = pd.read_table(io.StringIO(df2), sep=r'\s+', index_col=0)
# expand=False makes str.extract return a Series for the single capture group
df1['color'] = df1['product'].str.extract(r'-([A-Z-]+)$', expand=False)
print(pd.merge(df1, df2))
yields
product color code
0 a208-BLACK BLACK X1001
1 a2008-WHITE WHITE X7005
2 aa-b307-WHITE WHITE X7005
3 x307-PEARL-WHITE PEARL-WHITE X7055
The regex pattern r'-([A-Z-]+)$' means
- # match a literal hyphen
( # followed by a group
[A-Z-]+ # of 1-or-more capital letters or hyphens
) # end of group
$ # followed by end of line
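The breakdown above can be checked without pandas. This is a small sketch using only the standard re module on the four product strings from the question; the variable names are illustrative.

```python
import re

# Same pattern as above: a hyphen, then capitals/hyphens up to the end.
pattern = re.compile(r"-([A-Z-]+)$")

products = ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]

# search() scans left to right, so the first hyphen followed only by
# capitals and hyphens until the end of the string starts the match.
colors = [pattern.search(p).group(1) for p in products]
print(colors)
# ['BLACK', 'WHITE', 'PEARL-WHITE', 'WHITE']
```

In aa-b307-WHITE the first hyphen is followed by a lowercase b, so the match can only start at the second hyphen, which is why b307 is left out.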