Suppose I have a list of brand names:
BRANDS = ['Samsung', 'Apple', 'Nike', .....]
DataFrame A has the following structure:
row | item_title | brand_name
1 | Apple 6S | Apple
2 | Nike BB Shoes | na <-- need to fill with Nike
3 | Samsung TV | na <-- need to fill with Samsung
4 | Used bike | na <-- nothing to do, because the title contains no brand name
....
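I want to fill the brand_name column with Nike for row 2 and with Samsung for row 3, because those values are null and item_title contains a keyword that can be found in the list BRANDS. How should I do this?
Answer 0 (score: 3)
A vectorized solution: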
In [168]: x = df.item_title.str.split(expand=True)
In [169]: df['brand_name'] = \
     ...:     df['brand_name'].fillna(x[x.isin(BRANDS)]
     ...:                             .ffill(axis=1)
     ...:                             .bfill(axis=1)
     ...:                             .iloc[:, 0])
In [170]: df
Out[170]:
row item_title brand_name
0 1 Apple 6S Apple
1 2 Nike BB Shoes Nike
2 3 Samsung TV Samsung
3 4 Used bike NaN
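The same fill can also be sketched with a regular expression instead of splitting the titles into columns. This is an alternative technique, not the answer's method. It assumes the missing brands are real NaN values rather than the string 'na', and the sample frame below is reconstructed from the question for illustration:

```python
import re

import numpy as np
import pandas as pd

BRANDS = ['Samsung', 'Apple', 'Nike']

# Sample data reconstructed from the question (assumption: missing = NaN).
df = pd.DataFrame({
    'row': [1, 2, 3, 4],
    'item_title': ['Apple 6S', 'Nike BB Shoes', 'Samsung TV', 'Used bike'],
    'brand_name': ['Apple', np.nan, np.nan, np.nan],
})

# One alternation pattern for all brands; re.escape guards names that
# contain regex metacharacters, \b restricts matches to whole words.
pattern = r'\b(' + '|'.join(re.escape(b) for b in BRANDS) + r')\b'

# With a single capture group and expand=False, str.extract returns a
# Series holding the first match per title, or NaN when nothing matches.
found = df['item_title'].str.extract(pattern, expand=False)

# fillna only touches rows whose brand_name is missing.
df['brand_name'] = df['brand_name'].fillna(found)
print(df)
```

Unlike the whitespace split, the word-boundary pattern also finds a brand that is followed by punctuation, such as "Nike, size 10".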
Answer 1 (score: 1)
One approach is to use apply():
import pandas as pd
BRANDS = ['Samsung', 'Apple', 'Nike']
def get_brand_name(row):
    # use `not` here: `~` on a plain Python bool is a bitwise NOT and is always truthy
    if not pd.isnull(row['brand_name']):
        # don't do anything if brand_name is not null
        return row['brand_name']
    item_title = row['item_title']
    title_words = map(str.title, item_title.split())
    for tw in title_words:
        if tw in BRANDS:
            # return first 'match'
            return tw
    # default return None
    return None
df['brand_name'] = df.apply(get_brand_name, axis=1)
print(df)
# row item_title brand_name
#0 1 Apple 6S Apple
#1 2 Nike BB Shoes Nike
#2 3 Samsung TV Samsung
#3 4 Used bike None
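Note: consider using a set instead of a list for BRANDS, because lookups will be faster. However, this will not work if you care about order.
Answer 2 (score: 0)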
You can get the desired result by writing a simple function. You can then combine .apply() with a lambda function to generate the desired column.
import numpy as np
def contains_any(s, arr):
    # return the first item of arr that occurs as a substring of s
    for item in arr:
        if item in s:
            return item
    return np.nan
df['brand_name'] = df['item_title'].apply(lambda x: contains_any(x, BRANDS))
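Assigning the result of apply() directly replaces the whole column, including brands that were already filled in. A minimal sketch that fills only the missing rows, assuming the column names and BRANDS list from the question and real NaN values for the gaps (the sample frame is reconstructed for illustration):

```python
import numpy as np
import pandas as pd

BRANDS = ['Samsung', 'Apple', 'Nike']

# Sample data reconstructed from the question (assumption: missing = NaN).
df = pd.DataFrame({
    'item_title': ['Apple 6S', 'Nike BB Shoes', 'Samsung TV', 'Used bike'],
    'brand_name': ['Apple', np.nan, np.nan, np.nan],
})

def contains_any(s, arr):
    """Return the first item of arr that is a substring of s, else NaN."""
    for item in arr:
        if item in s:
            return item
    return np.nan

# Compute a candidate brand for every title ...
matches = df['item_title'].apply(lambda s: contains_any(s, BRANDS))

# ... but only use it where brand_name is currently missing.
df['brand_name'] = df['brand_name'].fillna(matches)
print(df)
```

Because this is a substring test, a brand embedded in a longer word (for example "Applesauce") also counts as a match; comparing against the split words of the title avoids that.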