I have data like this:
import pandas as pd

foo = pd.DataFrame({'id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'],
                    'amount': [10, 30, 40, 15, 20, 12, 55, 45, 60, 75],
                    'description': [u'LYFT SAN FRANCISCO CA', u'XYZ STARBUCKS MINNEAPOLIS MN', u'HOLIDAY BEMIDJI MN',
                                    u'MCDONALDS MADISON WI', u'ABC SUPERAMERICA MI', u'SUBWAY ROCHESTER MN',
                                    u'NNT BURGER KING WI', u'UBER TRIP CA', u'superamerica CA', u'AMAZON NY']})
foo:
id amount description
A1 10 LYFT SAN FRANCISCO CA
A2 30 XYZ STARBUCKS MINNEAPOLIS MN
A3 40 HOLIDAY BEMIDJI MN
A4 15 MCDONALDS MADISON WI
A5 20 ABC SUPERAMERICA MI
A6 12 SUBWAY ROCHESTER MN
A7 55 NNT BURGER KING WI
A8 45 UBER TRIP CA
A9 60 superamerica CA
A10 75 AMAZON NY
I want to create a new column that categorizes each record based on keyword matches in the description column.
I did it as follows, with help from this answer:
import re
dict1 = {
    "LYFT": "cab_ride",
    "UBER": "cab_ride",
    "STARBUCKS": "Food",
    "MCDONALDS": "Food",
    "SUBWAY": "Food",
    "BURGER KING": "Food",
    "HOLIDAY": "Gas",
    "SUPERAMERICA": "Gas"
}
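def get_category_from_desc(x):
    try:
        # Return the category of the first keyword found in x, ignoring case.
        return next(dict1[k] for k in dict1 if re.search(k, x, re.IGNORECASE))
    except (StopIteration, TypeError):
        # No keyword matched, or x is not a string (e.g. NaN).
        return "Other"

foo['category'] = foo.description.map(get_category_from_desc)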
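This works, but I would like to know whether it is the best way to solve this problem. Since I have many more keywords that indicate a category, I would have to create a huge dictionary: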
dict1 = {
"STARBUCKS" : "Food",
"MCDONALDS" : "Food",
"SUBWAY" : "Food",
"BURGER KING" : "Food",
.
.
.
# ~50 more keys for "Food"
"HOLIDAY" : "Gas",
"SUPERAMERICA": "Gas",
.
.
.
# ~20 more keys for "Gas"
"WALMART" : "grocery",
"COSTCO": "grocery",
.
.
# ..... ~30 more keys for "grocery"
.
.
# ~ Many more categories with a large number of keys for each
}
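Edit: I would also like to know whether there is a way that does not require me to create a huge dictionary like the one above. Can I achieve this with a smaller data structure, for example: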
dict2 = {
"cab_ride" : ["LYFT", "UBER"], #....
"food" : ["STARBUCKS", "MCDONALDS", "SUBWAY", "BURGER KING"], #....
"gas" : ["HOLIDAY", "SUPERAMERICA"] #....
}
Answer 0 (score: 3)
I think this can be done quite easily with df.replace and regex-based replacement. You can then use df.where to handle the "Other" cases.
dict2 = {rf'.*{k}.*': v for k, v in dict1.items()}
cats = foo['description'].replace(dict2, regex=True)
cats.where(cats != foo['description'], 'Other')
0 cab_ride
1 Food
2 Gas
3 Food
4 Gas
5 Food
6 Food
7 cab_ride
8 Other
9 Other
Name: description, dtype: object
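To see why each key is wrapped in .* on both sides, here is a minimal sketch that runs both variants on a two-row Series. The names demo, bare, wrapped and labelled are made up for this illustration and are not part of the answer above. It relies on regex replacement substituting only the portion of the string that matched.

```python
import pandas as pd

demo = pd.Series(['LYFT SAN FRANCISCO CA', 'AMAZON NY'])

# Bare keyword: only the matched substring is replaced.
bare = demo.replace({'LYFT': 'cab_ride'}, regex=True)

# Wrapped in .*: the match spans the whole string, so the entire value is replaced.
wrapped = demo.replace({r'.*LYFT.*': 'cab_ride'}, regex=True)

# Rows that no pattern touched still equal the input, so label them 'Other'.
labelled = wrapped.where(wrapped != demo, 'Other')
print(labelled.tolist())
```

With the bare keyword the first row would keep the trailing city and state, which is why the comparison against the original column only works as an "unmatched" test when the whole value is replaced.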
Another option is to use str.extract together with map:
from collections import defaultdict
dict2 = defaultdict(lambda: 'Other')
dict2.update(dict1)
foo['description'].str.extract(rf"({'|'.join(dict1)})", expand=False).map(dict2)
0 cab_ride
1 Food
2 Gas
3 Food
4 Gas
5 Food
6 Food
7 cab_ride
8 Other
9 Other
Name: description, dtype: object
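Note that both outputs above label the lowercase superamerica row as Other, whereas the function in the question uses re.IGNORECASE and would classify it as Gas. If case-insensitive matching is wanted, one possible sketch is below. It assumes all dictionary keys are stored in upper case, and the names descriptions and keyword_to_category are hypothetical stand-ins for the question's column and dict1.

```python
import re
import pandas as pd

# Hypothetical sample: a subset of the descriptions and keys from the question.
descriptions = pd.Series(['LYFT SAN FRANCISCO CA', 'superamerica CA', 'AMAZON NY'])
keyword_to_category = {'LYFT': 'cab_ride', 'SUPERAMERICA': 'Gas'}

# re.escape guards against keywords that contain regex metacharacters.
pattern = '(' + '|'.join(re.escape(k) for k in keyword_to_category) + ')'

# Match regardless of case; rows with no keyword come back as NaN.
matched = descriptions.str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Upper-case the match so 'superamerica' finds the 'SUPERAMERICA' key.
categories = matched.str.upper().map(keyword_to_category).fillna('Other')
print(categories.tolist())
```

The upper-casing step only works under the stated assumption about the keys. With mixed-case keys, the lookup would need a normalized copy of the dictionary instead.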
Answer 1 (score: 3)
You can use the .str accessor with extract, and use join on the dictionary keys to build the regular expression.
foo = pd.DataFrame({'id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'],
'amount': [10, 30, 40, 15, 20, 12, 55, 45, 60, 75],
'description': [u'LYFT SAN FRANCISCO CA', u'XYZ STARBUCKS MINNEAPOLIS MN', u'HOLIDAY BEMIDJI MN',
u'MCDONALDS MADISON WI', u'ABC SUPERAMERICA MI', u'SUBWAY ROCHESTER MN',
u'NNT BURGER KING WI', u'UBER TRIP CA', u'superamerica CA', u'AMAZON NY']})
dict1 = {
"LYFT" : "cab_ride",
"UBER" : "cab_ride",
"STARBUCKS" : "Food",
"MCDONALDS" : "Food",
"SUBWAY" : "Food",
"BURGER KING" : "Food",
"HOLIDAY" : "Gas",
"SUPERAMERICA": "Gas"
}
regstr = '(' + '|'.join(dict1.keys()) + ')'
foo['category'] = foo['description'].str.extract(regstr).squeeze().map(dict1).fillna('Other')
print(foo)
Output:
id amount description category
0 A1 10 LYFT SAN FRANCISCO CA cab_ride
1 A2 30 XYZ STARBUCKS MINNEAPOLIS MN Food
2 A3 40 HOLIDAY BEMIDJI MN Gas
3 A4 15 MCDONALDS MADISON WI Food
4 A5 20 ABC SUPERAMERICA MI Gas
5 A6 12 SUBWAY ROCHESTER MN Food
6 A7 55 NNT BURGER KING WI Food
7 A8 45 UBER TRIP CA cab_ride
8 A9 60 superamerica CA Other
9 A10 75 AMAZON NY Other
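Regarding the edit in the question: the smaller category-to-keywords structure (dict2 there) can be inverted once into a keyword-to-category lookup and then used with the same extract and map approach. The following is a sketch under that assumption. The names category_keywords, keyword_to_category and descriptions are illustrative, and the sample rows are a subset of the question's data.

```python
import re
import pandas as pd

# Category -> keywords, mirroring the dict2 shown in the question.
category_keywords = {
    'cab_ride': ['LYFT', 'UBER'],
    'food': ['STARBUCKS', 'MCDONALDS', 'SUBWAY', 'BURGER KING'],
    'gas': ['HOLIDAY', 'SUPERAMERICA'],
}

# Invert once into keyword -> category for lookups.
keyword_to_category = {
    keyword: category
    for category, keywords in category_keywords.items()
    for keyword in keywords
}

descriptions = pd.Series(['XYZ STARBUCKS MINNEAPOLIS MN', 'NNT BURGER KING WI',
                          'ABC SUPERAMERICA MI', 'AMAZON NY'])

# One alternation over every keyword; re.escape keeps keywords literal.
pattern = '(' + '|'.join(re.escape(k) for k in keyword_to_category) + ')'
categories = (descriptions.str.extract(pattern, expand=False)
              .map(keyword_to_category)
              .fillna('Other'))
print(categories.tolist())
```

The flat lookup is still built in memory, but only the compact per-category lists have to be written and maintained by hand.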