Regex matching from dictionaries in a pandas DataFrame column

Time: 2020-06-05 08:59:39

Tags: python pandas dictionary

I have a DataFrame whose 'product_location' column holds the data below. Each assignment shows the value of a single row, for reference.

df.product_location[0]=[{'product':'christmas-socks-2019','store':'Downtown-A,Montgomery'}, {'product':'easter-socks-2018','store':'Euston'},{'product':'easter-socks-2019','source':'Euston'}]
df.product_location[1]=[{'product':'christmas-mugs-2019','store':'Montgomery'}, {'product':'easter-mugs-2018','store':'Euston, Downtown-B'},{'product':'easter-mugs-2019','source':'High-Street'}]
df.product_location[2]=[{'product':'christmas-card-2019','store':'Downtown-A, Montgomery'}, {'product':'easter-card-2018','store':'Euston'},{'product':'easter-card-2019','source':'Euston'}]
df.product_location[3]=[{'product':'christmas-chocolate-2019','store':'Downtown-A'}, {'product':'easter-chocolate-2018','store':'Euston'},{'product':'easter-chocolate-2017','source':'Euston'}]

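I am trying to use a regex to extract the year (e.g. 2019, 2018) from each product name, count the number of stores for each product, and return the year with the highest count.

For example, for row [0] I expect the output 2019, because it has the most stores ('Downtown-A,Montgomery,Euston').

Expected output (blank if no single year has the highest count):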

[0] '2019'
[1] (blank)
[2] '2019'
[3] (blank)

What is the best way to do this for every row in the DataFrame?

1 Answer:

Answer 0: (score: 0)

Treat a single row as a list, db:

from collections import Counter

db=[{'product':'christmas-socks-2019','store':'Downtown-A,Montgomery'}, {'product':'easter-socks-2018','store':'Euston'},{'product':'easter-socks-2019','source':'Euston'}]

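Extract only the years from the list of dictionaries and count how often each one occurs:

products = [int(d['product'].split('-')[-1]) for d in db]
# most_common() sorts by count, so counter[0] is the most frequent year
counter = Counter(products).most_common()

Then apply your condition (blank when the top count is tied):

# A tie for first place means no single year wins
if len(counter) > 1 and counter[0][1] == counter[1][1]:
    print('blank')
else:
    print(counter[0][0])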
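
Since the question asks for a regular expression, the year can also be pulled out with the re module instead of split. This is only a minimal sketch, assuming the year is always the last four digits of the product name; extract_year is a hypothetical helper name, not something from the answer above.

```python
import re
from collections import Counter


def extract_year(product):
    """Return the trailing four-digit year as a string, or None if absent."""
    match = re.search(r'-(\d{4})$', product)
    return match.group(1) if match else None


db = [
    {'product': 'christmas-socks-2019', 'store': 'Downtown-A,Montgomery'},
    {'product': 'easter-socks-2018', 'store': 'Euston'},
    {'product': 'easter-socks-2019', 'source': 'Euston'},
]

# Drop products whose name does not end in a year
years = [y for y in (extract_year(d['product']) for d in db) if y is not None]
year_counts = Counter(years)
```

Unlike int(...split('-')[-1]), this does not raise an error for a product name that lacks a year suffix; such names are simply skipped.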

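To apply this logic to every row of the DataFrame, you can try the following:

# Iterate over the column directly, so the index labels do not matter
for db in df['product_location']:
    products = [int(d['product'].split('-')[-1]) for d in db]
    # most_common() sorts by count, so counter[0] is the most frequent year
    counter = Counter(products).most_common()

    # A tie for first place means no single year wins
    if len(counter) > 1 and counter[0][1] == counter[1][1]:
        print('blank')
    else:
        print(counter[0][0])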
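
The loop above counts how many products carry each year, whereas the question asks about stores. The following is a rough sketch of a store-based count, not a definitive solution. It assumes that store names are comma-separated, that the 'source' key holds store names when 'store' is missing (as the sample data suggests), and that distinct stores are what should be counted. top_store_year is a hypothetical name.

```python
import re
from collections import defaultdict

YEAR_RE = re.compile(r'(\d{4})$')  # four digits at the end of the product name


def top_store_year(entries):
    """Return the year with the most distinct stores, or '' on a tie."""
    stores_by_year = defaultdict(set)
    for entry in entries:
        match = YEAR_RE.search(entry.get('product', ''))
        if match is None:
            continue  # skip products without a trailing year
        # Assumption: 'source' holds store names when 'store' is missing
        raw = entry.get('store') or entry.get('source') or ''
        names = {name.strip() for name in raw.split(',') if name.strip()}
        stores_by_year[match.group(1)] |= names

    counts = {year: len(names) for year, names in stores_by_year.items()}
    best = max(counts.values(), default=0)
    winners = [year for year, n in counts.items() if n == best]
    return winners[0] if len(winners) == 1 else ''


# Rows [0] and [1] from the question
rows = [
    [{'product': 'christmas-socks-2019', 'store': 'Downtown-A,Montgomery'},
     {'product': 'easter-socks-2018', 'store': 'Euston'},
     {'product': 'easter-socks-2019', 'source': 'Euston'}],
    [{'product': 'christmas-mugs-2019', 'store': 'Montgomery'},
     {'product': 'easter-mugs-2018', 'store': 'Euston, Downtown-B'},
     {'product': 'easter-mugs-2019', 'source': 'High-Street'}],
]
results = [top_store_year(row) for row in rows]
```

With the DataFrame from the question, the same function could be applied to the whole column with df['product_location'].apply(top_store_year).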