How can I parse these CSV files and build a summary table without using Pandas?

Time: 2019-05-29 21:52:45

Tags: python pandas dataframe

I was able to figure out how to do this with Pandas, but without it I am completely lost. I am given two CSV files:

order_products:

order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0

products:

product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3

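I then need to create a new table that lists, for each department, the number of orders placed in that department, the number of first-time orders (rows where reordered is 0), and the ratio for that department (number of first-time orders / number of orders).

So the resulting table looks like this: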

department_id,number_of_orders,number_of_first_orders,percentage
3,2,1,0.50
4,2,0,0.00
12,1,0,0.00
13,2,1,0.50
16,2,0,0.00

My solution with Pandas:

import pandas as pd

orders = pd.read_csv("../insight_testsuite/tests/test_1/input/order_products.csv")
products = pd.read_csv("../insight_testsuite/tests/test_1/input/products.csv")

orders.drop(['add_to_cart_order'], axis=1, inplace=True)
products.drop(['aisle_id', 'product_name'], axis=1, inplace=True)

dep = pd.merge(orders, products)

dep = (dep.groupby('department_id')['reordered']
         .agg([('number_of_orders','size'), 
               ('number_of_first_orders', lambda x: x.eq(0).sum())
               ])
         .reset_index())

# "%.2f" % <Series> raises a TypeError, so format each ratio individually.
dep['percentage'] = ((dep['number_of_first_orders'] / dep['number_of_orders'])
                     .map('{:.2f}'.format))

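But with plain Python, AFAIK, you can only walk through a CSV file line by line and evaluate from there. So I am not sure how to perform this kind of analysis without Pandas.

1 Answer:

Answer 0 (score: 1)

Well, you can. It just takes a lot of work: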

from collections import defaultdict
import pandas as pd  # used only at the very end to display the finished report

s1 = '''order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0'''
s2 = '''product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3'''
result = '''department_id,number_of_orders,number_of_first_orders,percentage
3,2,1,0.50
4,2,0,0.00
12,1,0,0.00
13,2,1,0.50
16,2,0,0.00'''

lines = s1.split('\n')
# lines
# ['order_id,product_id,add_to_cart_order,reordered', '2,33120,1,1', '2,28985,2,1', '2,9327,3,0', '2,45918,4,1',
#  '3,17668,1,1', '3,46667,2,1', '3,17461,4,1', '3,32665,3,1', '4,46842,1,0']

splitlines = [x.split(',') for x in lines]
# splitlines
# [['order_id', 'product_id', 'add_to_cart_order', 'reordered'], ['2', '33120', '1', '1'], ['2', '28985', '2', '1'],
#  ['2', '9327', '3', '0'], ['2', '45918', '4', '1'], ['3', '17668', '1', '1'], ['3', '46667', '2', '1'],
#  ['3', '17461', '4', '1'], ['3', '32665', '3', '1'], ['4', '46842', '1', '0']]

orders = {}
for j, k in enumerate(splitlines[0]):
    orders[k] = [int(splitlines[i][j]) for i in range(1, len(splitlines))]

# orders
# {'order_id': [2, 2, 2, 2, 3, 3, 3, 3, 4], 'product_id': [33120, 28985, 9327, 45918, 17668, 46667, 17461, 32665, 46842],
#  'add_to_cart_order': [1, 2, 3, 4, 1, 2, 4, 3, 1], 'reordered': [1, 1, 0, 1, 1, 1, 1, 1, 0]}

lines = s2.split('\n')
# lines
# ['product_id,product_name,aisle_id,department_id', '9327,Garlic Powder,104,13',
#  '17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12',
#  '17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16', '28985,Michigan Organic Kale,83,4',
#  '32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3', '33120,Organic Egg Whites,86,16', '45918,Coconut Butter,19,13',
#  '46667,Organic Ginger Root,83,4', '46842,Plain Pre-Sliced Bagels,93,3']
splitlines = [x.split(',') for x in lines]
# splitlines
# [['product_id', 'product_name', 'aisle_id', 'department_id'], ['9327', 'Garlic Powder', '104', '13'],
#  ['17461', 'Air Chilled Organic Boneless Skinless Chicken Breasts', '35', '12'],
#  ['17668', 'Unsweetened Chocolate Almond Breeze Almond Milk', '91', '16'],
#  ['28985', 'Michigan Organic Kale', '83', '4'], ['32665', 'Organic Ezekiel 49 Bread Cinnamon Raisin', '112', '3'],
#  ['33120', 'Organic Egg Whites', '86', '16'], ['45918', 'Coconut Butter', '19', '13'],
#  ['46667', 'Organic Ginger Root', '83', '4'], ['46842', 'Plain Pre-Sliced Bagels', '93', '3']]
products = {}
for j, k in enumerate(splitlines[0]):
    products[k] = [splitlines[i][j] for i in range(1, len(splitlines))]

# products
# {'product_id': ['9327', '17461', '17668', '28985', '32665', '33120', '45918', '46667', '46842'],
#  'product_name': ['Garlic Powder', 'Air Chilled Organic Boneless Skinless Chicken Breasts',
#                   'Unsweetened Chocolate Almond Breeze Almond Milk', 'Michigan Organic Kale',
#                   'Organic Ezekiel 49 Bread Cinnamon Raisin', 'Organic Egg Whites', 'Coconut Butter',
#                   'Organic Ginger Root', 'Plain Pre-Sliced Bagels'],
#  'aisle_id': ['104', '35', '91', '83', '112', '86', '19', '83', '93'],
#  'department_id': ['13', '12', '16', '4', '3', '16', '13', '4', '3']}

departments = list(set(products['department_id']))
# departments
# ['13', '16', '12', '3', '4']


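# Count order lines per department by walking the order rows (not the product
# catalog), so a product that is ordered several times is counted every time.
dept_of = dict(zip(products['product_id'], products['department_id']))
order_counts = defaultdict(int)
for prod in orders['product_id']:
    order_counts[dept_of[str(prod)]] += 1
# Drop departments without any orders (this mirrors the inner merge in Pandas).
departments = [dep for dep in departments if dep in order_counts]

# order_counts
# defaultdict(<class 'int'>, {'16': 2, '4': 2, '13': 2, '12': 1, '3': 2})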

report = {}
departments.sort(key=lambda x: int(x))
# departments
# ['3', '4', '12', '13', '16']
report['department_id'] = departments
report['number_of_orders'] = [order_counts[dep] for dep in report['department_id']]
# report
# {'department_id': ['3', '4', '12', '13', '16'], 'number_of_orders': [2, 2, 1, 2, 2]}
# Group product ids by the department they belong to.
department_product = defaultdict(list)
for prod_id, dep_id in zip(products['product_id'], products['department_id']):
    if dep_id in departments:
        department_product[dep_id].append(prod_id)

# department_product
# defaultdict(<class 'list'>, {'13': ['9327', '45918'], '12': ['17461'], '16': ['17668', '33120'], '4': ['28985', '46667'], '3': ['32665', '46842']})

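# Invert the mapping: product id -> department id. This dictionary plays the
# role of the foreign key that joins an order row to its department.
product_department = {}
for dep, prodlist in department_product.items():
    for prod in prodlist:
        product_department[prod] = dep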

# product_department
# {'9327': '13', '45918': '13', '17461': '12', '17668': '16', '33120': '16', '28985': '4', '46667': '4', '32665': '3',
#  '46842': '3'}

# A row with reordered == 0 is a first-time order; credit it to the product's department.
first_order_count = defaultdict(int)
for prod, reordered in zip(orders['product_id'], orders['reordered']):
    dep = product_department[str(prod)]
    if dep in departments and int(reordered) == 0:
        first_order_count[dep] += 1
# first_order_count
# defaultdict(<class 'int'>, {'13': 1, '3': 1})

report['number_of_first_orders'] = [first_order_count[dep] for dep in report['department_id']]
report['first_order_ratio'] = [q[0] / q[1] for q in zip(report['number_of_first_orders'], report['number_of_orders'])]
# report
# {'department_id': ['3', '4', '12', '13', '16'], 'number_of_orders': [2, 2, 1, 2, 2],
#  'number_of_first_orders': [1, 0, 0, 1, 0], 'first_order_ratio': [0.5, 0.0, 0.0, 0.5, 0.0]}
reportdf = pd.DataFrame.from_dict(report)

#   department_id  number_of_orders  number_of_first_orders  first_order_ratio
# 0             3                 2                       1                0.5
# 1             4                 2                       0                0.0
# 2            12                 1                       0                0.0
# 3            13                 2                       1                0.5
# 4            16                 2                       0                0.0
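
If the goal is to avoid Pandas entirely, the final display step can also be done with the standard library. The following is only a minimal sketch: it assumes a column-oriented `report` dictionary of the shape shown in the comments above (the values are copied here so the block runs on its own), and `report_to_csv` is a hypothetical helper name, not something from the original answer. It writes the ratio under the `percentage` header used in the question, formatted to two decimals.

```python
import csv
import io

# Same shape as the `report` dictionary built above (values copied from its comment).
report = {
    'department_id': ['3', '4', '12', '13', '16'],
    'number_of_orders': [2, 2, 1, 2, 2],
    'number_of_first_orders': [1, 0, 0, 1, 0],
    'first_order_ratio': [0.5, 0.0, 0.0, 0.5, 0.0],
}


def report_to_csv(report):
    """Render a column-oriented dict as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # The question names the last column "percentage".
    writer.writerow(['department_id', 'number_of_orders',
                     'number_of_first_orders', 'percentage'])
    # zip turns the four parallel columns back into rows.
    rows = zip(report['department_id'], report['number_of_orders'],
               report['number_of_first_orders'], report['first_order_ratio'])
    for dep, n_orders, n_first, ratio in rows:
        writer.writerow([dep, n_orders, n_first, f'{ratio:.2f}'])
    return buffer.getvalue()


csv_text = report_to_csv(report)
print(csv_text)
```

To write a real file instead, the same `csv.writer` calls can be pointed at a file opened with `open(path, 'w', newline='')`.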

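Provided for your entertainment. I may come back and add some strategic comments on where the dictionaries are built to stand in for the foreign key.

Cheers!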
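
As a rough illustration of the foreign-key idea, here is a more compact sketch built on the standard `csv` module. It is not the code from the answer above, just one possible arrangement: `department_report` is a hypothetical function name, and the sample data from the question is inlined via `io.StringIO` so the block is self-contained. A plain dictionary maps each `product_id` to its `department_id`, and the order rows are then counted in a single pass.

```python
import csv
import io
from collections import defaultdict

ORDER_PRODUCTS = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
3,17668,1,1
3,46667,2,1
3,17461,4,1
3,32665,3,1
4,46842,1,0
"""

PRODUCTS = """product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3
"""


def department_report(order_file, product_file):
    """Return (department_id, orders, first_orders, ratio) tuples (hypothetical helper)."""
    # Foreign-key lookup: product_id -> department_id.
    product_department = {row['product_id']: row['department_id']
                          for row in csv.DictReader(product_file)}

    n_orders = defaultdict(int)
    n_first = defaultdict(int)
    for row in csv.DictReader(order_file):
        dep = product_department[row['product_id']]
        n_orders[dep] += 1
        # csv yields strings, so compare against '0' for a first-time order.
        if row['reordered'] == '0':
            n_first[dep] += 1

    # Only departments that received orders appear, sorted numerically.
    return [(dep, n_orders[dep], n_first[dep], n_first[dep] / n_orders[dep])
            for dep in sorted(n_orders, key=int)]


rows = department_report(io.StringIO(ORDER_PRODUCTS), io.StringIO(PRODUCTS))
for row in rows:
    print(row)
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline='')`. Because the orders are read one row at a time, only the product lookup and the per-department counters need to fit in memory.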