Question

I am running a batch process on a dataset like this:

data =

[
        '{"CustomerId": "f796bce5-f416-502c-a1c5-6e7c57a3676d", "Email": "fname@emailreaction.com", "FirstName": "fname", "Surname": "lname", "DateOfBirth": "1970-02-01"}',
        '{"CustomerId": "f796bce5-f416-502c-a1c5-6e7c57a3676d", "Email": "business@emailreaction.org", "FirstName": "Lan-lor", "Surname": "Lord-Smith", "DateOfBirth": "1966-02-16"}',
        '{"CustomerId": "BBB-6571-589b-8b6e-dd4f6d", "Email": "second@gmail.com", "FirstName": "Mark", "Surname": "Spenser", "DateOfBirth": "1987-09-20"}',
        '{"CustomerId": "EEE-6571-589b-8b6e-dd4f6d", "Email": "fifth@gmail.com", "FirstName": "Bob", "Surname": "Lein", "DateOfBirth": "1986-10-21"}',
        '{"CustomerId": "BBB-6571-589b-8b6e-dd4f6d", "Email": "landlord@emailreaction.org", "FirstName": "Lan-lor", "Surname": "Lord-Smith", "DateOfBirth": "1966-02-16"}',
        '{"CustomerId": "AAA-6571-589b-8b6e-dd4f6d", "Email": "first@gmail.com", "FirstName": "Steve", "Surname": "Jobs", "DateOfBirth": "1985-08-21"}',
        '{"CustomerId": "AAA-6571-589b-8b6e-dd4f6d", "Email": "third@gmail.com", "FirstName": "Jeniffer", "Surname": "Sue", "DateOfBirth": "1981-07-21"}',
        '{"CustomerId": "DDD-6571-589b-8b6e-dd4f6d", "Email": "fourth@gmail.com", "FirstName": "Tim", "Surname": "Rob", "DateOfBirth": "1979-12-17"}'
......
about 1 million rows
......
]

For batching, I use .groupby() in pandas. I then need to convert each group from a DataFrame to a dict, and .to_dict() is slow. In my function this is: result = [pd.DataFrame.to_dict(group, orient="records") for name, group in group_by]

What is wrong?

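import json

import pandas as pd

def get_batched_list_by_id(data, batch_by="CustomerId"):
    # Parse every JSON string, build one DataFrame, and group it by the key column
    group_by = pd.DataFrame([json.loads(i) for i in data]).groupby(batch_by)
    # Convert each group back into a list of row dicts (this is the slow step)
    result = [pd.DataFrame.to_dict(group, orient="records") for name, group in group_by]
    return result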

I expect this result:

[
 [{'CustomerId': 'AAA-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1985-08-21', 'Email': 'first@gmail.com', 'FirstName': 'Steve', 'Surname': 'Jobs'}, {'CustomerId': 'AAA-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1981-07-21', 'Email': 'third@gmail.com', 'FirstName': 'Jeniffer', 'Surname': 'Sue'}],
 [{'CustomerId': 'BBB-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1987-09-20', 'Email': 'second@gmail.com', 'FirstName': 'Mark', 'Surname': 'Spenser'}, {'CustomerId': 'BBB-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1966-02-16', 'Email': 'landlord@emailreaction.org', 'FirstName': 'Lan-lor', 'Surname': 'Lord-Smith'}],
 [{'CustomerId': 'DDD-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1979-12-17', 'Email': 'fourth@gmail.com', 'FirstName': 'Tim', 'Surname': 'Rob'}], 
 [{'CustomerId': 'EEE-6571-589b-8b6e-dd4f6d', 'DateOfBirth': '1986-10-21', 'Email': 'fifth@gmail.com', 'FirstName': 'Bob', 'Surname': 'Lein'}], 
 [{'CustomerId': 'f796bce5-f416-502c-a1c5-6e7c57a3676d', 'DateOfBirth': '1970-02-01', 'Email': 'fname@emailreaction.com', 'FirstName': 'fname', 'Surname': 'lname'}, {'CustomerId': 'f796bce5-f416-502c-a1c5-6e7c57a3676d', 'DateOfBirth': '1966-02-16', 'Email': 'business@emailreaction.org', 'FirstName': 'Lan-lor', 'Surname': 'Lord-Smith'}] 
....about 1 million....
]

I do get this result, but the function takes about 30 minutes to run.

Answer 1

You don't need pandas for such a simple groupby:

import json
from collections import defaultdict

def get_batched_list_by_id_no_pandas(data, batch_by="CustomerId"):
    # Join the JSON strings into one JSON array and parse it in a single call
    dicts = json.loads("[" + ", ".join(data) + "]")
    # Create a defaultdict of lists
    temp = defaultdict(list)
    for _dict in dicts:
        # Put each sub dict into temp keyed by `batch_by`
        temp[_dict[batch_by]].append(_dict)
    # Groups come back in order of first appearance, not sorted by key
    return list(temp.values())
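
One difference worth noting: the expected result in the question lists the groups in sorted key order, which is what pandas' groupby produces by default, while the function above returns groups in order of first appearance. If the order matters, a minimal sketch along these lines should reproduce it. The function name get_batched_list_sorted and the shortened sample records are made up for illustration and are not part of the original answer.

```python
import json
from collections import defaultdict


def get_batched_list_sorted(data, batch_by="CustomerId"):
    """Group JSON strings by `batch_by`, returning groups in sorted key order."""
    groups = defaultdict(list)
    for line in data:
        record = json.loads(line)
        groups[record[batch_by]].append(record)
    # pandas' groupby sorts the group keys by default; mimic that here
    return [groups[key] for key in sorted(groups)]


# Hypothetical, shortened sample records (illustration only)
sample = [
    '{"CustomerId": "BBB", "Email": "second@gmail.com"}',
    '{"CustomerId": "AAA", "Email": "first@gmail.com"}',
    '{"CustomerId": "BBB", "Email": "landlord@emailreaction.org"}',
]
batches = get_batched_list_sorted(sample)
```

Sorting only the distinct keys adds little work compared with parsing the JSON, and rows inside each group keep their original input order.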

Compare the timings of this function with yours (only on the sample you showed):

%timeit get_batched_list_by_id(data): 3.85 ms ± 48.8 µs per loop

%timeit get_batched_list_by_id_no_pandas(data): 13.9 µs ± 60.7 ns

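That is nearly a 300x speedup. If the same ratio holds on the full data set, the work that takes you 30 minutes should finish in about 7 seconds.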
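
The arithmetic behind that estimate can be checked directly. This is only a rough projection: it assumes the ratio measured on the 8-row sample carries over unchanged to about 1 million rows, which has not been measured here.

```python
pandas_time = 3.85e-3  # seconds per loop, pandas version (from the timing above)
plain_time = 13.9e-6   # seconds per loop, defaultdict version (from the timing above)

# Ratio of the two measured timings
speedup = pandas_time / plain_time

# Scale the reported 30-minute runtime by that ratio
projected_seconds = 30 * 60 / speedup

print(f"speedup: {speedup:.0f}x, projected runtime: {projected_seconds:.1f} s")
```

Timing both functions on a larger slice of the real data would give a more reliable figure than this extrapolation.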
