I have a file "customers.txt" containing data on which items each customer bought, in the following format:
customer_21: item_575,item_2703,...
customer_11: item_454,item_158,...
customer_10: item_1760,item_613,...
customer_4: item_1545,item_1312,...
customer_6: item_2608,item_1062,...
customer_23: item_1659,item_2610,...
customer_14: item_2858,item_2007,...
Another CSV file, "stores.txt", contains data on the items in each store. I read the files and built a DataFrame like this:
customers_df = pd.DataFrame(index=stores.Item.unique(),
                            columns=[line.split(':')[0] for line in open('customers.txt').readlines()])
for customer in customers_df.columns:
    for item in customers_df.index:
        customers_df.loc[item, customer] = item in customers_dict[customer]
However, the code slows down drastically as I add data for more customers. Is there an efficient way to do this? The end goal is to get all the information about which users bought which items into one place where I can do further analysis. The file is updated automatically, and so far a single run takes about 6-7 minutes.
Answer 0 (score: 0)
In this case I would use sklearn.preprocessing.MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# engine="python" is required for the regex separator. The original snippet
# also omitted instantiating the binarizer; sparse_output=True is needed so
# fit_transform returns a scipy sparse matrix for DataFrame.sparse.from_spmatrix.
df = pd.read_csv("customers.txt", sep=r":\s*", header=None,
                 names=["cust", "items"], engine="python")
mlb = MultiLabelBinarizer(sparse_output=True)
res = (pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(df["items"].str.split(",")),
                                         index=df.index,
                                         columns=mlb.classes_)
       .set_index(df["cust"]))
Result:
In [24]: res
Out[24]:
item_1062 item_1312 item_1545 item_158 item_1659 ... item_2703 item_2858 item_454 item_575 \
cust ...
customer_21 0 0 0 0 0 ... 1 0 0 1
customer_11 0 0 0 1 0 ... 0 0 1 0
customer_10 0 0 0 0 0 ... 0 0 0 0
customer_4 0 1 1 0 0 ... 0 0 0 0
customer_6 1 0 0 0 0 ... 0 0 0 0
customer_23 0 0 0 0 1 ... 0 0 0 0
customer_14 0 0 0 0 0 ... 0 1 0 0
item_613
cust
customer_21 0
customer_11 0
customer_10 1
customer_4 0
customer_6 0
customer_23 0
customer_14 0
[7 rows x 14 columns]
In [25]: res.dtypes
Out[25]:
item_1062 Sparse[int32, 0]
item_1312 Sparse[int32, 0]
item_1545 Sparse[int32, 0]
item_158 Sparse[int32, 0]
item_1659 Sparse[int32, 0]
item_1760 Sparse[int32, 0]
item_2007 Sparse[int32, 0]
item_2608 Sparse[int32, 0]
item_2610 Sparse[int32, 0]
item_2703 Sparse[int32, 0]
item_2858 Sparse[int32, 0]
item_454 Sparse[int32, 0]
item_575 Sparse[int32, 0]
item_613 Sparse[int32, 0]
dtype: object
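With all purchases in one DataFrame, the "which users bought a particular item" lookup from the question becomes a simple boolean mask. A minimal sketch on a hypothetical two-customer sample (the customer and item names are borrowed from the question's data; the in-memory dict stands in for reading customers.txt):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical two-customer sample standing in for customers.txt
df = pd.DataFrame({"cust": ["customer_21", "customer_11"],
                   "items": ["item_575,item_2703", "item_454,item_575"]})

# sparse_output=True yields a scipy sparse matrix, as from_spmatrix expects
mlb = MultiLabelBinarizer(sparse_output=True)
res = (pd.DataFrame.sparse.from_spmatrix(
           mlb.fit_transform(df["items"].str.split(",")),
           index=df.index,
           columns=mlb.classes_)
       .set_index(df["cust"]))

# All customers who bought item_575, via a boolean mask on the sparse column
buyers = res[res["item_575"] == 1].index.tolist()
print(buyers)  # -> ['customer_21', 'customer_11']
```

Because the columns are sparse, memory stays proportional to the number of purchases rather than customers × items, which is what makes this scale where the nested .loc loop did not.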