I have a file "customers.txt" containing data on which items each customer bought, in the following format:
customer_21: item_575,item_2703,...
customer_11: item_454,item_158,...
customer_10: item_1760,item_613,...
customer_4: item_1545,item_1312,...
customer_6: item_2608,item_1062,...
customer_23: item_1659,item_2610,...
customer_14: item_2858,item_2007,...
Another CSV file, "stores.txt", contains data on the items in each store. I read the files and built a DataFrame like this:
customers_df = pd.DataFrame(index=stores.Item.unique(),
                            columns=[line.split(':')[0] for line in open('customers.txt').readlines()])
for customer in customers_df.columns:
    for item in customers_df.index:
        customers_df.loc[item, customer] = item in customers_dict[customer]
However, the code slows down drastically as I add data for more customers. Is there an efficient way to do this? The end goal is to get all the information about which users bought which items into one place where I can do further analysis. The file is updated automatically, and so far a single run takes about 6-7 minutes.
Answer 0 (score: 0)
In this case I would use sklearn.preprocessing.MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# engine="python" is required for the regex separator. The original snippet
# also omitted instantiating the binarizer; sparse_output=True is needed so
# fit_transform returns a scipy sparse matrix for DataFrame.sparse.from_spmatrix.
df = pd.read_csv("customers.txt", sep=r":\s*", header=None,
                 names=["cust", "items"], engine="python")
mlb = MultiLabelBinarizer(sparse_output=True)
res = (pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(df["items"].str.split(",")),
                                         index=df.index,
                                         columns=mlb.classes_)
       .set_index(df["cust"]))
Result:
In [24]: res
Out[24]:
item_1062 item_1312 item_1545 item_158 item_1659 ... item_2703 item_2858 item_454 item_575 \
cust ...
customer_21 0 0 0 0 0 ... 1 0 0 1
customer_11 0 0 0 1 0 ... 0 0 1 0
customer_10 0 0 0 0 0 ... 0 0 0 0
customer_4 0 1 1 0 0 ... 0 0 0 0
customer_6 1 0 0 0 0 ... 0 0 0 0
customer_23 0 0 0 0 1 ... 0 0 0 0
customer_14 0 0 0 0 0 ... 0 1 0 0
item_613
cust
customer_21 0
customer_11 0
customer_10 1
customer_4 0
customer_6 0
customer_23 0
customer_14 0
[7 rows x 14 columns]
In [25]: res.dtypes
Out[25]:
item_1062 Sparse[int32, 0]
item_1312 Sparse[int32, 0]
item_1545 Sparse[int32, 0]
item_158 Sparse[int32, 0]
item_1659 Sparse[int32, 0]
item_1760 Sparse[int32, 0]
item_2007 Sparse[int32, 0]
item_2608 Sparse[int32, 0]
item_2610 Sparse[int32, 0]
item_2703 Sparse[int32, 0]
item_2858 Sparse[int32, 0]
item_454 Sparse[int32, 0]
item_575 Sparse[int32, 0]
item_613 Sparse[int32, 0]
dtype: object
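With all purchases in one DataFrame, the "which users bought a particular item" lookup from the question becomes a simple boolean mask. A minimal sketch on a hypothetical two-customer sample (the customer and item names are borrowed from the question's data; the in-memory dict stands in for reading customers.txt):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical two-customer sample standing in for customers.txt
df = pd.DataFrame({"cust": ["customer_21", "customer_11"],
                   "items": ["item_575,item_2703", "item_454,item_575"]})

# sparse_output=True yields a scipy sparse matrix, as from_spmatrix expects
mlb = MultiLabelBinarizer(sparse_output=True)
res = (pd.DataFrame.sparse.from_spmatrix(
           mlb.fit_transform(df["items"].str.split(",")),
           index=df.index,
           columns=mlb.classes_)
       .set_index(df["cust"]))

# All customers who bought item_575, via a boolean mask on the sparse column
buyers = res[res["item_575"] == 1].index.tolist()
print(buyers)  # -> ['customer_21', 'customer_11']
```

Because the columns are sparse, memory stays proportional to the number of purchases rather than customers × items, which is what makes this scale where the nested .loc loop did not.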