Question

I have two CSV files:

"inventory.csv" with columns 'Code', -many other columns-

"sales.csv" with columns 'Code', ..., 'Sum(Quantity)'

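The 'Code' column in sales contains a product code (a string)

e.g. '6ES7 122-1BB10-0AA0'

Every 'Code' in sales.csv exists in stock.csv (inventory.csv in the code below), but not vice versa

My goal is to build a DataFrame with the following columns:

'Code', 'Sum(Quantity)', -many other columns-

I wrote some code that is meant to do this (shown below), but it raises the following error:

IndexError: single positional indexer is out-of-bounds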

Part of sales.csv:

Code; ...; ...; "Sum(Quantity)"

...

SSD 2.5" 32GB SATA III; Onbekend; SSD; 4;

6SE7018-0EP50-Z (Z = C23 + F01 + G91); Siemens; Masterdrive MC; 4;

QS30LLPCQ; Banner; 3071378; 4;

6ES5 318-8MB12; Siemens; Interface module: ET200U; 4;

...

Part of inventory.csv:

Code; -many other columns-

...

6SE7018-0EP50-Z; Siemens; Masterdrive MC; 0; 0; 0; 0; 0; 0; 0; 1;

6SE7018-0EP50-Z (Z = C23 + F01 + G91); Siemens; Masterdrive MC; 0; 0; 0; 0; 0; 0; 0; 0;

6SE7018-0EP50-Z (Z = L20); Siemens; Masterdrive MC; 0; 0; 1; 0; 0; 0; 0; 0;

6SE7018-0EP60; Siemens; Masterdrive VC; 0; 0; 0; 0; 0; 0; 0; 0;

...

The error occurs when searching stock.csv for '6SE7018-0EP50-Z (Z = C23 + F01 + G91)', so I guess it is caused by one of these characters: ( ) + =

The code I am using is as follows:

import pandas as pd

filename_sales = "sales.csv"
filename_inv = "inventory.csv"

df_sales = pd.read_csv(filename_sales, sep=';')
df_inv = pd.read_csv(filename_inv, sep=';')

# throw away unneeded columns
df_sales = df_sales[["Code","Sum(Quantity)"]]
df_inv = df_inv.drop(['Brand Name', 'Name'], axis=1)

df_out = pd.DataFrame()


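for index, row in df_sales.iterrows():
   # take the first inventory row whose 'Code' contains the sales code
   temp = df_inv[df_inv['Code'].str.contains(row["Code"])].iloc[0]
   # attach the sold quantity to that inventory row
   temp["Sum(Quantity)"] = row['Sum(Quantity)']
   # collect the combined row in the output frame
   df_out = df_out.append(temp)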

print(df_out)

How can I avoid/solve this error?
Is this an appropriate way to join these DataFrames into what I want?

Answer 1

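I'd suggest using df.merge() to join the two tables; you can then join the two DataFrames just as you would SQL tables.

Make sure you join on='Code'. You can specify a left join or an inner join, and that way you avoid the messy, painful, slow iteration over pandas DataFrames.

For example, with the DataFrames from the question: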
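The difference between the two join types can be sketched on a few made-up rows. The data below is only loosely modelled on the excerpts in the question, and the `Stock` column is a hypothetical stand-in for the many other inventory columns. One sales code is deliberately left out of the inventory so the two joins give different results.

```python
import pandas as pd

# Made-up rows loosely modelled on the excerpts in the question.
df_sales = pd.DataFrame({
    "Code": ["6SE7018-0EP50-Z (Z = C23 + F01 + G91)", "QS30LLPCQ"],
    "Sum(Quantity)": [4, 4],
})
df_inv = pd.DataFrame({
    "Code": [
        "6SE7018-0EP50-Z",
        "6SE7018-0EP50-Z (Z = C23 + F01 + G91)",
        "6SE7018-0EP60",
    ],
    # Hypothetical column standing in for the other inventory columns.
    "Stock": [1, 0, 0],
})

# Left join: keeps every sales row; inventory columns are NaN where no code matches.
left = df_sales.merge(df_inv, how="left", on="Code")

# Inner join: keeps only sales rows whose code also appears in the inventory.
inner = df_sales.merge(df_inv, how="inner", on="Code")

print(left)
print(inner)
```

Because merge compares the key values for exact equality, characters such as ( ) + = in a code get no special treatment here.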

df_sales = df_sales[["Code","Sum(Quantity)"]]
df_inv = df_inv.drop(['Brand Name', 'Name'], axis=1)
# once you have made sure you have the columns you want, go ahead and join them

df_sales = df_sales.merge(df_inv, how='left', on='Code')
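As for the IndexError itself, a likely cause is that str.contains treats its argument as a regular expression by default. In that case the parentheses form a group and + acts as a quantifier, so the pattern does not match the literal code, the filtered frame is empty, and .iloc[0] fails. A minimal sketch of this on a made-up Series (not the asker's actual file):

```python
import re
import pandas as pd

# Made-up codes loosely modelled on the inventory excerpt in the question.
codes = pd.Series([
    "6SE7018-0EP50-Z",
    "6SE7018-0EP50-Z (Z = C23 + F01 + G91)",
    "6SE7018-0EP60",
])
target = "6SE7018-0EP50-Z (Z = C23 + F01 + G91)"

# Default behaviour: the pattern is compiled as a regex, so "(...)" is a group
# and "+" is a quantifier; nothing matches the literal text.
# (pandas may also warn that the pattern contains match groups.)
as_regex = codes.str.contains(target)

# Literal substring search, two equivalent ways.
literal = codes.str.contains(target, regex=False)
escaped = codes.str.contains(re.escape(target))

# Exact equality avoids substring matching altogether.
exact = codes == target

print(as_regex.any())
print(literal.tolist())
print(exact.tolist())
```

Even with regex=False, contains is still a substring test: a plain code such as '6SE7018-0EP50-Z' would also match its longer variants, and .iloc[0] would then pick whichever comes first. An exact comparison, or the merge above, avoids that.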

Joining DataFrames with special characters in pandas
