Question

I've run into a data science problem where I need to create a test set using the information provided in two csv files.

Problem

data1.csv

cat,In1,In2
aaa,0,1
aaa,2,1
aaa,2,0
aab,3,2
aab,1,2

data2.csv

cat,index,attribute1,attribute2
aaa,0,150,450
aaa,1,250,670
aaa,2,30,250
aab,0,60,650
aab,1,50,30
aab,2,20,680
aab,3,380,250

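From these two files, I need an updated data1.csv file. In place of In1 and In2, I need the attributes of those specific indices (In1 and In2), looked up within the specific category (cat).

Note: every index within a given category (cat) has its own attributes.

The result should look like this: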

updated_data1.csv

cat,In1a1,In1a2,In2a1,In2a2
aaa,150,450,250,670
aaa,30,250,250,670
aaa,30,250,150,450
aab,380,250,20,680
aab,50,30,20,680

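I need a way to solve this using pandas in Python. So far I have loaded the csv files into my Jupyter notebook, and I don't know where to start.

Please note that this is my first week doing data manipulation with Python, and I know very little Python. Please also forgive the ugly formatting; I'm typing this question on my phone.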

Answer 1

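As others have suggested, you can use pd.merge. In this case you need to merge on multiple columns. Basically, you need to define which columns of the left DataFrame (here data1) map to which columns of the right DataFrame (here data2). See also pandas merging 101.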

import pandas as pd

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# DataFrame with the In1 columns; how='left' keeps every row of data1 in its original order
df1 = pd.merge(left=data1, right=data2, how='left', left_on=['cat', 'In1'], right_on=['cat', 'index'])
df1 = df1[['cat', 'attribute1', 'attribute2']].set_index('cat')
# DataFrame with the In2 columns
df2 = pd.merge(left=data1, right=data2, how='left', left_on=['cat', 'In2'], right_on=['cat', 'index'])
df2 = df2[['cat', 'attribute1', 'attribute2']].set_index('cat')
# Join the two DataFrames side by side (both have the same 'cat' index, in the same order)
df = pd.concat([df1, df2], axis=1)
# Name the columns as desired
df.columns = ['in1a1', 'in1a2', 'in2a1', 'in2a2']
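If you want to check the idea without the csv files on disk, here is a minimal sketch of the same two-column merge, chained so that no concat is needed. The sample data is copied inline from the question; the helper name attach and the variable updated are my own, purely for illustration.

```python
import io
import pandas as pd

# Sample data copied from the question, inlined so the sketch is self-contained
data1 = pd.read_csv(io.StringIO(
    "cat,In1,In2\naaa,0,1\naaa,2,1\naaa,2,0\naab,3,2\naab,1,2\n"))
data2 = pd.read_csv(io.StringIO(
    "cat,index,attribute1,attribute2\n"
    "aaa,0,150,450\naaa,1,250,670\naaa,2,30,250\n"
    "aab,0,60,650\naab,1,50,30\naab,2,20,680\naab,3,380,250\n"))


def attach(left, key):
    """Hypothetical helper: add the data2 attributes for the index stored in `key`."""
    renamed = data2.rename(columns={
        "index": key,
        "attribute1": key + "a1",
        "attribute2": key + "a2",
    })
    # how="left" keeps every row of `left` in its original order;
    # validate raises if a (cat, index) pair appears more than once in data2.
    return left.merge(renamed, on=["cat", key], how="left", validate="many_to_one")


# Look up In1 first, then In2, and keep only the requested columns
updated = attach(attach(data1, "In1"), "In2")
updated = updated[["cat", "In1a1", "In1a2", "In2a1", "In2a2"]]
print(updated)
```

From there, updated.to_csv('updated_data1.csv', index=False) should write the file in the layout shown in the question.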

You should generally avoid iterating over DataFrames because it is inefficient, but a row-by-row loop is definitely a workable solution too:

import pandas as pd

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# This list will be the data for the resulting DataFrame
rows = []
# Iterate through data1, unpacking values in each row to variables
for idx, cat, in1, in2 in data1.itertuples():
    # Create a dictionary for each row where the keys are the column headers of the future DataFrame
    row = {}
    row['cat'] = cat
    # Boolean masks that pick the matching row from data2
    # (named mask1/mask2 so the loop variables in1/in2 are not overwritten)
    mask1 = (data2['index'] == in1) & (data2['cat'] == cat)
    mask2 = (data2['index'] == in2) & (data2['cat'] == cat)
    # Assign the correct values to the keys in the dictionary
    row['in1a1'] = data2.loc[mask1, 'attribute1'].values[0]
    row['in1a2'] = data2.loc[mask1, 'attribute2'].values[0]
    row['in2a1'] = data2.loc[mask2, 'attribute1'].values[0]
    row['in2a2'] = data2.loc[mask2, 'attribute2'].values[0]
    # Append the dictionary to the list
    rows.append(row)
# Construct a DataFrame from the list of dictionaries
df = pd.DataFrame(rows)
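If the loop turns out to be too slow on larger files, one possible middle ground is to index data2 by (cat, index) once and fetch all rows for a column in a single lookup. This is only a sketch: it assumes each (cat, index) pair is unique in data2, and the names lookup, pairs and parts are mine. The data is inlined from the question so it runs on its own.

```python
import io
import pandas as pd

# Sample data copied from the question, inlined so the sketch is self-contained
data1 = pd.read_csv(io.StringIO(
    "cat,In1,In2\naaa,0,1\naaa,2,1\naaa,2,0\naab,3,2\naab,1,2\n"))
data2 = pd.read_csv(io.StringIO(
    "cat,index,attribute1,attribute2\n"
    "aaa,0,150,450\naaa,1,250,670\naaa,2,30,250\n"
    "aab,0,60,650\naab,1,50,30\naab,2,20,680\naab,3,380,250\n"))

# Index data2 once by (cat, index); assumes these pairs are unique
lookup = data2.set_index(["cat", "index"])

parts = [data1[["cat"]]]
for key in ["In1", "In2"]:
    # Build the (cat, index) pairs for this column and fetch all rows at once
    pairs = pd.MultiIndex.from_arrays([data1["cat"], data1[key]])
    attrs = lookup.reindex(pairs).reset_index(drop=True)
    attrs.columns = [key.lower() + "a1", key.lower() + "a2"]
    parts.append(attrs)

# Put 'cat' and the four attribute columns side by side
df = pd.concat(parts, axis=1)
print(df)
```

A pair that is missing from data2 would show up as NaN here rather than raising an error, which may or may not be what you want.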

Create new columns in a csv file using data from another csv file
