My dataset looks like this (simplified):
+----+------+-------------------------------+
| ID | Name | Options                       |
+----+------+-------------------------------+
| 1  | John | {Sofa,Fridge,Pets,TV}         |
| 2  | Mary | {TV,Sofa,Fridge,Parking}      |
| 3  | Bob  | {TV,Sofa,Parking,Pets,Fridge} |
| 4  | Todd | {TV,Sofa,Fridge,Pets,AC}      |
+----+------+-------------------------------+
My expected output:
+----+------+----+------+--------+---------+------+----+
| ID | Name | TV | Sofa | Fridge | Parking | Pets | AC |
+----+------+----+------+--------+---------+------+----+
| 1  | John | 1  | 1    | 1      | 0       | 1    | 0  |
| 2  | Mary | 1  | 1    | 1      | 1       | 0    | 0  |
| 3  | Bob  | 1  | 1    | 1      | 1       | 1    | 0  |
| 4  | Todd | 1  | 1    | 1      | 0       | 1    | 1  |
+----+------+----+------+--------+---------+------+----+
My code:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", None)
listings = pd.read_csv("../listings.csv")
final_list = list(map(lambda val: val.replace("{", "").replace("}", ""), listings['amenities']))
final_list_1 = ""
for values in final_list:
    final_list_1 += "," + values
final_list_2 = final_list_1.split(',')
print(list(set(final_list_2))[1:])
With the output above, I can get every unique value in that column, for example:
['TV','Sofa','Fridge','Pets','AC','Parking']
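From here, I tried running a for loop that checks whether each value is present in the row and then writes True (1) or False (0).
I have about 50 such options, so that means 50 new columns. This looks a lot like a pivot, but without any aggregation.
However, I am not sure how to turn the list values inside each row into boolean values in their own new columns of a pandas DataFrame.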
Answer 0 (score: 1)
import numpy as np
import pandas as pd
# Load the dataset
data = [[1, "John", "{Sofa,Fridge,Pets,TV}"],
        [2, "Mary", "{TV,Sofa,Fridge,Parking}"],
        [3, "Bob", "{TV,Sofa,Parking,Pets,Fridge}"],
        [4, "Todd", "{TV,Sofa,Fridge,Pets,AC}"]]
df = pd.DataFrame(data, columns=["ID", "Name", "Options"])
# Remove the curly brackets (regex=False treats "{" and "}" as literal characters)
df.Options = df.Options.str.replace("{", "", regex=False).str.replace("}", "", regex=False)
# Extract the options per row and their unique values (these will be our new columns)
options_per_row = df.Options.str.split(',').tolist()
unique_values = np.unique(np.concatenate(options_per_row))
# We don't need the "Options" column anymore
df = df.drop('Options', axis=1)
# Use a list comprehension to check each row's options against unique_values - results in a table of 0's and 1's
binarised = [[1 if unique in el else 0 for unique in unique_values] for el in options_per_row]
# Make it a dataframe to easily concatenate with the original dataframe
binarised_df = pd.DataFrame(binarised, columns=unique_values)
# Concatenate the columns together
result = pd.concat([df, binarised_df], axis=1)
print(result)
This produces:
   ID  Name  AC  Fridge  Parking  Pets  Sofa  TV
0   1  John   0       1        0     1     1   1
1   2  Mary   0       1        1     0     1   1
2   3   Bob   0       1        1     1     1   1
3   4  Todd   1       1        0     1     1   1
If the order of the columns matters, you will have to tweak the code a bit, but this is the gist.
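On the column order: np.unique returns the option names sorted alphabetically, which is why the columns come out as AC, Fridge, Parking and so on. One possible tweak, shown here only as a sketch on the sample data from the question, is to collect the options in order of first appearance, or to select the columns in an explicit order afterwards. The names ordered_options and wanted are made up for this illustration.

```python
import pandas as pd

# Sample data from the question
data = [[1, "John", "{Sofa,Fridge,Pets,TV}"],
        [2, "Mary", "{TV,Sofa,Fridge,Parking}"],
        [3, "Bob", "{TV,Sofa,Parking,Pets,Fridge}"],
        [4, "Todd", "{TV,Sofa,Fridge,Pets,AC}"]]
df = pd.DataFrame(data, columns=["ID", "Name", "Options"])

# Strip the braces and split each row into a list of options
options_per_row = df["Options"].str.strip("{}").str.split(",").tolist()

# dict.fromkeys keeps insertion order, so options stay in first-seen order
ordered_options = list(dict.fromkeys(opt for row in options_per_row for opt in row))

# Same 0/1 table as before, but with columns in first-seen order
binarised = [[int(opt in row) for opt in ordered_options] for row in options_per_row]
result = pd.concat([df[["ID", "Name"]],
                    pd.DataFrame(binarised, columns=ordered_options)], axis=1)

# Alternatively, impose an explicit order by selecting the columns by name
wanted = ["ID", "Name", "TV", "Sofa", "Fridge", "Parking", "Pets", "AC"]
result = result[wanted]
print(result)
```

With around 50 options, typing out an explicit list like wanted gets tedious, so the first-seen order (or a sorted list you adjust by hand) is probably the more practical route.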
Edit:
To explain further, the list comprehension is equivalent to:
binarised = []
for options in options_per_row:
    binarised_row = []
    for unique in unique_values:
        if unique in options:
            binarised_row.append(1)
        else:
            binarised_row.append(0)
    binarised.append(binarised_row)
which in this case produces the following intermediate result:
[[0, 1, 0, 1, 1, 1], [0, 1, 1, 0, 1, 1], [0, 1, 1, 1, 1, 1], [1, 1, 0, 1, 1, 1]]
which then becomes binarised_df:
   AC  Fridge  Parking  Pets  Sofa  TV
0   0       1        0     1     1   1
1   0       1        1     0     1   1
2   0       1        1     1     1   1
3   1       1        0     1     1   1
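As a side note, pandas also has a built-in string method, Series.str.get_dummies, that does the split and the 0/1 encoding in one step. This is a different technique from the list comprehension above, sketched here on the sample data from the question and assuming the options never contain a comma themselves. The name dummies is just for illustration.

```python
import pandas as pd

# Sample data from the question
data = [[1, "John", "{Sofa,Fridge,Pets,TV}"],
        [2, "Mary", "{TV,Sofa,Fridge,Parking}"],
        [3, "Bob", "{TV,Sofa,Parking,Pets,Fridge}"],
        [4, "Todd", "{TV,Sofa,Fridge,Pets,AC}"]]
df = pd.DataFrame(data, columns=["ID", "Name", "Options"])

# Strip the braces, then split on "," and build one 0/1 column per option
dummies = df["Options"].str.strip("{}").str.get_dummies(sep=",")

# Attach the indicator columns to the remaining columns
result = pd.concat([df.drop(columns="Options"), dummies], axis=1)
print(result)
```

The resulting columns are again in alphabetical order, so the same reordering caveat applies if you need a specific layout.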