Question

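I am learning Python and picked up a dataset from Kaggle to learn more about data exploration and visualization in Python.

I have a "cuisines" column in the data frame in the following format: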

North Indian, Mughlai, Chinese
Chinese, North Indian, Thai
Cafe, Mexican, Italian
South Indian, North Indian
North Indian, Rajasthani
North Indian
North Indian, South Indian, Andhra, Chinese

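I want to split this column on commas and get the unique values from it. I then want to add those unique values back to the original data frame as new columns.

Based on other posts, I tried the following:

1) Convert to a list, then to a set, and flatten to get the unique values

type() reports the column as a Series. Converting it to a list and then to a set raises an error: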


type(fl1.cuisines)
pandas.core.series.Series

cuisines_type = fl1['cuisines'].tolist()
type(cuisines_type)
list

cuisines_type
#this returns list of cuisines

cuisines_set = set([ a for b in cuisines_type for a in b])
TypeError: 'float' object is not iterable
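
The TypeError most likely comes from missing values: pandas stores an empty cell as NaN, which is a float, so the inner loop ends up iterating over a float. Iterating over a string would also yield single characters rather than cuisines, so each string needs to be split first. A minimal sketch, assuming a made-up stand-in for fl1 that contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle data, with one missing value.
fl1 = pd.DataFrame({
    "cuisines": [
        "North Indian, Mughlai, Chinese",
        "Cafe, Mexican, Italian",
        np.nan,
        "North Indian",
    ]
})

cuisines_set = {
    name.strip()                           # remove the space left after each comma
    for row in fl1["cuisines"].dropna()    # skip NaN (float) entries
    for name in row.split(",")             # split the string into cuisine names
}
print(sorted(cuisines_set))
```

Dropping the missing rows is only one choice; filling them with an empty string would work as well, as long as empty names are filtered out afterwards.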

2) Convert it to an array and then to a list

cs = pd.unique(fl1['cuisines'].str.split(',',expand=True).stack())

type(cs)
Out[141]: numpy.ndarray

cs.tolist()

This returns a list, but I cannot remove the whitespace that ends up at the start of some elements.
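
The stray spaces appear because the values are separated by ", " while the split is on "," alone. One way to remove them, sketched here on a made-up sample frame, is to strip each piece before taking the unique values. This sketch uses Series.explode rather than expand=True with stack, which is a substitution on my part, not the approach from the attempt above.

```python
import pandas as pd

# Hypothetical sample resembling the column in the question.
fl1 = pd.DataFrame({
    "cuisines": [
        "North Indian, Mughlai, Chinese",
        "Chinese, North Indian, Thai",
        "South Indian, North Indian",
    ]
})

pieces = fl1["cuisines"].str.split(",").explode()   # one cuisine per row
cs = pieces.str.strip().dropna().unique().tolist()  # strip spaces, drop NaN, deduplicate
print(cs)
```

Splitting directly on ", " would also avoid the spaces, but stripping is more forgiving if the spacing in the data is inconsistent.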

The expected output is the unique list of cuisines, added back as columns:

North Indian | Mughlai | Chinese

Answer 1

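I think you need Series.str.get_dummies. A similar approach can combine get_dummies with your own split-based solution; if duplicates are possible, remove them by taking the max per column, so the output is always 0 or 1, or use sum to get count values: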

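df = fl1.cuisines.str.get_dummies(', ')
# The level= argument of max/sum was removed in pandas 2.0; if duplicate
# column labels ever need collapsing, group on the transposed frame instead:
# df = df.T.groupby(level=0).max().T
#if need count values
# df = df.T.groupby(level=0).sum().T
print (df)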
   Andhra  Cafe  Chinese  Italian  Mexican  Mughlai  North Indian  Rajasthani  \
0       0     0        1        0        0        1             1           0   
1       0     0        1        0        0        0             1           0   
2       0     1        0        1        1        0             0           0   
3       0     0        0        0        0        0             1           0   
4       0     0        0        0        0        0             1           1   
5       0     0        0        0        0        0             1           0   
6       1     0        1        0        0        0             1           0   

   South Indian  Thai  
0             0     0  
1             0     1  
2             0     0  
3             1     0  
4             0     0  
5             0     0  
6             1     0
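
To attach these indicator columns to the original frame, as the question asks, the result can be joined back on the row index. A small sketch, assuming the index of fl1 is unchanged between the two steps (the sample data and the names dummies and result are illustrative):

```python
import pandas as pd

# Hypothetical sample resembling the column in the question.
fl1 = pd.DataFrame({
    "cuisines": [
        "North Indian, Mughlai, Chinese",
        "Cafe, Mexican, Italian",
        "North Indian",
    ]
})

dummies = fl1["cuisines"].str.get_dummies(sep=", ")   # one 0/1 column per cuisine
result = fl1.join(dummies)                            # aligned on the row index
print(result.columns.tolist())
```

DataFrame.join keeps the original cuisines column next to the new ones, which makes it easy to spot-check a few rows.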

Answer 2

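"I want to split this column on commas and get the unique values from it. I want to add those unique values back to the original data frame as new columns."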

# dropna() skips missing values, which would otherwise make ','.join raise a TypeError
a = list(set([i.strip() for i in ','.join(df['cuisine'].dropna()).split(',')]))

Output

['Thai',
 'Mughlai',
 'Mexican',
 'Rajasthani',
 'Andhra',
 'Chinese',
 'North Indian',
 'Cafe',
 'Italian',
 'South Indian']

Use DataFrame.assign to add these columns back to the original df

df.assign(**{i:0 for i in a})
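
Note that this assign call fills every new column with 0 and returns a new frame rather than modifying df in place. If the columns should instead flag which cuisines each row contains, one option is sketched below. The names tokens and flags and the sample frame are made up for illustration, and the sketch assumes the column has no missing values.

```python
import pandas as pd

# Hypothetical sample frame; assumes no missing values in "cuisine".
df = pd.DataFrame({
    "cuisine": [
        "North Indian, Mughlai",
        "Cafe, Mexican",
        "North Indian",
    ]
})

# Set of stripped cuisine names for each row.
tokens = df["cuisine"].apply(lambda s: {t.strip() for t in s.split(",")})
a = sorted(set().union(*tokens))   # unique cuisine names across all rows

# One 0/1 column per cuisine; n=name binds the loop variable inside the lambda.
flags = {name: tokens.apply(lambda ts, n=name: int(n in ts)) for name in a}
df = df.assign(**flags)            # reassign, since assign returns a copy
print(df)
```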

Answer 3

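Save the file as a CSV, then load it with the pandas read_csv() method. Then go through the columns, put each one into its own list, and take the unique values of each list.

Initialize a new DataFrame from these new lists, whose entries are now unique.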

import pandas as pd

df = pd.read_csv('cuisine.csv')

# Collect the unique values of every column (column 1 through column n).
new_dataframe = pd.DataFrame()
for i in range(df.shape[1]):
    column_lst = list(set(df.iloc[:, i].values.tolist()))
    new_dataframe[f'Column_{i + 1}_unique'] = column_lst

Note: this only works if all of your lists have the same length.
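
If the unique lists end up with different lengths, which is likely for real data, wrapping each one in a pd.Series lets pandas pad the shorter columns with NaN instead of raising an error. A sketch using a small in-memory frame in place of cuisine.csv (the column names and values are made up):

```python
import pandas as pd

# Hypothetical frame standing in for the loaded CSV.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Jaipur"],
    "cuisine": ["Thai", "Cafe", "Andhra"],
})

# Each Series gets a 0..k-1 index, so shorter columns are padded with NaN.
new_dataframe = pd.DataFrame({
    f"{col}_unique": pd.Series(df[col].unique()) for col in df.columns
})
print(new_dataframe)
```

Series.unique also preserves the order of first appearance, which set() does not guarantee.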

Hope this helps :))

Split column >> get unique values >> add unique values back as columns
