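My dataset has three columns, each holding lists of elements stored as strings. I want to get the elements that are common to all of these lists.
Keep the following in mind: 1) a column can contain the value 'NAN' instead of a list; 2) if there are several common elements, it does not matter which one is chosen.
I wrote the following function, which takes a DataFrame containing the three relevant columns as its argument: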
def parse_genres(df):
    def get_genre(row):
        row_list = []
        for element in row:
            if element != 'NAN':
                y = ast.literal_eval(element)
                for genre in y:
                    if genre not in row_list:
                        row_list += genre
        unique = set(row_list)
        return list(unique)[0]
    result = df.apply(get_genre)
    return result
Input:
index  col1                  col2                                   col3
0      NAN                   NAN                                    NAN
1      ['hip hop', 'trap']   ['indie', 'trap']                      NAN
2      ['pop', 'viral pop']  ['dance pop', 'pop', 'post-teen pop']  NAN
Expected output:
index  col
0      NAN
1      'trap'
2      'pop'
Answer 0 (score: 0)
I made a few optimizations; see if this helps.
import pandas as pd
import numpy as np

data = {
    'col1': [np.nan, ['hip hop', 'trap'], ['pop', 'viral pop', 'post-teen pop']],
    'col2': [np.nan, ['indie', 'trap'], ['dance pop', 'pop', 'post-teen pop']],
    'col3': [np.nan, np.nan, np.nan]
}
df = pd.DataFrame(data)
result_df = pd.DataFrame(columns=['common_words'])

for idx, rows in enumerate(df.iterrows()):
    new_set = None
    valid_set_found = False
    # rows is an (index, Series) pair; iterate over the values of the row
    for value in rows[1]:
        if isinstance(value, list):
            if valid_set_found is False:
                # The first list in the row seeds the running intersection
                new_set = set(value)
                valid_set_found = True
                continue
            new_set = set(value) & new_set
    if new_set is None:
        # No list in this row at all
        result_df.loc[idx] = np.nan
    else:
        new_list = list(new_set)
        result_df.loc[idx] = [new_list]

print(df)
print(result_df)
Input:
                              col1                             col2  col3
0                              NaN                              NaN   NaN
1                  [hip hop, trap]                    [indie, trap]   NaN
2  [pop, viral pop, post-teen pop]  [dance pop, pop, post-teen pop]   NaN
Output:
           common_words
0                   NaN
1                [trap]
2  [pop, post-teen pop]
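The code above works on real Python lists and NaN, whereas the question describes lists stored as strings with a literal 'NAN' placeholder, and asks for a single element rather than the whole intersection. The following is only a minimal sketch of how the same set-intersection idea might be applied to that input. The function name common_genre and its missing parameter are made up for illustration, and it assumes every cell is either the string 'NAN' or a string that ast.literal_eval can parse into a list.

```python
import ast

import pandas as pd


def common_genre(row, missing='NAN'):
    """Return one element shared by every list in the row, or `missing`."""
    common = None
    for cell in row:
        if cell == missing:
            continue  # skip placeholder cells
        genres = set(ast.literal_eval(cell))  # "['a', 'b']" -> {'a', 'b'}
        # The first list seeds the set; later lists are intersected with it
        common = genres if common is None else common & genres
    if not common:
        return missing  # no lists in the row, or no shared element
    # Any shared element would do; sorting just makes the choice repeatable
    return sorted(common)[0]


# Sample data copied from the question (lists stored as strings)
df = pd.DataFrame({
    'col1': ['NAN', "['hip hop', 'trap']", "['pop', 'viral pop']"],
    'col2': ['NAN', "['indie', 'trap']", "['dance pop', 'pop', 'post-teen pop']"],
    'col3': ['NAN', 'NAN', 'NAN'],
})

# axis=1 passes each row (rather than each column) to the function
result = df.apply(common_genre, axis=1)
print(result.tolist())
```

Note the axis=1 argument: without it, DataFrame.apply hands the function one column at a time, which is not what a per-row comparison needs.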