Looking for a way to speed up the following code

Time: 2019-08-23 11:23:56

Tags: python pandas

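My dataset has three columns, each holding lists of elements stored as strings. For every row, I want to get the elements that are common to all of those lists.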

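Please keep the following in mind:
1) A column can contain the value 'NAN' instead of a list.
2) If there are several common elements, it does not matter which one is chosen.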
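To make the cell format concrete, here is a minimal sketch of decoding a single cell. The helper name `parse_cell` is hypothetical (it does not appear in the question), and the sketch assumes every cell is either the string 'NAN' or a valid Python list literal.

```python
import ast

def parse_cell(cell):
    """Decode one cell: a real list for list-like strings, None for 'NAN'."""
    if cell == 'NAN':
        return None  # placeholder cell, carries no genres
    # literal_eval safely parses the string "['a', 'b']" into a list
    return ast.literal_eval(cell)

print(parse_cell("['hip hop', 'trap']"))  # ['hip hop', 'trap']
print(parse_cell('NAN'))                  # None
```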

I wrote the following function, which takes a DataFrame containing the three relevant columns as its argument:

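import ast

def parse_genres(df):

    def get_genre(row):
        common = None  # running intersection of the lists seen so far
        for element in row:
            if element != 'NAN':
                genres = set(ast.literal_eval(element))
                common = genres if common is None else common & genres
        # Any common element will do; return 'NAN' when there is none
        return next(iter(common)) if common else 'NAN'

    # axis=1 applies get_genre to each row rather than to each column
    result = df.apply(get_genre, axis=1)
    return result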

Input:

index   col1                  col2                                    col3
0       NAN                   NAN                                     NAN
1       ['hip hop', 'trap']   ['indie', 'trap']                       NAN
2       ['pop', 'viral pop']  ['dance pop', 'pop', 'post-teen pop']   NAN

Expected output:

index   col
0       NAN
1       'trap'
2       'pop'

1 answer:

Answer 0 (score: 0)

I made a few optimizations. See if this helps.

import pandas as pd
import numpy as np

data = {
    'col1': [np.nan, ['hip hop', 'trap'], ['pop', 'viral pop', 'post-teen pop']],
    'col2': [np.nan, ['indie', 'trap'], ['dance pop', 'pop', 'post-teen pop']],
    'col3': [np.nan, np.nan, np.nan]
}

df = pd.DataFrame(data)

common_words = []
for _, row in df.iterrows():
    common = None  # intersection of the lists seen so far in this row
    for value in row:
        # Skip NaN cells; only real lists take part in the intersection
        if isinstance(value, list):
            common = set(value) if common is None else common & set(value)
    # Rows without any list stay NaN; sorting keeps the output order stable
    common_words.append(np.nan if common is None else sorted(common))

result_df = pd.DataFrame({'common_words': common_words})

print(df)
print(result_df)

Input:
                              col1                             col2  col3
0                              NaN                              NaN   NaN
1                  [hip hop, trap]                    [indie, trap]   NaN
2  [pop, viral pop, post-teen pop]  [dance pop, pop, post-teen pop]   NaN
Output:
           common_words
0                   NaN
1                [trap]
2  [pop, post-teen pop]
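
Note that the code above works on real Python lists, whereas the question stores each list as a string and uses 'NAN' as the placeholder. A variant for that format could look like the sketch below. The names `common_genre` and `common_genres` are made up for illustration, it assumes every non-'NAN' cell is a valid list literal, and no timings were measured, so any speed-up should be checked on the real data.

```python
import ast

def common_genre(cells):
    """Return one genre shared by every list in `cells`, or 'NAN' if none."""
    common = None
    for cell in cells:
        if cell == 'NAN':
            continue  # placeholder cells do not restrict the intersection
        genres = set(ast.literal_eval(cell))
        common = genres if common is None else common & genres
        if not common:
            break  # intersection is already empty, stop parsing this row
    # min() makes the pick deterministic; any common element is acceptable
    return min(common) if common else 'NAN'

def common_genres(rows):
    """Apply common_genre to each row (any iterable of cell tuples)."""
    return [common_genre(row) for row in rows]

# Sample rows taken from the question, one tuple per row (col1, col2, col3)
rows = [
    ('NAN', 'NAN', 'NAN'),
    ("['hip hop', 'trap']", "['indie', 'trap']", 'NAN'),
    ("['pop', 'viral pop']", "['dance pop', 'pop', 'post-teen pop']", 'NAN'),
]
print(common_genres(rows))  # ['NAN', 'trap', 'pop']
```

With a DataFrame, the rows can be produced without `iterrows` by zipping the columns, for example `pd.Series(common_genres(zip(df['col1'], df['col2'], df['col3'])), index=df.index)`.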