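I have a model that predicts 10 words for a given course, each with some likelihood, and I want the top 5 of those words that also appear in the course description.
The data is formatted like this: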
course_name  course_title  course_description                                 predicted_word_10  predicted_word_9  predicted_word_8  predicted_word_7  predicted_word_6  predicted_word_5  predicted_word_4  predicted_word_3  predicted_word_2  predicted_word_1
Xmath 32     Precalculus   Polynomial and rational functions, exponential...  directed           scholars          approach          build             african           different         visual            cultures          placed            global
Xphilos 2    Morality      Introduction to ethical and political philosop...  make               presentation      weekly            european          ways              general           range             questions         liberal           speakers
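My idea is to iterate over each row, starting from predicted_word_1, until I have the first 5 predicted words that appear in the description. I want to save these words, in the order they are found, in additional columns description_word_1 ... description_word_5. (If fewer than 5 predicted words appear in the description, I plan to return NaN in the corresponding columns.)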
To illustrate: if a course's course_description is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra..., I want to return induction, exponential, logarithmic, polynomial, algebra in that order, and do the same for the remaining courses.
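The selection rule described above can be sketched in plain Python, before pandas enters the picture. This is only a sketch under stated assumptions: the helper name `first_k_in_description` is hypothetical, and it assumes whole-word, case-insensitive matching, since the example expects `polynomial` to match `Polynomial`.

```python
import re

def first_k_in_description(description, predicted_words, k=5):
    """Return the first k predicted words that occur in the description.

    Assumption: matching is on whole words and ignores case.
    """
    # Tokenize into lowercase words so punctuation does not stick to tokens.
    description_tokens = set(re.findall(r"[a-z]+", description.lower()))
    hits = []
    for word in predicted_words:  # assumed to be ordered best-first
        if word.lower() in description_tokens and word not in hits:
            hits.append(word)
        if len(hits) == k:
            break
    return hits

description = (
    "Polynomial and rational functions, exponential and logarithmic "
    "functions, trigonometry and trigonometric functions. Complex numbers, "
    "fundamental theorem of algebra, mathematical induction, binomial "
    "theorem, series, and sequences. "
)
predicted = ["irrelevantword1", "induction", "exponential", "logarithmic",
             "irrelevantword2", "polynomial", "algebra"]

result = first_k_in_description(description, predicted)
print(result)
# ['induction', 'exponential', 'logarithmic', 'polynomial', 'algebra']
```

Note that a plain `word in description` test on strings checks for substrings, so a short predicted word could match inside a longer one; tokenizing first avoids that.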
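My attempt was to define a function for apply that works on one row at a time and iterates from the first predicted word until it finds the first 5 that are in the description. The part I can't figure out is how to create the additional columns so that each course gets its own correct words. The code currently keeps the words of a single course for all rows.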
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index=1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <=5) :
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1
df.apply(find_top_description_words,axis=1)
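One thing that stands out in the attempt is that the function returns nothing, so `df.apply` has no per-row values to assemble into columns. A minimal sketch of one possible way around this is to return a new `pd.Series` per row and join the result back. The toy DataFrame and the constants `NUM_WORDS_PER_COURSE` and `TOP_K` are made up for illustration and are smaller than the real data.

```python
import numpy as np
import pandas as pd

NUM_WORDS_PER_COURSE = 3  # toy value; the question uses 10
TOP_K = 2                 # toy value; the question uses 5

# Made-up rows in the same shape as the question's data.
df = pd.DataFrame({
    "course_title": ["Precalculus", "Morality"],
    "course_description": ["Polynomial and rational functions",
                           "Introduction to ethical philosophy"],
    "predicted_word_1": ["global", "ethical"],
    "predicted_word_2": ["rational", "speakers"],
    "predicted_word_3": ["functions", "liberal"],
})

def find_top_description_words(row):
    description = row["course_description"]
    found = []
    for i in range(1, NUM_WORDS_PER_COURSE + 1):
        word = row[f"predicted_word_{i}"]
        # Same substring test as the original attempt.
        if isinstance(word, str) and word in description:
            found.append(word)
        if len(found) == TOP_K:
            break
    # Pad with NaN so every row yields the same columns.
    found += [np.nan] * (TOP_K - len(found))
    return pd.Series(
        found, index=[f"description_word_{j}" for j in range(1, TOP_K + 1)]
    )

# apply(axis=1) stacks the returned Series into a DataFrame, one row each.
description_words = df.apply(find_top_description_words, axis=1)
result = df.join(description_words)
print(result[["course_title", "description_word_1", "description_word_2"]])
```

Returning a fresh Series instead of assigning into `row` keeps the input untouched and makes the new column names explicit.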
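The end goal of this processing is to keep both the model's top 10 predicted words and the top 5 predicted words from the description, so that the DataFrame looks like: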
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thanks!
Answer 0 (score: 1)
If I understand correctly:
Create a new DataFrame with only the 10 predicted words:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
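Note that each row now holds a list of predicted words. The order is right: the first non-null predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame: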
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
The final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
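To see what the three steps above do, here is a small sketch on a made-up frame with three prediction columns instead of ten. The data is invented; only the slicing and joining pattern mirrors the answer.

```python
import pandas as pd

# Made-up single-row frame; prediction columns run from 3 down to 1.
df = pd.DataFrame({
    "course_name": ["Xmath 32"],
    "course_title": ["Precalculus"],
    "course_description": ["Polynomial and rational functions"],
    "predicted_word_3": ["visual"],
    "predicted_word_2": ["placed"],
    "predicted_word_1": ["global"],
})

# Reverse each row's predictions so the word from predicted_word_1 comes first.
pred_words_lists = df.apply(lambda x: list(x.iloc[3:].dropna())[::-1], axis=1)

pred_words_df = pd.DataFrame(pred_words_lists.tolist())
# columns[:2:-1] walks backwards from the last column and stops before index 2,
# so the labels come out as predicted_word_1, predicted_word_2, predicted_word_3.
pred_words_df.columns = df.columns[:2:-1]

final_df = df[["course_name", "course_title", "course_description"]].join(pred_words_df)
print(list(final_df.columns))
```

If some rows contain missing predictions, the lists have different lengths after `dropna()`, so the word-to-column alignment for those rows deserves a second look.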
Hope this works.
EDIT
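def common_elements(xx, yy):
    # Position of each description word; keep only the first occurrence,
    # because reindex raises on duplicate index labels.
    temp = pd.Series(range(0, len(xx)), index=xx)
    temp = temp[~temp.index.duplicated()]
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)
pred_words_lists = df.apply(lambda x: common_elements(x.iloc[2].replace(',','').split(), list(x.iloc[3:].dropna())), axis = 1)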
Does this do what you asked?
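The lookup in the edit relies on `Series.reindex` returning NaN for labels it cannot find. A minimal sketch on invented word lists, showing each intermediate step:

```python
import pandas as pd

# Invented inputs for illustration.
description_words = ["polynomial", "and", "rational", "functions", "and", "series"]
predicted_words = ["series", "global", "polynomial"]

# Position of each description word; keep only the first occurrence,
# because reindex raises on duplicate index labels.
positions = pd.Series(range(len(description_words)), index=description_words)
positions = positions[~positions.index.duplicated()]

# Look up each predicted word; words absent from the description become NaN.
looked_up = positions.reindex(predicted_words)

# Drop the misses and order the hits by where they occur in the description.
common = list(looked_up.dropna().sort_values().index)
print(common)  # ['polynomial', 'series']
```

This orders the hits by their position in the description. Ordering by prediction rank instead means building the Series from the predicted words and reindexing with the description words.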
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',','').split()
    predicted_words_list = list(predicted_words)
    # Map each predicted word to its rank, keeping the first occurrence of duplicates.
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    # Keep the description words that were predicted, ordered by rank.
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    # Drop repeats and return at most k words.
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list
df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
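The call above yields one array of words per row. To reach the column layout from the question, those results still need to be spread into `top_description_word_1 ... top_description_word_5`. A minimal sketch of one way to do that, assuming `k = 5` and a made-up two-row frame; the function is repeated in condensed form so the block runs on its own.

```python
import pandas as pd

def get_sorted_descriptions_words(course_description, predicted_words, k):
    # Same logic as the adapted solution, condensed.
    description_words = course_description.replace(',', '').split()
    words = list(predicted_words)
    ranks = pd.Series(range(len(words)), index=words)
    ranks = ranks[~ranks.index.duplicated()]
    ordered = ranks.reindex(description_words).dropna().sort_values()
    return pd.Series(ordered.index).unique()[:k]

k = 5  # number of description words to keep, as in the question
df = pd.DataFrame({  # made-up rows for illustration
    "course_description": [
        "mathematical induction, exponential and logarithmic functions",
        "ethical and political philosophy",
    ],
    "predicted_word_1": ["global", "ethical"],
    "predicted_word_2": ["induction", "speakers"],
    "predicted_word_3": ["exponential", "liberal"],
})

# One list of hits per row; lists may be shorter than k.
hits = df.apply(
    lambda x: list(get_sorted_descriptions_words(
        x["course_description"], x.filter(regex=r"predicted_word_.*"), k)),
    axis=1,
)

# Spread the ragged lists into columns; missing slots are filled with nulls,
# and reindex guarantees exactly k columns even if no row has k hits.
top_words = pd.DataFrame(hits.tolist(), index=df.index).reindex(columns=range(k))
top_words.columns = [f"top_description_word_{i}" for i in range(1, k + 1)]

final_df = (
    df[["course_description"]]
    .join(top_words)
    .join(df.filter(regex=r"predicted_word_.*"))
)
print(final_df.iloc[0].tolist())
```

Because the function only strips commas before splitting, words followed by a period (such as `functions.`) will not match; stripping other punctuation as well may be worth considering.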