Remove specific duplicates from a df / list of lists

Time: 2021-01-28 10:09:12

Tags: python pandas

I have the following pandas df (a dummy df; the original file has about 50,000 rows).

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)

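I want to return a list of lists. Each inner list should contain two correct (is_correct = 1.0) answers (a1_correct and a2_correct) and one incorrect (is_correct = 0.0) answer (a_incorrect) from the same question. Important: if a1_correct equals a2_correct, skip that question; I don't want a1_correct and a2_correct to be duplicates. There should be one inner list per question_id. The other answers within a question_id can simply be ignored.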

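Edge cases:

  • All answers are correct -> skip this question
  • All correct answers are duplicates -> skip this question
  • No answer is correct -> skip this question, i.e. output None. See question_id = 5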
  • Only one answer is correct -> skip this question, i.e. output None. See question_id = 6

The output I would like:

[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]

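My current approach still includes duplicates. How can I fix that? Should I first remove the duplicates from the df and then build the list of lists?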

import builtins

def create_triplet(grp):
    is_correct = grp['is_correct'] == 1.0
    is_wrong = grp['is_correct'] == 0.0
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
      a1_correct = grp['answer'][is_correct].iloc[0]
      a2_correct = grp['answer'][is_correct].iloc[1]
      #here I tried to ignore duplicates but it doesn't work
      if a1_correct == a2_correct:
        return
      else: grp['answer'][is_correct].iloc[1]
      incorrect = grp['answer'][is_wrong].iloc[0]
      return [a1_correct, a2_correct, incorrect]

triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))

1 answer:

Answer 0 (score: 1)

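Since you don't want any duplicates among the correct answers, call drop_duplicates() on the correct answers before picking 2 of them. Any 2 answers picked from the result are then guaranteed to be distinct. Next, pick (up to) 2 correct answers in whatever way you prefer, and pick the wrong answer the same way.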

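Once the correct and wrong answers have been selected, create_triplet should, if I understand correctly, only return something when there are 2 correct answers and 1 wrong answer to return. Checking the lengths with len(), for example, handles this nicely.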
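To show the two steps in isolation, here is a minimal sketch that uses only the correct answers of question_id 4 and question_id 1 from the sample data. The variable names are made up for illustration. Note that drop_duplicates() keeps the first occurrence and compares strings exactly, so 'Paris' and 'paris' count as different answers.

```python
import pandas as pd

# Correct answers of question_id 4 from the sample data, in row order
correct_q4 = pd.Series(['The capital is Paris', 'Paris', 'paris'])
# Correct answers of question_id 1: three copies of the same string
correct_q1 = pd.Series(['hello', 'hello', 'hello'])

# Drop duplicates first, then take at most 2 of what is left
picked_q4 = list(correct_q4.drop_duplicates().iloc[:2])
picked_q1 = list(correct_q1.drop_duplicates().iloc[:2])

print(picked_q4)            # ['The capital is Paris', 'Paris']
print(picked_q1)            # ['hello']
# Only question 4 has 2 distinct correct answers to offer
print(len(picked_q4) == 2)  # True
print(len(picked_q1) == 2)  # False
```

Because the slice runs after deduplication, a question whose correct answers are all identical ends up with a single candidate and fails the length check.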

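I slightly modified the code you provided. It now produces the expected output, except that for question_id 4 it picks 'Paris' rather than 'paris' as the second correct answer (both differ from the first one).

The code contains some comments, and example output follows it, to clarify what it does.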

import pandas as pd

def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Do the same for the wrong answers, except take at most 1 wrong answer.
    # This is the same selection written in one line;
    # it may or may not be easier to read, use whichever you prefer.
    # Note: drop_duplicates is not necessary here
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None


columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]

triplets_raw = df.groupby('question_id').apply(create_triplet)
# triplets_raw is a pandas Series whose values are either
# a list of valid responses or None.
# dropna() removes the rows holding None, leaving only rows with lists.
# The resulting Series is then converted to a list, as required.
triplets_list = list(triplets_raw.dropna())

Some output:

>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
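
One detail worth checking against the real file: the question's code compares is_correct with the float 1.0, whereas the sample data and the code above use the string '1.0'. The following is a sketch only, assuming the column may hold either type; build_triplets is a hypothetical helper name, and the small DataFrame is a cut-down version of the sample data with float flags. It loops over the groups directly instead of using apply, so no None values need to be filtered out afterwards.

```python
import pandas as pd

def build_triplets(df):
    # Assumption: is_correct may be stored as strings ('1.0') or floats (1.0);
    # pd.to_numeric makes both comparable as numbers.
    flags = pd.to_numeric(df['is_correct'])
    triplets = []
    # sort=False keeps the question_ids in order of first appearance
    for _, grp in df.groupby('question_id', sort=False):
        grp_flags = flags.loc[grp.index]
        # Up to 2 distinct correct answers, in row order
        correct = grp.loc[grp_flags == 1.0, 'answer'].drop_duplicates().tolist()[:2]
        # Up to 1 wrong answer
        wrong = grp.loc[grp_flags == 0.0, 'answer'].tolist()[:1]
        # Keep the question only if a full triplet is available
        if len(correct) == 2 and len(wrong) == 1:
            triplets.append(correct + wrong)
    return triplets

df_float = pd.DataFrame({
    'question_id': ['1', '1', '2', '2', '2', '6', '6'],
    'answer': ['hello', 'hello', 'dog', 'cat', 'the answer is cat', '5.5', '5.2'],
    'is_correct': [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})
print(build_triplets(df_float))  # [['cat', 'the answer is cat', 'dog']]
```

Questions 1 and 6 are skipped here for the same reasons as in the output above: question 1 has only one distinct correct answer, and question 6 has only one correct answer at all.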