Extracting information from a DF column based on chunks of a sequence

Time: 2019-03-29 17:25:42

Tags: python pandas dataframe

I've run into a complicated problem that I haven't been able to solve over the past few days.

Given the following DF,

[image: input DataFrame]

I would like to end up with:

[image: desired output DataFrame]

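Essentially, we look in the 'user_entry_note' column to see whether certain chunks of a given sequence have been successfully entered in order.

To get the chunks from a sequence, I use the following function: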

import itertools

def get_chunks_from_seq(seq_id):
    """Work out all the possible chunks of a sequence, in order."""
    # tidy_string is a helper not shown here
    tidy = tidy_string(seq_id)

    # every contiguous slice tidy[i:j] with i < j
    ord_chunks = [tidy[i:j] for i, j in itertools.combinations(range(len(tidy) + 1), 2)]

    return ord_chunks

This returns a list of all the possible chunks, in order.
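To make the output concrete, here is a minimal sketch of the same enumeration. `tidy_string` is not shown in the question, so `parse_seq` below is a hypothetical stand-in that assumes the sequence is a dash-separated string of integers such as '60-40-30'.

```python
import itertools


def parse_seq(seq_id):
    """Hypothetical stand-in for tidy_string: '60-40-30' -> [60, 40, 30]."""
    return [int(note) for note in seq_id.split("-")]


def contiguous_chunks(notes):
    """Return every contiguous slice notes[i:j] with i < j."""
    bounds = range(len(notes) + 1)
    return [notes[i:j] for i, j in itertools.combinations(bounds, 2)]


chunks = contiguous_chunks(parse_seq("60-40-30"))
print(chunks)
# [[60], [60, 40], [60, 40, 30], [40], [40, 30], [30]]
```

A sequence of length n yields n(n+1)/2 chunks, so the three-note sequence gives six, with the full sequence included as one of them.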

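Now I'm struggling to achieve my various goals without using a huge number of dataframes. I think I may have missed a trick early on in the process.

Here 'seq' is the original sequence and a 'chunk' is a component of that sequence. The whole sequence also becomes a chunk in the second stage.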

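For each 'chunk', I want to know the trial_ms at which it was 'completed' (played in order in the 'user_entry_note' column), along with the values of user_entries_error_no and user_entries_plybs at that point.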
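One possible way to get this with a single result frame is sketched below. It assumes that 'completed' means the chunk's notes appear on consecutive rows of 'user_entry_note', and `find_completion_idx` is a hypothetical name, not the `find_idx` helper from the question. The data is the first five rows of the reproducible DF given at the end of the question.

```python
import pandas as pd


def find_completion_idx(chunk, notes):
    """Index of the row where `chunk` is first completed, or None.

    Assumption: the chunk must appear as consecutive entries in `notes`.
    """
    n = len(chunk)
    for end in range(n - 1, len(notes)):
        if notes[end - n + 1:end + 1] == chunk:
            return end
    return None


# First five rows of the relevant columns from the reproducible DF.
df = pd.DataFrame({
    "user_entry_note": [23, 60, 40, 30, 40],
    "trial_ms": [-9223372037, -18963961, 31992270, -13028311, -18963961],
    "user_entries_error_no": [1, 2, 6, 2, 3],
    "user_entries_plybs": [2, 3, 3, 2, 3],
})

notes = df["user_entry_note"].tolist()
chunks = [[60], [60, 40], [60, 40, 30], [40], [40, 30], [30]]
value_cols = ["trial_ms", "user_entries_error_no", "user_entries_plybs"]

records = []
for chunk in chunks:
    idx = find_completion_idx(chunk, notes)
    if idx is None:
        continue  # this chunk was never completed
    records.append({
        "chunk": "-".join(map(str, chunk)),
        "chunk_len": len(chunk),
        "chunk_completed": idx,
        **df.loc[idx, value_cols].to_dict(),  # values at the completing row
    })

result = pd.DataFrame(records)
print(result)
```

Collecting plain dict records and building the frame once at the end avoids the intermediate frames and the join.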

I managed to do this with:

# get a list of the possible chunks based on the sequence
chunks = get_chunks_from_seq(df1['seq'][0])

# row index at which each chunk is completed (find_idx is a helper not shown here)
h = [find_idx(seq, df1, 'user_entry_note') for seq in chunks]

# list of the chunks themselves
h2 = list(chunks)

# column of chunk lens
h3 = [len(seq) if isinstance(seq, list) else 1 for seq in chunks]

# create strings of these (list_to_string is a helper not shown here)
h2_str = [list_to_string(p) if isinstance(p, list) else str(p) for p in h2]

# make df to format them
df1_2 = pd.DataFrame({'chunk_idx__completion_in_trial': h, 'chunk': h2_str, 'chunk_len': h3})

# sub df
subdf1 = ['user_id', 'timecode', 'user_entries_error_no', 'user_entries_plybs']
df1_3 = df1.iloc[h, :][subdf1].reset_index()

# tie everything together
keep = ['chunk', 'user_id', 'timecode', 'user_entries_error_no', 'user_entries_plybs']
df2 = df1_2.join(df1_3)[keep]

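But I think I need to abandon this approach to achieve my second goal, and this is where I'm stuck.

In addition to this, I want to know when (trial_ms) each note in the chunk was played as part of the occurrence that completed the chunk (and not when those notes may have come up earlier).

In other words, in the example below:

[image: example DataFrame]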

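For the chunk '40-30', n1 would be index 7 and n2 would be index 8, since that chunk was completed at index 8. It is irrelevant that 40 came up at index 2. However, index 2 would be the correct index for n1 of the chunk '40', which in that case (as for all chunks where n = 1) would also equal the 'chunk_completed' column.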
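The rule described above can be sketched as follows, again assuming a chunk counts as completed only when its notes sit on consecutive rows. Under that assumption the per-note indices are the last len(chunk) rows ending at the completion index. The function name `note_indices` and the `notes` list are hypothetical: the list is made up to match the indices mentioned in the text and is not the data from the original image.

```python
def note_indices(chunk, notes):
    """Row index of each note (n1, n2, ...) in the first occurrence
    that completes `chunk`, or None if it is never completed.

    Assumption: the chunk must appear as consecutive entries in `notes`.
    """
    n = len(chunk)
    for end in range(n - 1, len(notes)):
        start = end - n + 1
        if notes[start:end + 1] == chunk:
            return list(range(start, end + 1))
    return None


# Illustrative notes only: 40 at index 2, then 40-30 at indices 7 and 8.
notes = [23, 60, 40, 3, 3, 2, 4, 40, 30]

print(note_indices([40, 30], notes))  # [7, 8]
print(note_indices([40], notes))      # [2]
```

With the indices in hand, the matching times could be read with something like `df1['trial_ms'].iloc[indices]`, and the last index is the chunk's completion row.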

Reproducible DF:

import pandas as pd

f = {'seq': {0: '60-40-30',
  1: '60-40-30',
  2: '60-40-30',
  3: '60-40-30',
  4: '60-40-30',
  5: '60-40-30',
  6: '60-40-30',
  7: '60-40-30',
  8: '60-40-30'},
 'seq_len': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
 'seq_list': {0: [60, 40, 30],
  1: [60, 40, 30],
  2: [60, 40, 30],
  3: [60, 40, 30],
  4: [60, 40, 30],
  5: [60, 40, 30],
  6: [60, 40, 30],
  7: [60, 40, 30],
  8: [60, 40, 30]},

 'trial_ms': {0: -9223372037,
  1: -18963961,
  2: 31992270,
  3: -13028311,
  4: -18963961,
  5: 31992270,
  6: -13028311,
  7: -18963961,
  8: 31992270},
 'user_entries_error_no': {0: 1,
  1: 2,
  2: 6,
  3: 2,
  4: 3,
  5: 3,
  6: 3,
  7: 2,
  8: 4},
 'user_entries_plybs': {0: 2, 1: 3, 2: 3, 3: 2, 4: 3, 5: 3, 6: 1, 7: 1, 8: 4},
 'user_entry_note': {0: 23,
  1: 60,
  2: 40,
  3: 30,
  4: 40,
  5: 3,
  6: 3,
  7: 2,
  8: 4},
 'user_id': {0: 'seb',
  1: 'seb',
  2: 'seb',
  3: 'seb',
  4: 'seb',
  5: 'seb',
  6: 'seb',
  7: 'seb',
  8: 'seb'}}



df1 = pd.DataFrame.from_dict(f)

0 answers:

No answers