Multi-indexing a pandas DataFrame

Date: 2017-09-12 05:06:19

Tags: python pandas

I'd like to know how to build a MultiIndex on a DataFrame from a column of lists, grouping the list items by the elements of another column.

Since this is best shown by example, here is a script with what I have and what I want:

import pandas as pd

def ungroup_column(df, column, split_column = None):
    '''
    # Summary
        Takes a dataframe column that contains lists and spreads the items in the list over many rows
        Similar to pandas.melt(), but acts on lists within the column

    # Example

        input dataframe:

                farm_id animals
            0   1       [pig, sheep, dog]
            1   2       [duck]
            2   3       [pig, horse]
            3   4       [sheep, horse]


        output dataframe:

                farm_id animals
            0   1       pig
            0   1       sheep
            0   1       dog
            1   2       duck
            2   3       pig
            2   3       horse
            3   4       sheep
            3   4       horse

    # Arguments

        df: (pandas.DataFrame)
            dataframe to act upon

        column: (String)
            name of the column which contains lists to separate

        split_column: (String)
            column to be added to the dataframe containing the split items that were in the list
            If this is not given, the values will be written over the original column
    '''
    if split_column is None:
        split_column = column

    # split column into multiple columns (one col for each item in list) for every row
    # then transpose it to make the lists go down the rows
    list_split_matrix = df[column].apply(pd.Series).T

    # Now the columns of `list_split_matrix` (they're just integers)
    # are the indices of the rows in `df` - i.e. `df_row_idx`
    # so this melt concats each column on top of each other
    melted_df = pd.melt(list_split_matrix, var_name = 'df_row_idx', value_name = split_column).dropna().set_index('df_row_idx')

    if split_column == column:
        df = df.drop(column, axis = 1)
    df = df.join(melted_df)
    return df
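As an aside: newer pandas versions (0.25+) ship `DataFrame.explode`, which does essentially what `ungroup_column` implements by hand. A minimal sketch on the docstring's farm example:

```python
import pandas as pd

df = pd.DataFrame({
    'farm_id': [1, 2, 3, 4],
    'animals': [['pig', 'sheep', 'dog'], ['duck'], ['pig', 'horse'], ['sheep', 'horse']],
})

# One row per list item; the original row index is repeated for each item.
exploded = df.explode('animals')
```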

from IPython.display import display

import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')  # assuming a spaCy English model; the original snippet left `nlp` undefined
doc_texts = ['Here is a sentence. And Another. Yet another sentence.',
             'Different Document here. With some other sentences.']
playing_df = pd.DataFrame({'doc':[nlp(doc) for doc in doc_texts],
                           'sentences':[[s for s in nlp(doc).sents] for doc in doc_texts]})
display(playing_df)
display(ungroup_column(playing_df, 'sentences'))

The output looks like this:

doc sentences
0   (Here, is, a, sentence, ., And, Another, ., Ye...   [(Here, is, a, sentence, .), (And, Another, .)...
1   (Different, Document, here, ., With, some, oth...   [(Different, Document, here, .), (With, some, ...

doc sentences
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (Here, is, a, sentence, .)
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (And, Another, .)
0   (Here, is, a, sentence, ., And, Another, ., Ye...   (Yet, another, sentence, .)
1   (Different, Document, here, ., With, some, oth...   (Different, Document, here, .)
1   (Different, Document, here, ., With, some, oth...   (With, some, other, sentences, .)

But what I really want is an index for the 'sentences' column, e.g.:

doc_idx   sent_idx     document                                           sentence
0         0            (Here, is, a, sentence, ., And, Another, ., Ye...   (Here, is, a, sentence, .)
          1            (Here, is, a, sentence, ., And, Another, ., Ye...   (And, Another, .)
          2            (Here, is, a, sentence, ., And, Another, ., Ye...   (Yet, another, sentence, .)
1         0            (Different, Document, here, ., With, some, oth...   (Different, Document, here, .)
          1            (Different, Document, here, ., With, some, oth...   (With, some, other, sentences, .)

1 Answer:

Answer 0: (score: 1)

Based on your second output, you can reset the index, then set the index from the cumcount of the current index, and finally rename the axis, i.e.:

new_df = ungroup_column(playing_df, 'sentences').reset_index()
new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx'])

Output:

                                                               doc       sents
doc_idx sent_idx                                                      
0       0         [Here, is, a, sentence, ., And, Another, ., Ye...     Here is a sentence.
        1         [Here, is, a, sentence, ., And, Another, ., Ye...     And Another.  
        2         [Here, is, a, sentence, ., And, Another, ., Ye...     Yet another sentence.  
1       0         [Different, Document, here, ., With, some, oth...     Different Document here.
        1         [Different, Document, here, ., With, some, oth...     With some other sentences.  
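The reset_index/cumcount/set_index trick generalizes to any frame with a duplicated index. A self-contained sketch using a toy frame in place of the question's data (column names here are illustrative):

```python
import pandas as pd

# toy frame with a duplicated index, as produced by ungroup_column
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e']},
                  index=[0, 0, 0, 1, 1])

new_df = df.reset_index()                               # duplicated index becomes an 'index' column
new_df['sent_idx'] = new_df.groupby('index').cumcount() # 0, 1, 2, ... within each group
result = (new_df.set_index(['index', 'sent_idx'])
                .rename_axis(['doc_idx', 'sent_idx']))
```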

Instead of applying pd.Series, you can expand the column with np.concatenate. (I used nltk to tokenize the words and sentences.)

import nltk
import numpy as np
import pandas as pd

doc_texts = ['Here is a sentence. And Another. Yet another sentence.',
        'Different Document here. With some other sentences.']
playing_df = pd.DataFrame({'doc':[nltk.word_tokenize(doc) for doc in doc_texts],
                      'sents':[nltk.sent_tokenize(doc) for doc in doc_texts]})

s = playing_df['sents']
i = np.arange(len(playing_df)).repeat(s.str.len())

new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index()

new_df['sent_idx'] = new_df.groupby('index').cumcount()
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx'])
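The repeat/concatenate mechanism above can also be seen in isolation, without nltk. A minimal sketch with plain string lists (the names `d0`, `s0a`, etc. are just placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'doc': ['d0', 'd1'],
                   'sents': [['s0a', 's0b', 's0c'], ['s1a', 's1b']]})

s = df['sents']
i = np.arange(len(df)).repeat(s.str.len())  # each row index repeated once per sentence

# repeat the non-list columns, then overwrite 'sents' with the flattened lists
flat = df.iloc[i, :-1].assign(sents=np.concatenate(s.values))
```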

Hope it helps.