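Pandas groupby result as a DataFrame

Time: 2018-04-04 12:13:56

Tags: python-3.x pandas-groupby

I am working with data that contains duplicates. If a row's "similarity_index" equals that of another row, the two rows are duplicates of each other. I am trying to merge these duplicates.

This is my DataFrame: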

           ad    soyad similarity_index
0       hakan  özdemir                0
1       hasan    yaman                1
2        naci    şenli                2
3      naciye      şen                2
4       osman    uygur                3
5        elif    sözen                4
6        irem   derici                5

This is what I tried:

test_df.set_index("similarity_index").sort_index()

This is the output:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
0                 hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
2                     naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
5                       irem   delici
6                       hako  özdemir

This is what I want:

                          ad    soyad
similarity_index                     
0                      hakan  özdemir
                  hakan utku  özdemir
1                      hasan    yaman
2                       naci    şenli
                      naciye      şen
3                      osman    uygur
4                       elif    sözen
5                       irem   derici
                        irem   delici
6                       hako  özdemir

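With this, I am trying to select the duplicate rows that share the same index. I tried groupby() and pivot_table(), but I could not find a suitable approach.

1 answer:

Answer 0 (score: 1)

What you want is actually a customization of pandas' default index behavior.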
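One way to get close to the layout in the question, offered as a minimal sketch rather than the answerer's own method, is to build a two-level index. When pandas prints a MultiIndex it leaves repeated outer labels blank by default (the `display.multi_sparse` option). The level name `dup` and the variable names below are made up for illustration, and the sample rows are copied from the question's output.

```python
import pandas as pd

# Sample rows taken from the question's output.
df = pd.DataFrame(
    [
        ["hakan", "özdemir", 0],
        ["hakan utku", "özdemir", 0],
        ["hasan", "yaman", 1],
        ["naci", "şenli", 2],
        ["naciye", "şen", 2],
    ],
    columns=["ad", "soyad", "similarity_index"],
)

# Number the rows inside each duplicate group: 0, 1, ... per similarity_index.
# "dup" is a hypothetical name chosen for this sketch.
position = df.groupby("similarity_index").cumcount().rename("dup")

# Two-level index: the group key first, then the position inside the group.
grouped_view = df.set_index(["similarity_index", position]).sort_index()

# Repeated similarity_index labels are left blank in the printed output.
print(grouped_view)
```

The extra `dup` level shows up as a second index column, so the printout is not identical to the one requested. Using `df.set_index(["similarity_index", "ad"])` instead avoids the helper level, at the cost of moving `ad` into the index.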

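import pandas as pd

def index_duplicates_with_same_index(df, index, column_name):
    # Return every row whose value in column_name equals index.
    return df[df[column_name] == index]

df = pd.DataFrame([['hakan', 'özdemir', 0], ['hasan', 'yaman', 1], ['naci', 'şenli', 2], ['naciye', 'şen', 2]],
                  columns=['ad', 'soyad', 'similarity_index'])
print(df)
print(index_duplicates_with_same_index(df, 2, 'similarity_index'))  # rows of group 2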
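The function above picks out one group at a time. If the goal is to collect every row that belongs to some duplicate group in a single step, `DataFrame.duplicated` with `keep=False` may be enough. This is a sketch on the same sample data; the names `is_duplicate` and `duplicates` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["hakan", "özdemir", 0],
        ["hasan", "yaman", 1],
        ["naci", "şenli", 2],
        ["naciye", "şen", 2],
    ],
    columns=["ad", "soyad", "similarity_index"],
)

# keep=False marks every member of a repeated group, not only the later ones.
is_duplicate = df.duplicated(subset="similarity_index", keep=False)
duplicates = df[is_duplicate]

# Walk the duplicate groups one at a time.
for key, group in duplicates.groupby("similarity_index"):
    print(key, group["ad"].tolist())
```

Rows whose `similarity_index` appears only once are dropped by the boolean mask, so the loop visits only the groups that actually contain duplicates.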
