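I am working with data that contains duplicates. Two rows are duplicates of each other when they have the same "similarity_index" value. I am trying to group these duplicates together.
Here is my DataFrame: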
ad soyad similarity_index
0 hakan özdemir 0
1 hasan yaman 1
2 naci şenli 2
3 naciye şen 2
4 osman uygur 3
5 elif sözen 4
6 irem derici 5
Here is what I tried:
test_df.set_index("similarity_index").sort_index()
This is the output:
ad soyad
similarity_index
0 hakan özdemir
0 hakan utku özdemir
1 hasan yaman
2 naci şenli
2 naciye şen
3 osman uygur
4 elif sözen
5 irem derici
5 irem delici
6 hako özdemir
This is what I want:
ad soyad
similarity_index
0 hakan özdemir
hakan utku özdemir
1 hasan yaman
2 naci şenli
naciye şen
3 osman uygur
4 elif sözen
5 irem derici
irem delici
6 hako özdemir
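With this layout, I am trying to select the duplicate rows that share the same index value. I tried groupby() and pivot_table(), but I could not find a suitable approach.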
Answer 0: (score: 1)
What you want is essentially a customization of pandas' default indexing behavior.
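import pandas as pd
def index_duplicates_with_same_index(df, index, column_name):
    # Return every row whose value in `column_name` equals `index`.
    return df[df[column_name] == index]
df = pd.DataFrame([['hakan', 'özdemir', 0], ['hasan', 'yaman', 1], ['naci', 'şenli', 2], ['naciye', 'şen', 2]], columns=['ad', 'soyad', 'similarity_index'])
print(df)
# Select the rows that share similarity_index 2 (naci and naciye).
print(index_duplicates_with_same_index(df, 2, 'similarity_index'))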
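If the goal is the grouped display shown in the question, where a repeated similarity_index is printed only once, one possible approach is a two-level index. The sketch below is not part of the original answer: it assumes the same small sample frame, and the level name "dup_no" is a made-up label for a per-group counter built with groupby().cumcount(). pandas normally hides repeated outer labels when printing a MultiIndex, which gives a layout close to the desired one.

```python
import pandas as pd

df = pd.DataFrame(
    [["hakan", "özdemir", 0], ["hasan", "yaman", 1],
     ["naci", "şenli", 2], ["naciye", "şen", 2]],
    columns=["ad", "soyad", "similarity_index"],
)

# Rows whose similarity_index appears more than once in the frame.
duplicates = df[df.duplicated("similarity_index", keep=False)]

# Number the rows inside each similarity_index group: 0, 1, 2, ...
# "dup_no" is a hypothetical name chosen for this illustration.
dup_no = df.groupby("similarity_index").cumcount().rename("dup_no")

# Use (similarity_index, dup_no) as a two-level index so that rows with
# the same similarity_index sit together under one outer label.
grouped = df.set_index(["similarity_index", dup_no]).sort_index()

print(duplicates)
print(grouped)
```

With this index, grouped.loc[2] returns all rows that share similarity_index 2, so the duplicates can still be selected by their common index value.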