Question

Hello, I am trying to find the difference between two Series, but the result comes back empty.

new_main_file = pd.read_excel('result_concat.xlsx', encoding='utf-8')
new_main_file.Title.count()        # => 11 620
len(new_main_file.Title.unique())  # => 10 436

# Difference
pd.Series(list(set(new_main_file.Title) - set(new_main_file.Title.unique())))
# Series([], dtype: float64)

I want to find out which titles are duplicated.

Answer 1

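set() and .unique() do the same thing: both drop repeated values, so set(df.col) == set(df.col.unique()). Subtracting a set from an identical set always gives an empty set, which is why you get an empty Series back. To find the duplicated titles, count how often each value occurs instead.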

import pandas as pd
import numpy as np

# data
# ========================================================
np.random.seed(0)
df = pd.DataFrame(np.random.choice(list('abcdefghigk'), size=20), columns=['col'])
df

   col
0    f
1    a
2    d
3    d
4    h
5    g
6    d
7    f
8    c
9    e
10   h
11   g
12   i
13   i
14   k
15   b
16   g
17   h
18   h
19   i


df['col'].count()  # output 20
len(df['col'].unique())  # output 10
set(df.col)
# output {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'}
set(df.col.unique())
# output {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k'}
set(df.col) - set(df.col.unique())
# output set()


# processing
# ======================================================
res = df['col'].value_counts()

h    4
i    3
d    3
g    3
f    2
b    1
k    1
c    1
e    1
a    1
dtype: int64

# duplicated titles
res.index[res>1].tolist()

['h', 'i', 'd', 'g', 'f']
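
As an alternative to value_counts(), here is a minimal sketch using Series.duplicated(). The `titles` Series is a hypothetical stand-in for new_main_file.Title, filled with the same 20 letters as the example above. I have not run this against the asker's Excel file.

```python
import pandas as pd

# Hypothetical stand-in for new_main_file.Title (same values as df['col'] above)
titles = pd.Series(['f', 'a', 'd', 'd', 'h', 'g', 'd', 'f', 'c', 'e',
                    'h', 'g', 'i', 'i', 'k', 'b', 'g', 'h', 'h', 'i'])

# keep=False flags every occurrence of a value that appears more than once
mask = titles.duplicated(keep=False)

# unique() collapses the flagged rows to one entry per title,
# in order of first appearance
duplicated_titles = titles[mask].unique().tolist()
print(duplicated_titles)
# ['f', 'd', 'h', 'g', 'i']
```

This gives the same five titles as the value_counts() approach, only ordered by first appearance rather than by frequency. If you want to see the full duplicated rows rather than just the titles, new_main_file[new_main_file.Title.duplicated(keep=False)] should work the same way.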

Pandas: difference between two Series returns nothing
