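I'm trying to determine which timestamps in my index are duplicated. I'd like to build a list of timestamp strings, with one entry for each timestamp that has duplicates, if possible.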
#required packages
import os
import pandas as pd
import numpy as np
import datetime
# create sample time series
header = ['A','B','C','D','E']
period = 5
cols = len(header)
dates = pd.date_range('1/1/2000', periods=period, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=period, freq='10min')
df = pd.DataFrame(np.random.randn(period,cols),index=dates,columns=header)
df0 = pd.DataFrame(np.random.randn(period,cols),index=dates2,columns=header)
df1 = pd.concat([df]*3) #creates duplicate entries by copying the dataframe
df1 = pd.concat([df1, df0])
df2 = df1.sample(frac=1) #shuffles the dataframe
df3 = df1.sort_index() #sorts the dataframe by index
print(df2)
#print(df3)
# Identifying duplicated entries
df4 = df2.duplicated()
print(df4)
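I'd then like to use that list to pull up all of the duplicate entries for each timestamp. Given the code above, is there a good way to get the index labels that go with the boolean values?
Edit: added an extra dataframe to create some unique values, and tripled the first dataframe to create multiple duplicates. Also added more detail to the question.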
Answer 0 (score: 1)
IIUC:
df4[~df4]
Output:
2000-01-01 00:10:00 False
2000-01-01 00:00:00 False
2000-01-01 00:40:00 False
2000-01-01 00:30:00 False
2000-01-01 00:20:00 False
dtype: bool
List of timestamps:
df4[~df4].index.tolist()
Output:
[Timestamp('2000-01-01 00:10:00'),
Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:20:00')]
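Note that `df4[~df4]` keeps the rows that `duplicated()` did not flag, that is, the first occurrence of every row, including rows that never repeat. If the goal is one string per timestamp that actually repeats in the index, a minimal sketch along the following lines may be closer to what was asked. It assumes the duplicates are defined by the index rather than by the column values, and it rebuilds a frame shaped like the question's `df2` with a fixed seed (the seed and the `dup_index` / `dup_strings` names are illustrative, not from the original post).

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the question's df2 (seeded so it is repeatable).
rng = np.random.default_rng(0)
header = ['A', 'B', 'C', 'D', 'E']
dates = pd.date_range('1/1/2000', periods=5, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=5, freq='10min')
df = pd.DataFrame(rng.standard_normal((5, 5)), index=dates, columns=header)
df0 = pd.DataFrame(rng.standard_normal((5, 5)), index=dates2, columns=header)
df2 = pd.concat([df] * 3 + [df0]).sample(frac=1, random_state=0)

# Index.duplicated() flags every repeat after the first occurrence;
# unique() then collapses the flagged labels to one entry per repeated timestamp.
dup_index = df2.index[df2.index.duplicated()].unique().sort_values()

# One string per timestamp that occurs more than once in the index.
dup_strings = dup_index.strftime('%Y-%m-%d %H:%M:%S').tolist()
print(dup_strings)
```

The 2022 timestamps occur only once, so they are left out of `dup_strings`.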
Answer 1 (score: 1)
In [46]: df2.drop_duplicates()
Out[46]:
                            A         B         C         D         E
2000-01-01 00:00:00  0.932587 -1.508587 -0.385396 -0.692379  2.083672
2000-01-01 00:40:00  0.237324 -0.321555 -0.448842 -0.983459  0.834747
2000-01-01 00:20:00  1.624815 -0.571193  1.951832 -0.642217  1.744168
2000-01-01 00:30:00  0.079106 -1.290473  2.635966  1.390648  0.206017
2000-01-01 00:10:00  0.760976  0.643825 -1.855477 -1.172241  0.532051
In [47]: df2.drop_duplicates().index.tolist()
Out[47]:
[Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:20:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:10:00')]
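`drop_duplicates()` compares column values and keeps one row per distinct set of values, so it does not by itself bring back every repeated entry for a timestamp. For the second half of the question (pulling up all duplicate entries per timestamp), one possible sketch is below. It assumes the index defines what counts as a duplicate, and the `dup_rows` and `groups` names are illustrative rather than taken from the thread.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the question's df2 (seeded so it is repeatable).
rng = np.random.default_rng(1)
header = ['A', 'B', 'C', 'D', 'E']
dates = pd.date_range('1/1/2000', periods=5, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=5, freq='10min')
df = pd.DataFrame(rng.standard_normal((5, 5)), index=dates, columns=header)
df0 = pd.DataFrame(rng.standard_normal((5, 5)), index=dates2, columns=header)
df2 = pd.concat([df] * 3 + [df0]).sample(frac=1, random_state=1)

# keep=False marks every row whose timestamp occurs more than once,
# including the first occurrence.
dup_rows = df2[df2.index.duplicated(keep=False)].sort_index()

# Map each repeated timestamp to the block of rows that share it.
groups = {ts: rows for ts, rows in dup_rows.groupby(level=0)}
for ts, rows in groups.items():
    print(ts, len(rows))
```

A single timestamp can also be looked up directly with `df2.loc[ts]`, which returns every row carrying that label when the label is repeated.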