Question

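I have a pandas DataFrame with a sorted numerical index that contains duplicates. Within a given column, the values are identical for rows that share an index value. I would like to iterate over the values of a given column, once per unique index value.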

Example

import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

   a  b
1  3  4
1  3  6
2  5  8

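I would like to iterate over the values in column a, one per unique index entry - [3, 5].

When I iterate over the default index and print the type of each value from column a, I get a Series for the duplicated index entries.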

for i in df.index:
    cell_value = df['a'].loc[i]
    print(type(cell_value))

Output:

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'numpy.int64'>

Answer 1

First assign positions with np.arange, keep only the positions of non-duplicated index entries using a mask, then select with iloc:

import numpy as np

arr = np.arange(len(df.index))   # positions 0..n-1
a = arr[~df.index.duplicated()]  # positions of the first occurrence of each index label
print(a)
[0 2]

for i in a:
    cell_value = df['a'].iloc[i]
    print(type(cell_value))

<class 'numpy.int64'>
<class 'numpy.int64'>

Solution without a loop - boolean indexing with duplicated, using ~ to invert the mask:

a = df.loc[~df.index.duplicated(), 'a']
print (a)
1    3
2    5
Name: a, dtype: int64

b = df.loc[~df.index.duplicated(), 'a'].tolist()
print (b)
[3, 5]

print (~df.index.duplicated())
[ True False  True]
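
For reference, a self-contained sketch of the mask-based approach, assuming the example df from the question. `Index.duplicated` accepts a `keep` argument ('first' by default). The `keep='last'` variant and the names `dup_mask`, `first_a` and `last_b` are only illustrations and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# True for every row whose index label already appeared earlier
dup_mask = df.index.duplicated(keep='first')

# Keep only the first row of each index label, then take column 'a'
first_a = df.loc[~dup_mask, 'a']

# Illustration only: keep the last row of each index label instead
last_b = df.loc[~df.index.duplicated(keep='last'), 'b']

print(first_a.tolist())
print(last_b.tolist())
```

Which row is kept only matters for columns such as b, where rows sharing an index label hold different values.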

Answer 2

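Try np.unique with return_index=True, which also returns the position of the first occurrence of each unique index value:

_, i = np.unique(df.index, return_index=True)
df.iloc[i, df.columns.get_loc('a')].tolist()
# [3, 5]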

Answer 3

If, as per your comment, the same index means the same data, then this looks like an XY Problem.

You also don't need a loop.

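Assuming you want to remove duplicate rows and extract only the first column (i.e. 3, 5), the below should suffice. drop_duplicates compares column values rather than the index, and in the example column b differs between the two rows with index 1, so the comparison is restricted to column a with subset (this also merges different index labels that happen to share a value in a):

res = df.drop_duplicates(subset='a').loc[:, 'a']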

# 1    3
# 2    5
# Name: a, dtype: int64

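To return the types, iterate over the underlying NumPy array (iterating the Series itself can yield plain Python int values in recent pandas versions):

types = list(map(type, res.to_numpy()))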

print(types)
# [<class 'numpy.int64'>, <class 'numpy.int64'>]
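
Whether "same index means same data" actually holds can be checked before deduplicating. A minimal sketch, assuming the example df from the question; the names `distinct_per_label` and `consistent` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# Number of distinct values per index label, for each column
distinct_per_label = df.groupby(level=0).nunique()

# True for a column when every index label maps to exactly one value
consistent = distinct_per_label.eq(1).all()
print(consistent)
```

For the example data this reports column a as consistent and column b as not, since the two rows with index 1 hold 4 and 6 in b.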

Answer 4

Another solution, using groupby and apply:

df.groupby(level=0).apply(lambda x: type(x.a.iloc[0]))
Out[330]: 
1    <class 'numpy.int64'>
2    <class 'numpy.int64'>
dtype: object

To make the loop solution work, create a temporary df:

df_new = df.groupby(level=0).first()
for i in df_new.index:
    cell_value = df_new['a'].loc[i]
    print(type(cell_value))

<class 'numpy.int64'>
<class 'numpy.int64'>

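Or use drop_duplicates(), restricted to column a with subset because the rows with index 1 differ in column b:

deduped = df.drop_duplicates(subset='a')  # deduplicate once, outside the loop
for i in deduped.index:
    cell_value = deduped['a'].loc[i]
    print(type(cell_value))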

<class 'numpy.int64'>
<class 'numpy.int64'>
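
If a loop is still wanted, the groupby idea can be combined with `Series.items()` so that no `.loc` lookup is needed inside the loop. This is a sketch rather than part of the original answer, assuming the example df from the question; `first_a` and `pairs` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 3, 5], 'b': [4, 6, 8]}, index=[1, 1, 2])

# First value of column 'a' for each unique index label
first_a = df.groupby(level=0)['a'].first()

# items() yields (index label, value) pairs
pairs = list(first_a.items())
for label, value in pairs:
    print(label, value)
```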

How to iterate over the column values of the unique rows of a DataFrame with a sorted numerical index with duplicates in pandas?
