Question

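I am wondering whether there is any way to use something like hierarchical indexing, but within the data of a pandas table. I am interested in combining several dataframes into one, where some of the dataframes have multiple entries for a single ID in another dataframe.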

As usual, it is best to just show the structure. Here is a simplified dataframe 1:

>>> df1
   id             txt
0   0      first sent
1   1     another one
2   2     I think you
3   3  will like this
4   4       will work

Dataframe 2, meanwhile, may have several attributes for each entry of dataframe 1 (linked through the id):

>>> df2
   attr  id
0  chem   0
1   dis   0
2  chem   1
3  chem   1
4  chem   2
5   dis   2
6   dis   3
7   dis   3
8   dis   4
9  chem   4

So trying this:

import pandas as pd
id = range(0,5)
texts =  ['first sent', 'another one', 'I think you', 'will like this', 'will work']
df = pd.DataFrame({'txt':texts, 'id':id})
df2 = pd.DataFrame({'attr':['chem', 'dis', 'chem', 'chem', 'chem', 'dis', 'dis', 'dis', 'dis', 'chem'] ,'id':[0,0,1,1,2,2,3,3,4,4]})

Merging only gives:

>>> df.merge(df2, on='id')
   id             txt  attr
0   0      first sent  chem
1   0      first sent   dis
2   1     another one  chem
3   1     another one  chem
4   2     I think you  chem
5   2     I think you   dis
6   3  will like this   dis
7   3  will like this   dis
8   4       will work   dis
9   4       will work  chem

Now you can see that the 'txt' column is duplicated, which IMO is unnecessary here and could cause serious memory problems if each id has many attributes in df2. You could end up with (in this case) text data repeated thousands of times over what is needed to represent the data as two separate dataframes.

I thought about making the 'txt' column a level of a hierarchical index (though I am sure that is completely the wrong design consideration), but even then the index entries are still duplicated.

Is there a way to store this information in a single dataframe?

Answer 1

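Here is a memory-efficient solution using pandas categories. The cost for each value in the 'txt' column of the result is now only an integer, which is much cheaper than storing the text string.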

import pandas as pd

ids = range(0,4)
texts =  ['first sent', 'another one', 'I think you', 'will like this']

df = pd.DataFrame({'txt':texts, 'id':ids})
df2 = pd.DataFrame({'attr':['chem', 'dis', 'chem', 'chem', 'chem', 'dis', 'dis', 'dis', 'dis', 'chem'] ,'id':[0,0,1,1,2,2,3,3,4,4]})

# convert to category codes and store mapping
df['txt'] = df['txt'].astype('category')
df_txt_cats = dict(enumerate(df['txt'].cat.categories))
df['txt'] = df['txt'].cat.codes

# perform merge - memory efficient since result only uses integers
df_merged = df.merge(df2, on='id')

# rename categories from integer codes back to the text strings in the stored mapping
# (rename_categories is used because newer pandas versions reject direct assignment to .cat.categories)
df_merged['txt'] = df_merged['txt'].astype('category')
df_merged['txt'] = df_merged['txt'].cat.rename_categories(df_txt_cats)

df_merged.dtypes
# id         int32
# txt     category
# attr      object
# dtype: object
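
One way to check the saving is to compare the deep memory usage of the merged 'txt' column stored as plain strings and stored as a category. This is only a rough sketch: the padded strings and the names `long_txt`, `plain` and `as_cat` are made up for illustration and are not part of the answer above, and the exact byte counts will depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: long strings make the difference easy to see.
long_txt = "x" * 1000
df = pd.DataFrame({"id": range(4),
                   "txt": [f"text {i} {long_txt}" for i in range(4)]})
df2 = pd.DataFrame({"attr": ["chem", "dis"] * 4,
                    "id": [0, 0, 1, 1, 2, 2, 3, 3]})

merged = df.merge(df2, on="id")

# The same column, measured once as strings and once as a categorical.
txt_cat = merged["txt"].astype("category")
plain = merged["txt"].memory_usage(index=False, deep=True)
as_cat = txt_cat.memory_usage(index=False, deep=True)

print("as strings: ", plain)
print("as category:", as_cat)
```

The categorical keeps each distinct string once in `txt_cat.cat.categories` and a small integer code per row, so the gap should widen as the number of attributes per id grows.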

Answer 2

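Your second option is the more memory-efficient one. The reason is that you end up with a MultiIndex, and the actual text values are not repeated in memory. They only appear repeated in the output representation. If you look at the merge_2.index output of the actual DataFrame, you can see that there is no repetition.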

Demo:

# I've added some extra dummy text to show how this works with larger strings
extra_txt = ",".join([str(i) for i in range(5000)])

import pandas as pd
id = range(0,5)
texts =  [
    'first sent' + extra_txt, 
    'another one' + extra_txt,
    'I think you' + extra_txt,
    'will like this' + extra_txt,
    'will work' + extra_txt,
]
df = pd.DataFrame({'txt':texts, 'id':id})
df2 = pd.DataFrame({'attr':['chem', 'dis', 'chem', 'chem', 'chem', 'dis', 'dis', 'dis', 'dis', 'chem']     ,'id':[0,0,1,1,2,2,3,3,4,4]})

merge_1 = df.merge(df2, on='id')
merge_2 = df.merge(df2, on='id').set_index(['id', 'txt'])

Version 1 memory usage:

In []: merge_1.memory_usage(index=True, deep=True).sum()
Out[]: 240335

Version 2 memory usage:

In []: merge_2.memory_usage(index=True, deep=True).sum()
Out[]: 120565
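
To see where the unique values live, it may help to look at the levels and codes of the MultiIndex directly. The sketch below rebuilds the small frames from the question (without the extra dummy text); `txt_level` and `txt_codes` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({"id": range(5),
                   "txt": ["first sent", "another one", "I think you",
                           "will like this", "will work"]})
df2 = pd.DataFrame({"attr": ["chem", "dis", "chem", "chem", "chem",
                             "dis", "dis", "dis", "dis", "chem"],
                    "id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]})

merge_2 = df.merge(df2, on="id").set_index(["id", "txt"])

# Each index level holds every distinct value once ...
txt_level = merge_2.index.levels[1]
# ... and each row points into that level through an integer code.
txt_codes = merge_2.index.codes[1]

print(len(merge_2), "rows,", len(txt_level), "distinct txt values")
print(list(txt_level))
print(list(txt_codes))
```

The frame has ten rows, but the 'txt' level stores only the five distinct strings; the per-row cost is the integer code.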

Hierarchy of relational tables joined in a pandas dataframe
