Question

I have an Excel file that I am reading in, and it looks similar to this:

name        size    color   material        size    color   material    size    color   material
bob         m       red     coton           m         yellow  cotton      m         green   dri-fit
james       l       green   dri-fit         l         green   cotton      l         red     cotton
steve       l       green   dri-fit         l         green   cotton      l         red     cotton

I want to summarize all of my shirt types like this:

l green dri-fit   2
l red   coton     2
m red   coton     1

I am using pandas ExcelFile to read the file into a file object, and then using parse to parse the sheet into a DataFrame.

import pandas as pd
file = pd.ExcelFile('myexcelfile.xlsx')
df = file.parse('sheet1')

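To get the desired output, I am trying to use wide_to_long. The problem is that some of my columns share the same name, so pandas renames them when I read the file in. For example, the second instance of size automatically becomes size.1, and the same happens for color and material. If I try to use the stubnames with wide_to_long, it complains about the first instance of size: "stubname can't be identical to a column name".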

Is there any way to use wide_to_long before pandas renames my columns?

Answer 1

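The column numbering is problematic for pd.wide_to_long: the first instance of each column has no suffix, so its name is identical to the stub. We need to rename the first instance of each column by appending .0, so that it no longer conflicts with the stubs.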

Sample data

import pandas as pd
df = pd.read_clipboard() 
print(df)

    name size  color material size.1 color.1 material.1 size.2 color.2 material.2
0    bob    m    red    coton      m  yellow     cotton      m   green    dri-fit
1  james    l  green  dri-fit      l   green     cotton      l     red     cotton
2  steve    l  green  dri-fit      l   green     cotton      l     red     cotton

Code:

stubs = ['size', 'color', 'material']
d = {x: f'{x}.0' for x in stubs}
df.columns = [d.get(k, k) for k in df.columns]

res = pd.wide_to_long(df, i='name', j='num', sep='.', stubnames=stubs)
#          size   color material
#name  num                      
#bob   0      m     red    coton
#james 0      l   green  dri-fit
#steve 0      l   green  dri-fit
#bob   1      m  yellow   cotton
#james 1      l   green   cotton
#steve 1      l   green   cotton
#bob   2      m   green  dri-fit
#james 2      l     red   cotton
#steve 2      l     red   cotton

res.groupby([*res]).size()
#size  color   material
#l     green   cotton      2
#              dri-fit     2
#      red     cotton      2
#m     green   dri-fit     1
#      red     coton       1
#      yellow  cotton      1
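For readers who cannot use read_clipboard, here is a minimal self-contained sketch of the same approach. The inline DataFrame is a hand-typed copy of the sample data with the suffixed column names pandas produces on read, and the names long_df and counts are only illustrative.

```python
import pandas as pd

# Hand-typed copy of the sample data, using the column names pandas
# assigns to duplicated headers on read (size, size.1, size.2, ...).
df = pd.DataFrame(
    [
        ["bob", "m", "red", "coton", "m", "yellow", "cotton", "m", "green", "dri-fit"],
        ["james", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
        ["steve", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
    ],
    columns=[
        "name", "size", "color", "material",
        "size.1", "color.1", "material.1",
        "size.2", "color.2", "material.2",
    ],
)

stubs = ["size", "color", "material"]

# Give the first instance of each stub an explicit ".0" suffix so that
# no column name is identical to a stub name.
df = df.rename(columns={s: f"{s}.0" for s in stubs})

# Reshape to one row per (name, num) pair, i.e. one row per shirt.
long_df = pd.wide_to_long(df, stubnames=stubs, i="name", j="num", sep=".")

# Count identical (size, color, material) combinations.
counts = long_df.groupby(stubs).size().reset_index(name="count")
print(counts)
```

Using rename with a dict leaves the name column and the already suffixed columns untouched, which is the same effect as the list comprehension over df.columns above.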

Answer 2

`value_counts`

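import numpy as np
# df must keep the duplicated names so that df.get('size') returns all size columns
cols = ['size', 'color', 'material']
s = pd.Series([*zip(*map(np.ravel, map(df.get, cols)))]).value_counts()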

(l, red, cotton)       2
(l, green, cotton)     2
(l, green, dri-fit)    2
(m, green, dri-fit)    1
(m, yellow, cotton)    1
(m, red, coton)        1
dtype: int64

`Counter`

And this one is more to my liking:

from collections import Counter

s = pd.Series(Counter([*zip(*map(np.ravel, map(df.get, cols)))]))
s.rename_axis(['size', 'color', 'material']).reset_index(name='freq')

  size   color material  freq
0    m     red    coton     1
1    m  yellow   cotton     1
2    m   green  dri-fit     1
3    l   green  dri-fit     2
4    l   green   cotton     2
5    l     red   cotton     2
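Both snippets in this answer only produce nine tuples when df still carries the duplicated column names, so that selecting 'size' returns every size column at once. The sketch below makes that assumption explicit by building such a frame by hand. The names rows, counter and freq are illustrative. One way to get a frame like this from a file is probably to read it with header=None and promote the first row to the header yourself, but that step is not shown here.

```python
from collections import Counter

import pandas as pd

cols = ["size", "color", "material"]

# Assumption: the frame keeps the duplicated headers (no size.1, size.2).
rows = [
    ["bob", "m", "red", "coton", "m", "yellow", "cotton", "m", "green", "dri-fit"],
    ["james", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
    ["steve", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
]
df = pd.DataFrame(rows, columns=["name"] + cols * 3)

# df[c] is a 3-column DataFrame for each duplicated name; flattening each
# one row by row keeps the size, color and material of a shirt aligned.
flat = [df[c].to_numpy().ravel() for c in cols]
counter = Counter(zip(*flat))

# Turn the counter into a tidy frame with one row per shirt type.
freq = pd.DataFrame(
    [(*shirt, n) for shirt, n in counter.items()],
    columns=cols + ["freq"],
)
print(freq)
```

If the frame has the renamed columns instead, df[c] is a single Series and only the first shirt of each person would be counted.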

Answer 3

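The following code:

import pandas as pd

df = pd.read_excel('C:/Users/me/Desktop/sovrflw_data.xlsx')
df.drop('name', axis=1, inplace=True)
# Each row holds three shirts; reshape so that every row is one (size, color, material) triple
arr = df.values.reshape(-1, 3)
df2 = pd.DataFrame(arr, columns=['size','color','material'])
# Add a helper column so that count() has something to count per group
df2['count']=1
df2.groupby(['size','color','material'],as_index=False).count()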
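The reshape approach can be tried without the Excel file. The sketch below is a self-contained variant under two assumptions: the inline frame is a hand-typed copy of the sample data, and the columns after name always repeat in the order size, color, material. It uses groupby(...).size() in place of the helper count column, and the names shirts and summary are illustrative.

```python
import pandas as pd

# Hand-typed copy of the sample data with the renamed columns.
df = pd.DataFrame(
    [
        ["bob", "m", "red", "coton", "m", "yellow", "cotton", "m", "green", "dri-fit"],
        ["james", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
        ["steve", "l", "green", "dri-fit", "l", "green", "cotton", "l", "red", "cotton"],
    ],
    columns=[
        "name", "size", "color", "material",
        "size.1", "color.1", "material.1",
        "size.2", "color.2", "material.2",
    ],
)

cols = ["size", "color", "material"]

# Drop the name column and fold every group of three values into one row.
# This relies only on column position, so the renamed headers do not matter.
values = df.drop(columns="name").to_numpy().reshape(-1, 3)
shirts = pd.DataFrame(values, columns=cols)

# size() counts rows per group; reset_index names the resulting column.
summary = shirts.groupby(cols).size().reset_index(name="count")
print(summary)
```

Because the reshape is purely positional, it gives wrong triples if the sheet ever has a column out of order or an extra column between shirts.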
