Question

I've run into a strange problem. I have a DataFrame like the one below, which was read from a CSV file.

   AgeGroups     Factor Cancer  Frequency
0        0_5  wo-statin    Yes          0
1       6_10  wo-statin    Yes          0
2      11_15  wo-statin    Yes          1
3      16_20  wo-statin    Yes          1
4      21_25  wo-statin    Yes         23
5      26_30  wo-statin    Yes         50
6      31_35  wo-statin    Yes         70
7      36_40  wo-statin    Yes        107
8      41_45  wo-statin    Yes        168
9      46_50  wo-statin    Yes        412
10     51_55  wo-statin    Yes        503
11     56_60  wo-statin    Yes        646
12     61_65  wo-statin    Yes        635
13     66_70  wo-statin    Yes        725
14     71_75  wo-statin    Yes        771
15     76_80  wo-statin    Yes        421
16     81_85  wo-statin    Yes        181
17     86_90  wo-statin    Yes         57
18     91_95  wo-statin    Yes          4
19    96_100  wo-statin    Yes          4
..
60       0_5   w-statin    Yes          0
61      6_10   w-statin    Yes          0
62     11_15   w-statin    Yes          0
63     16_20   w-statin    Yes          0
64     21_25   w-statin    Yes          0
65     26_30   w-statin    Yes          0
66     31_35   w-statin    Yes          0
67     36_40   w-statin    Yes          0
68     41_45   w-statin    Yes          0
69     46_50   w-statin    Yes         10
70     51_55   w-statin    Yes         17
71     56_60   w-statin    Yes         24
72     61_65   w-statin    Yes         50
73     66_70   w-statin    Yes        113
74     71_75   w-statin    Yes        198
75     76_80   w-statin    Yes        105
76     81_85   w-statin    Yes         37
77     86_90   w-statin    Yes         18
78     91_95   w-statin    Yes          2
79    96_100   w-statin    Yes          0

I want to do some statistical analysis based on the bar plots produced by the following code:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('file:///C:/Users/out.CSV')
ages= df.AgeGroups.unique()

grp = df.groupby(['AgeGroups','Factor','Cancer']).Frequency.sum()

counts = grp.unstack(level=[2])
by_factor = counts.groupby(level='Factor')


k = by_factor.ngroups

fig, axes = plt.subplots(1, k, sharex=True, sharey=True, figsize=(15, 8))
for i, (gname, grp) in enumerate(by_factor):
    grp.xs(gname, level='Factor').plot.bar(
        stacked=True, rot=45, ax=axes[i], title=gname)
_ = axes.set_xticklabels(grp['AgeGroups'])
#for ax in axes: ax.set_ylim([0,25000])
fig.tight_layout()

I was quite happy that everything was going well until I noticed that the 6_10 age group ends up in the middle of the plot.

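If I edit the table by hand, for example changing 0_5 to 00-05, 6_10 to 06-10, and so on, the plot shows the groups in the right order. The problem is that I have about 2k of these CSV files, so editing each table manually is practically impossible. I can provide more information if needed. Thanks a lot for your help. :)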

Answer 1

It may be worth replacing the strings in the files themselves. This can be done as follows:

import glob

files = glob.glob("data/*.csv")

for filename in files:
    # Read in the file
    with open(filename, 'r') as f :
        filedata = f.read()

    # Replace the single-digit labels with zero-padded ones.
    # The leading spaces keep ' 6_10' from matching inside '96_100'.
    filedata = filedata.replace('  0_5', '00_05')
    filedata = filedata.replace(' 6_10', '06_10')

    # Write the file out again
    with open(filename, 'w') as f:
        f.write(filedata)

If you need the originals for any reason, don't forget to back up the files first.
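
The two replacements above only cover the single-digit groups. A more general variant can be sketched with a regular expression, assuming that `AgeGroups` is the first field on every row and that the fields are comma-separated. The names `AGE_GROUP` and `pad_age_groups` are made up for this example, and the sample text is illustrative.

```python
import re

# Matches an age-group label such as "0_5" or "96_100" at the start of a line.
# Assumption: AgeGroups is the first column of every CSV row.
AGE_GROUP = re.compile(r"^(\d+)_(\d+)", flags=re.MULTILINE)


def pad_age_groups(text, width=2):
    """Zero-pad both bounds of every leading age-group label in CSV text."""
    def pad(match):
        low, high = match.groups()
        # zfill never truncates, so "100" stays "100" when width is 2.
        return f"{low.zfill(width)}_{high.zfill(width)}"
    return AGE_GROUP.sub(pad, text)


# Illustrative input only, shaped like the table in the question.
sample = (
    "AgeGroups,Factor,Cancer,Frequency\n"
    "0_5,wo-statin,Yes,0\n"
    "6_10,wo-statin,Yes,0\n"
    "96_100,wo-statin,Yes,4\n"
)
padded = pad_age_groups(sample)
print(padded)
```

Inside the loop above, `filedata = pad_age_groups(filedata)` would take the place of the individual `replace` calls. Because the pattern is anchored to the start of a line, a label like `96_100` is matched as a whole and not partially.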

Answer 2

You may want to add a leading `0` to the `AgeGroups` column (as @ImportanceOfBeingErnest has already suggested).

But I would do it the pandas way:

{{1}}

Result:

{{1}}
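
A minimal sketch of zero-padding on the pandas side, which is not necessarily the exact approach this answer had in mind. It assumes every label has the form `low_high`. The small frame below is illustrative data, not the asker's file.

```python
import pandas as pd

# Illustrative data only, shaped like the AgeGroups column in the question.
df = pd.DataFrame({
    "AgeGroups": ["0_5", "6_10", "11_15", "96_100"],
    "Frequency": [0, 0, 1, 4],
})

# Split "6_10" into its two bounds, zero-pad each to two digits, and rejoin.
bounds = df["AgeGroups"].str.split("_", expand=True)
df["AgeGroups"] = bounds[0].str.zfill(2) + "_" + bounds[1].str.zfill(2)

# After padding, plain string sorting agrees with the numeric order.
ordered = sorted(df["AgeGroups"])
print(df)
```

Run right after `pd.read_csv`, this leaves the files on disk untouched. `groupby` sorts its keys as strings by default, so the padded labels should then come out in ascending age order on the x-axis.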

x-axis labels are not sorted in the right order (matplotlib/pandas)
