How to use a variable as a plot title in pandas

Date: 2017-11-02 15:23:43

Tags: python pandas csv matplotlib

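I need to use a variable as the title of a plot in pandas. I have one main CSV file, and based on the resource_id values in it I create several CSV files and plots.

Sample content of my CSV file: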

Access_Stat_ID,Resource_ID,Range_Start,Range_End,Name,Format,Number,Matched_URL
6890859,10020,"2014-05-01 00:00:00","2014-05-31 23:59:59","May 2014","html",89,"/dissertationen/biologie/behrend-anke/HTML/behrend-vita.html"
6890860,10021,"2014-05-01 00:00:00","2014-05-31 23:59:59","May 2014","pdf",30,"/dissertationen/biologie/dreier-lars/PDF/Dreier.pdf"
6890861,10021,"2014-05-01 00:00:00","2014-05-31 23:59:59","May 2014","entry",2,"?"
6890862,10021,"2014-05-01 00:00:00","2014-05-31 23:59:59","May 2014","html",11,"/dissertationen/biologie/dreier-lars/HTML/chapter4.html"

This is my code:

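import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# note: error_bad_lines was replaced by on_bad_lines='skip' in pandas 1.3 and removed in 2.0
df = pd.read_csv('dbo.Access_Stat_all.csv', error_bad_lines=False, usecols=['Range_Start', 'Format', 'Resource_ID', 'Number'])
uniquevalues = np.unique(df[['Resource_ID']].values)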

for resource_id in uniquevalues:
    df1 = df[df['Resource_ID'] == resource_id]
    df1 = df1[['Format', 'Range_Start', 'Number']]
    #truncate the date to only take month and year
    df1["Range_Start"] = df1["Range_Start"].str[:7]
    df1 = df1.groupby(['Format', 'Range_Start'], as_index=True).last()
    pd.options.display.float_format = '{:,.0f}'.format
    df1 = df1.unstack()
    df1.columns = df1.columns.droplevel()
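    # Index.contains() was removed in pandas 1.0; the `in` operator works in all versions
    if 'entry' in df1.index:
        df2 = df1[1:4].sum(axis=0)
    else:
        df2 = df1[0:3].sum(axis=0)
    df2.name = 'sum'
    # DataFrame.append() was removed in pandas 2.0; pd.concat adds the 'sum' row instead
    df2 = pd.concat([df1, df2.to_frame().T])
    df2.to_csv('csv_files/' + str(resource_id) + '.csv', sep="\t", float_format='%.f')
    if 'entry' in df2.index:
        df3 = df2.T[['entry', 'sum']].copy()
    else:
        df3 = df2.T[['sum']].copy()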

    # convert index to use pandas datetime format
    df3.index = pd.to_datetime(df3.index)

    # plot the data
    fig, ax = plt.subplots()
    plt.xticks(rotation=90)

    # use matplotlib date formatters
    years = mdates.YearLocator()  # every year
    yearsFmt = mdates.DateFormatter('%Y-%m')

    # format the major ticks
    ax.xaxis.set_major_locator(years)
    ax.xaxis.set_major_formatter(yearsFmt)

    ax.plot(df3)
    ax.legend(["Seitenzugriffe", "Dateiabrufe"])
    plt.tight_layout()
    xtl = [item.get_text()[:4] for item in ax.get_xticklabels()]
    ax.set_xticklabels(xtl)
    fig.savefig('plots/'+ str(resource_id) + '.png')
    plt.close('all')

Now I want each plot to have its specific resource_id and range_start as the title. How can I do that?

1 Answer:

Answer 0 (score: 0)

First, you already have resource_id defined in the for loop, correct? So you can use it as a variable when constructing your plot:

plt.title(resource_id)

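Each iteration of the for loop will produce a different title. With the data set you provided, resource_id should equal 10020 in the first iteration and 10021 in the next, so two plots will be created and saved. If this is unclear, see the Matplotlib tutorial for more examples of setting titles.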
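As a minimal sketch of this idea, the loop below gives each figure its own title. The two IDs come from the sample rows in the question; the `Agg` backend, the dummy y-values and the `titles` list are assumptions made only to keep the example self-contained, and nothing is saved to disk.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

resource_ids = [10020, 10021]  # the two IDs in the question's sample rows
titles = []  # hypothetical helper list, used only to inspect the result

for resource_id in resource_ids:
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [89, 30, 11])  # dummy values standing in for the real data
    # the ID is an integer, so convert it to a string explicitly
    ax.set_title(str(resource_id))
    titles.append(ax.get_title())
    plt.close(fig)

print(titles)
```

Here `ax.set_title` is the Axes-level counterpart of `plt.title`; either should work inside the loop from the question.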

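Second, for "Range Start": your data frame is already subset so that it contains only the rows for the relevant resource_id, so you can then loop over each Range_Start value: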

uniquevalues = np.unique(df[['Resource_ID']].values)

for resource_id in uniquevalues:
    df1 = df[df['Resource_ID'] == resource_id]
    df1 = df1[['Format', 'Range_Start', 'Number']]
    #truncate the date to only take month and year
    df1["Range_Start"] = df1["Range_Start"].str[:7]
    # use the per-resource subset df1 and the actual column name 'Range_Start'
    unique_range_starts = np.unique(df1["Range_Start"].values)
    for range_start in unique_range_starts:
        # all your code to construct the graph goes here....
        pass

Now each plot title can use both resource_id and range_start as variables:

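# resource_id is an integer, so format both values into one string instead of adding them
plt.title('{} {}'.format(resource_id, range_start))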
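Putting the two pieces together, a sketch along the following lines builds the title from both variables. It uses the column names from the question's CSV, but the inline sample rows, the `Agg` backend, the bar chart and the `titles` list are illustrative assumptions, and it swaps the nested loops for a single `groupby` on both columns. Because `resource_id` is an integer and `range_start` is a string, the two are combined with `str.format` rather than `+`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# a few rows modeled on the sample data in the question
df = pd.DataFrame({
    "Resource_ID": [10020, 10021, 10021, 10021],
    "Range_Start": ["2014-05-01 00:00:00"] * 4,
    "Format": ["html", "pdf", "entry", "html"],
    "Number": [89, 30, 2, 11],
})
# truncate the date to year and month, as in the question
df["Range_Start"] = df["Range_Start"].str[:7]

titles = []  # hypothetical helper list, used only to inspect the result
# one group per (resource_id, range_start) combination
for (resource_id, range_start), group in df.groupby(["Resource_ID", "Range_Start"]):
    fig, ax = plt.subplots()
    ax.bar(group["Format"].tolist(), group["Number"].tolist())
    ax.set_title("Resource {} ({})".format(resource_id, range_start))
    titles.append(ax.get_title())
    plt.close(fig)

print(titles)
```

If one plot per resource_id covering all months is preferred, as in the original code, the same formatting could be applied once per resource with the first and last Range_Start values instead.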