Counting with pandas (two different numbers in the same row)

Time: 2016-11-12 00:01:57

Tags: python pandas

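I have a csv file of parking fines that contains the month, the year, and the reason for each fine. I want to find the top ten reasons for getting a fine (Error section / main cause).

Note that some rows in the Error section / main cause column contain two different reasons for a fine (for example: 0401 Parking Prohibited Area failure to comply with a traffic sign ; 2200 Parking next to the marked parking space).

The code takes a long time to respond and then fails with an error (a long list).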

import pandas as pd
from StringIO import StringIO

df = pd.read_csv('Parkingfines.csv', parse_dates=True, 
                 index_col="Month of the error", usecols=["Month of the error", 
                 "Year of the error", "Error section / main cause"], 
                 names=["Month of the error", "Year of the error", "Error section / main cause"], header=0)

df = df['Error section / main cause'].agg(['count'])

Link for the csv file

Then I want to plot the number of fines per month (from January 2014 to the latest data). However, this part raises ValueError: Unknown string format

counts_per_month = df.groupby(by=['Year of the error', 
                                  'Month of the error', ]).agg('count')
counts_per_month.index = pd.to_datetime(
    [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
# flatten multiindex and convert to datetime

counts_per_month.plot()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    408             try:
--> 409                 values, tz = tslib.datetime_to_datetime64(arg)
    410                 return DatetimeIndex._simple_new(values, name=name, tz=tz)

pandas\tslib.pyx in pandas.tslib.datetime_to_datetime64 (pandas\tslib.c:29768)()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-9-81c15e474539> in <module>()
     50                                   'Month of the error', ]).agg('count')
     51 counts_per_month.index = pd.to_datetime(
---> 52     [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
     53 )
     54 # flatten multiindex and convert to datetime

C:\Users\Dream\Anaconda3\lib\site-packages\pandas\util\decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, coerce, unit, infer_datetime_format)
    289                         yearfirst=yearfirst,
    290                         utc=utc, box=box, format=format, exact=exact,
--> 291                         unit=unit, infer_datetime_format=infer_datetime_format)
    292 
    293 

C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, unit, freq, infer_datetime_format)
    425         return _convert_listlike(arg, box, format, name=arg.name)
    426     elif com.is_list_like(arg):
--> 427         return _convert_listlike(arg, box, format)
    428 
    429     return _convert_listlike(np.array([arg]), box, format)[0]

C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    410                 return DatetimeIndex._simple_new(values, name=name, tz=tz)
    411             except (ValueError, TypeError):
--> 412                 raise e
    413 
    414     if arg is None:

C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    396                     yearfirst=yearfirst,
    397                     freq=freq,
--> 398                     require_iso8601=require_iso8601
    399                 )
    400 

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41972)()

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41577)()

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41466)()

pandas\tslib.pyx in pandas.tslib.parse_datetime_string (pandas\tslib.c:31806)()

C:\Users\Dream\Anaconda3\lib\site-packages\dateutil\parser.py in parse(timestr, parserinfo, **kwargs)
   1162         return parser(parserinfo).parse(timestr, **kwargs)
   1163     else:
-> 1164         return DEFAULTPARSER.parse(timestr, **kwargs)
   1165 
   1166 

C:\Users\Dream\Anaconda3\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    553 
    554         if res is None:
--> 555             raise ValueError("Unknown string format")
    556 
    557         if len(res) == 0:

ValueError: Unknown string format

2 answers:

Answer 0 (score: 1)

First, your file is slightly corrupted: the following two lines should be merged into one:

255121 October;;
255122 ;2014;0701 Parking without p-recognized / p-unit / p-ticket

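Your file also appears to be Latin-1 encoded. Python 3 generally assumes text is UTF-8 (and pd.read_csv decodes files as UTF-8 unless told otherwise), while Python 2 assumes ASCII, so you have to state explicitly that your file is Latin-1.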

df = pd.read_csv('~/dl/parkingfines-2.csv', sep=';', 
                 encoding='latin-1', header=0)

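Also note that, as David Garwin mentioned, your separator is ; rather than , (the default), so you have to provide it explicitly. The other arguments you passed to pd.read_csv are not needed: the column names are taken from the first line of the file.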
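To see how the two arguments work together, here is a minimal sketch on an in-memory sample. The sample rows (including the non-ASCII word) are made up for illustration and are not taken from the real file; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the file layout: ';' separated, Latin-1 encoded.
raw = (
    "Month of the error;Year of the error;Error section / main cause\n"
    "January;2014;0401 Parking prohibited area\n"
    "February;2014;0701 Pysäköinti without p-ticket\n"
).encode("latin-1")

# sep=';' splits the columns; encoding='latin-1' decodes the bytes correctly.
sample = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin-1", header=0)
print(sample.columns.tolist())
print(sample.shape)
```

If encoding='latin-1' is left out here, the byte for "ä" is not valid UTF-8 and reading is expected to fail with a decoding error.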

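Next we have to deal with the fact that some fines have more than one cause. This can be handled in different ways. For example, we can replace each such record with several records (one new record per cause). This can be done as follows: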

# there are rows without causes, let's drop them
df.dropna(inplace=True)

# save index to use it later
df['idx'] = df.index

# find all rows whose cause contains ';' (meaning several
# causes are present)
multiples_mask = df['Error section / main cause'].str.contains(';')
multiples = df[multiples_mask]

# split each cause with ';' as separator
splitted = multiples['Error section / main cause'].str.split(';')

# create new dataframe
duplicated = []
for (i, row), spl in zip(multiples.iterrows(), splitted):
    for cause in spl:
        duplicated.append([row['Month of the error'], 
                           row['Year of the error'],
                           cause.strip(), i])

# combine the part of dataframe that contains only single causes
# with created new dataframe for several causes
df_with_dupes = pd.concat(
    [df[~ multiples_mask],
     pd.DataFrame(duplicated, columns=df.columns)], ignore_index=True)

# sort with idx
df_with_dupes.sort_values(by='idx', inplace=True)
df = df_with_dupes

# drop idx: we do not need it anymore
df.drop('idx', axis=1, inplace=True)
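As an alternative to the loop above, newer pandas versions (0.25 and later) provide DataFrame.explode, which can do the same "one row per cause" expansion. This is only a sketch on a made-up miniature dataframe; the variable names fines, cause_col and exploded are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature of the real data; the second row has two causes.
fines = pd.DataFrame({
    "Month of the error": ["January", "January", "February"],
    "Year of the error": [2014, 2014, 2014],
    "Error section / main cause": [
        "0701 Parking without p-ticket",
        "0401 Parking prohibited area ; 2200 Parking next to the marked parking space",
        "0401 Parking prohibited area",
    ],
})

cause_col = "Error section / main cause"

# Turn each cause string into a list of causes.
exploded = fines.copy()
exploded[cause_col] = exploded[cause_col].str.split(";")

# Give every list element its own row (month and year are repeated).
exploded = exploded.explode(cause_col)

# Remove whitespace left around the separator and renumber the rows.
exploded[cause_col] = exploded[cause_col].str.strip()
exploded = exploded.reset_index(drop=True)
print(exploded)
```

The original row order is preserved, so no helper idx column is needed in this variant.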

Now we can address your questions. As for the first one, finding the most common causes of fines, the following code works:

causes_counts = df['Error section / main cause'].value_counts()
causes_counts.sort_values(ascending=False, inplace=True)
print(causes_counts.head(10))

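As JohnE mentioned in the comments, you have to use value_counts() rather than agg(). Also note that in your code you replace the whole dataframe with the result of this command (df = df['Error section / main cause'].agg(['count']) means that df is replaced by the result computed on the right-hand side). Obviously, once you do that, you lose the initial dataframe and cannot access it in the following lines. So I used a different variable name to store the counts.

As for your second question, the following code works: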
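The difference between the two calls is easy to see on a tiny example. The cause codes below are made up, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical cause codes, one per fine.
causes = pd.Series(["0401", "0701", "0401", "2200", "0401", "0701"],
                   name="Error section / main cause")

# agg(['count']) collapses the column to one number: how many non-null entries.
total = causes.agg(["count"])
print(total)

# value_counts() tallies each distinct value, most frequent first.
per_cause = causes.value_counts()
print(per_cause.head(10))
```

Because both results are stored under new names, the original data stays available for later steps.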

counts_per_month = df.groupby(by=['Year of the error', 
                                  'Month of the error', ]).agg('count')
counts_per_month.index = pd.to_datetime(
    [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
# flatten multiindex and convert to datetime

counts_per_month.plot()
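If the date parser still struggles with the joined strings, a possible variation is to pass an explicit format to pd.to_datetime instead of letting it guess. This is a sketch on made-up data, assuming English month names (and an English or C locale for %B) and that the year may have been read as a float.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data (one row per fine).
fines = pd.DataFrame({
    "Year of the error": [2014.0, 2014.0, 2014.0, 2015.0],
    "Month of the error": ["January", "January", "October", "January"],
    "Error section / main cause": ["0401", "0701", "0401", "2200"],
})

# size() counts the rows in each (year, month) group.
counts_per_month = fines.groupby(
    ["Year of the error", "Month of the error"]).size()

# Build "2014 January"-style labels; int() drops a float year's trailing ".0".
labels = ["{} {}".format(int(year), month)
          for year, month in counts_per_month.index]

# An explicit format avoids relying on the parser guessing the layout.
counts_per_month.index = pd.to_datetime(labels, format="%Y %B")

# Sort chronologically so a later plot runs from oldest to newest month.
counts_per_month = counts_per_month.sort_index()
print(counts_per_month)
```

Calling counts_per_month.plot() on the result then draws one point per month.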


Answer 1 (score: 0)

This line overwrites df:

 df = df['Error section / main cause'].agg(['count'])

This line takes a single column and groups it:

 df = df['Month of the error'].groupby(df['Year of the error']).agg(['count'])

The top ten reasons should be:

 df.reasons.value_counts()

The number of fines per month should be:

 df.groupby("month").size()