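I have a CSV file of parking fines that contains the month, the year and the reason for each fine. I want to find the ten most common reasons for getting a fine (the Error section / main cause column).
Note that some rows in the Error section / main cause column list two different reasons for the fine (0401 Parking Prohibited Area failure to comply with a traffic sign ; 2200 Parking next to the marked parking space).
The code takes a long time to respond and then fails with an error (a long listing).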
import pandas as pd
from StringIO import StringIO
df = pd.read_csv('Parkingfines.csv', parse_dates=True,
index_col="Month of the error", usecols=["Month of the error",
"Year of the error", "Error section / main cause"],
names=["Month of the error", "Year of the error", "Error section / main cause"], header=0)
df = df['Error section / main cause'].agg(['count'])
Next I want to plot the number of fines per month (from January 2014 to the latest data). However, this part raises ValueError: Unknown string format.
counts_per_month = df.groupby(by=['Year of the error',
'Month of the error', ]).agg('count')
counts_per_month.index = pd.to_datetime(
[' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
# flatten multiindex and convert to datetime
counts_per_month.plot()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
408 try:
--> 409 values, tz = tslib.datetime_to_datetime64(arg)
410 return DatetimeIndex._simple_new(values, name=name, tz=tz)
pandas\tslib.pyx in pandas.tslib.datetime_to_datetime64 (pandas\tslib.c:29768)()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-9-81c15e474539> in <module>()
50 'Month of the error', ]).agg('count')
51 counts_per_month.index = pd.to_datetime(
---> 52 [' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
53 )
54 # flatten multiindex and convert to datetime
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\util\decorators.py in wrapper(*args, **kwargs)
89 else:
90 kwargs[new_arg_name] = new_arg_value
---> 91 return func(*args, **kwargs)
92 return wrapper
93 return _deprecate_kwarg
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, coerce, unit, infer_datetime_format)
289 yearfirst=yearfirst,
290 utc=utc, box=box, format=format, exact=exact,
--> 291 unit=unit, infer_datetime_format=infer_datetime_format)
292
293
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, unit, freq, infer_datetime_format)
425 return _convert_listlike(arg, box, format, name=arg.name)
426 elif com.is_list_like(arg):
--> 427 return _convert_listlike(arg, box, format)
428
429 return _convert_listlike(np.array([arg]), box, format)[0]
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
410 return DatetimeIndex._simple_new(values, name=name, tz=tz)
411 except (ValueError, TypeError):
--> 412 raise e
413
414 if arg is None:
C:\Users\Dream\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
396 yearfirst=yearfirst,
397 freq=freq,
--> 398 require_iso8601=require_iso8601
399 )
400
pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41972)()
pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41577)()
pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41466)()
pandas\tslib.pyx in pandas.tslib.parse_datetime_string (pandas\tslib.c:31806)()
C:\Users\Dream\Anaconda3\lib\site-packages\dateutil\parser.py in parse(timestr, parserinfo, **kwargs)
1162 return parser(parserinfo).parse(timestr, **kwargs)
1163 else:
-> 1164 return DEFAULTPARSER.parse(timestr, **kwargs)
1165
1166
C:\Users\Dream\Anaconda3\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
553
554 if res is None:
--> 555 raise ValueError("Unknown string format")
556
557 if len(res) == 0:
ValueError: Unknown string format
Answer 0 (score: 1)
First, your file is slightly corrupted: the following two lines should be merged into one:
255121 October;;
255122 ;2014;0701 Parking without p-recognized / p-unit / p-ticket
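Second, your file appears to be Latin-1 encoded. By default, Python 3 generally assumes text files are UTF-8 and Python 2 assumes ASCII, so you have to state explicitly that your file is Latin-1.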
df = pd.read_csv('~/dl/parkingfines-2.csv', sep=';',
encoding='latin-1', header=0)
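Also note that, as David Garwin mentioned, your separator is ; rather than , (the default), so you have to pass it explicitly. No other arguments to pd.read_csv are needed: the column names are taken from the first line of the file.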
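As an illustration of the read step above, here is a minimal sketch that parses a small in-memory sample instead of the real file. The sample rows and the names `raw` and `sample` are made up for this example; only the `sep` and `encoding` arguments come from the answer.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the real file:
# ';' separated and Latin-1 encoded ('é' is the single byte 0xE9 in Latin-1).
raw = (
    "Month of the error;Year of the error;Error section / main cause\n"
    "January;2014;0401 Parking Prohibited Area\n"
    "February;2014;0701 Parking without p-ticket (café zone)\n"
).encode("latin-1")

# Both the separator and the encoding are passed explicitly;
# the column names come from the first line of the data.
sample = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin-1", header=0)

print(sample.columns.tolist())
print(sample.shape)
```

If the `encoding` argument is left out, the 0xE9 byte is not valid UTF-8 and the read should fail with a decoding error, which is why the encoding has to be stated.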
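Next we have to deal with the fact that some fines have several causes. This can be handled in different ways. For example, we can replace each such record with several records, one new record per cause. This can be done as follows: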
# some rows have no cause at all; drop them
df.dropna(inplace=True)
# save the index so the original order can be restored later
df['idx'] = df.index
# find all rows whose cause contains ';' (i.e. several causes are listed)
multiples_mask = df['Error section / main cause'].str.contains(';')
multiples = df[multiples_mask]
# split each cause on ';'
splitted = multiples['Error section / main cause'].str.split(';')
# build one new record per individual cause
duplicated = []
for (i, row), spl in zip(multiples.iterrows(), splitted):
    for cause in spl:
        duplicated.append([row['Month of the error'],
                           row['Year of the error'],
                           cause.strip(), i])
# combine the single-cause rows with the new per-cause records
df_with_dupes = pd.concat(
    [df[~multiples_mask],
     pd.DataFrame(duplicated, columns=df.columns)], ignore_index=True)
# sort by idx to restore the original order
df_with_dupes.sort_values(by='idx', inplace=True)
df = df_with_dupes
# drop idx: it is no longer needed
df.drop('idx', axis=1, inplace=True)
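The same "one record per cause" idea can also be sketched without an explicit loop. This is a different technique from the code above, not a restatement of it: it relies on `DataFrame.explode`, which is available in pandas 0.25 and later. The sample data and the names `sample`, `col` and `exploded` are invented for illustration.

```python
import pandas as pd

col = "Error section / main cause"

# Hypothetical sample: one single-cause row, one two-cause row, one row without a cause.
sample = pd.DataFrame({
    "Month of the error": ["January", "January", "February"],
    "Year of the error": [2014, 2014, 2014],
    col: [
        "0401 Parking Prohibited Area",
        "0401 Parking Prohibited Area ; 2200 Parking next to the marked parking space",
        None,
    ],
})

exploded = (
    sample.dropna(subset=[col])                                # drop rows without a cause
          .assign(**{col: lambda d: d[col].str.split(";")})    # each cause string becomes a list
          .explode(col)                                        # one row per list element
          .reset_index(drop=True)
)
# Remove the whitespace that surrounded the ';' separator.
exploded[col] = exploded[col].str.strip()

print(exploded)
```

Because `explode` keeps the rows in their original order, no helper column such as `idx` is needed in this variant.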
Now we can turn to your actual questions. For the first one, finding the most common reasons for a fine, the following code works:
causes_counts = df['Error section / main cause'].value_counts()
causes_counts.sort_values(ascending=False, inplace=True)
print(causes_counts.head(10))
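As JohnE mentioned in the comments, you have to use value_counts() rather than agg(). Also note that in your code you replace the whole data frame with the result of this command (df = df['Error section / main cause'].agg(['count']) means: replace df with the result computed on the right-hand side). Once you do that, the initial data frame is lost, so you cannot access it in the following lines. That is why I used a different variable name to store the counts.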
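To make the difference between the two calls concrete, here is a small sketch on made-up data (the cause codes and the names `causes`, `total`, `per_cause` and `top` are placeholders, not values from the real file):

```python
import pandas as pd

# Hypothetical column of cause codes.
causes = pd.Series(
    ["0401", "2200", "0401", "0701", "0401", "2200"],
    name="Error section / main cause",
)

# agg(['count']) collapses the whole column into a single number:
# how many non-null entries it has.
total = causes.agg(["count"])

# value_counts() returns one count per distinct value, largest first.
per_cause = causes.value_counts()
top = per_cause.head(2)

print(total)
print(top)
```

Storing the results under new names leaves `causes` (or `df` in the real code) intact for the later grouping step.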
As for your second question, the following code works:
counts_per_month = df.groupby(by=['Year of the error',
'Month of the error', ]).agg('count')
counts_per_month.index = pd.to_datetime(
[' '.join(map(str, col)).strip() for col in counts_per_month.index.values]
)
# flatten multiindex and convert to datetime
counts_per_month.plot()
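A variant of the same flattening step, sketched on made-up data. It assumes the month column holds full English month names and that the process runs under an English or C locale, since `%B` is locale-dependent; the names `sample` and `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one row per fine.
sample = pd.DataFrame({
    "Year of the error": [2014, 2014, 2014, 2014],
    "Month of the error": ["January", "January", "February", "March"],
    "Error section / main cause": ["0401", "2200", "0401", "0701"],
})

# Number of rows (fines) per (year, month) pair.
counts = sample.groupby(["Year of the error", "Month of the error"]).size()

# Flatten the two-level index into labels like "2014 January" and parse them
# with an explicit format instead of leaving the format to inference.
counts.index = pd.to_datetime(
    [f"{year} {month}" for year, month in counts.index],
    format="%Y %B",
)

# groupby orders month names alphabetically, so sort chronologically before plotting.
counts = counts.sort_index()

print(counts)
```

Calling `counts.plot()` afterwards draws the monthly series in the same way as the code above.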
Answer 1 (score: 0)
This line overwrites df:
df = df['Error section / main cause'].agg(['count'])
This line takes one column and groups it:
df = df['Month of the error'].groupby(df['Year of the error']).agg(['count'])
The top ten reasons should be:
df.reasons.value_counts()
The number of fines per month should be:
df.groupby("month").size()