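I get a ValueError when I run the code below on the file ratings.dat. I tried the same code on another file that uses ',' as the separator and had no problems. However, pandas seems to fail when the separator is '::'.
Am I typing the code wrong?
Code: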
import pandas as pd
import numpy as np
r_cols = ['userId', 'movieId', 'rating']
r_types = {'userId': np.str, 'movieId': np.str, 'rating': np.float64}
ratings = pd.read_csv(
r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
sep='::', names=r_cols, usecols=range(3), dtype=r_types
)
m_cols = ['movieId', 'title']
m_types = {'movieId': np.str, 'title': np.str}
movies = pd.read_csv(
r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\movies.dat',
sep='::', names=m_cols, usecols=range(2), dtype=m_types
)
ratings = pd.merge(movies, ratings)
ratings.head()
ratings.dat:
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
Error output:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-a2649e528fb9> in <module>()
7 r'C:\\Users\\Admin\\OneDrive\\Documents\\_Learn!\\'
8 r'Learn Data Science\\Data Sets\\MovieLens\\ml-1m\\ratings.dat',
----> 9 sep='::', names=r_cols, usecols=range(3), dtype=r_types
10 )
11
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
273
274 # Create the parser.
--> 275 parser = TextFileReader(filepath_or_buffer, **kwds)
276
277 if (nrows is not None) and (chunksize is not None):
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
584
585 # might mutate self.engine
--> 586 self.options, self.engine = self._clean_options(options, engine)
587 if 'has_index_names' in kwds:
588 self.options['has_index_names'] = kwds['has_index_names']
C:\Anaconda3\lib\site-packages\pandas\io\parsers.py in _clean_options(self, options, engine)
663 msg += " (Note the 'converters' option provides"\
664 " similar functionality.)"
--> 665 raise ValueError(msg)
666 del result[arg]
667
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
Answer 0 (score: 3)
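If you read the last line of the traceback carefully, you will probably find the reason it fails. I have split it into two lines:
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators,
but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)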
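So the separator '::' is interpreted as a regular expression. As the Pandas documentation for sep says:
"Regular expressions are accepted and will force use of the python parsing engine"
(the key part being that the python parsing engine is forced).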
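Thus, Pandas falls back to the "python" engine to read the data. The second half of the error message then says that, because the python engine is used, dtype is ignored. (Presumably the C engine can apply dtype while parsing, whereas the python engine in this version of Pandas cannot.)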
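You can either remove the dtype argument from the call to read_csv (you will still get a warning), or do something about the separator.
The second option seems tricky: escaping the separator or using a raw string does not help. Apparently, any separator longer than one character is interpreted by Pandas as a regular expression. That may be an unfortunate decision on the Pandas side.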
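For the first option, the error message itself points at converters as a stand-in for dtype. The following is only a minimal sketch: it assumes a small in-memory sample in place of ratings.dat, and the column name 'timestamp' for the fourth field is a hypothetical label that is not in the original post. Passing engine='python' explicitly should also avoid the fallback warning.

```python
import io

import pandas as pd

# In-memory stand-in for ratings.dat (rows copied from the question).
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
)

# 'timestamp' is an assumed name for the unused fourth field.
all_cols = ['userId', 'movieId', 'rating', 'timestamp']
r_cols = ['userId', 'movieId', 'rating']

ratings = pd.read_csv(
    sample,
    sep='::',
    engine='python',       # chosen explicitly, so no fallback is needed
    names=all_cols,
    usecols=r_cols,
    # converters take the place of dtype: one callable per column
    converters={'userId': str, 'movieId': str, 'rating': float},
)
print(ratings)
```

Newer Pandas releases may accept dtype with the python engine as well; the sketch does not rely on that.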
One way to avoid all this is to use a single ':' as the separator and skip every second (empty) column. For example:
ratings = pd.read_csv(filename, sep=':', names=r_cols,
usecols=[0, 2, 4], dtype=r_types)
(Or use usecols=range(0, 5, 2) if you are set on using range.)
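To see why the column indices are 0, 2 and 4, here is a small self-contained sketch of the single-':' approach. It assumes an in-memory sample in place of ratings.dat and uses plain str and 'float64' in the dtype mapping instead of the NumPy aliases from the question; both choices are mine, not the original poster's.

```python
import io

import pandas as pd

# In-memory stand-in for ratings.dat (rows copied from the question).
sample = io.StringIO(
    "1::1287::5::978302039\n"
    "1::2804::5::978300719\n"
    "1::594::4::978302268\n"
)

r_cols = ['userId', 'movieId', 'rating']
r_types = {'userId': str, 'movieId': str, 'rating': 'float64'}

# Splitting "1::1287::5::978302039" on ':' gives 7 fields:
# '1', '', '1287', '', '5', '', '978302039'
# so the wanted values sit at positions 0, 2 and 4.
ratings = pd.read_csv(sample, sep=':', names=r_cols,
                      usecols=[0, 2, 4], dtype=r_types)
print(ratings.dtypes)
```

Because ':' is a single character, the default C engine is used and dtype is honored.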
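The OP correctly raises a point about fields that themselves contain a single : character. Perhaps there is a way around that, but otherwise you can use a two-step approach with numpy's genfromtxt instead: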
# genfromtxt requires a proper numpy dtype, not a dict
# for Python 3, we need U10 for strings
dtype = np.dtype([('userId', 'U10'), ('movieId', 'U10'),
                  ('rating', np.float64)])
# a plain string delimiter such as '::' is split literally, not as a regex
data = np.genfromtxt(filename, dtype=dtype, names=r_cols,
                     delimiter='::', usecols=list(range(3)))
ratings = pd.DataFrame(data)
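The same two-step idea can be sketched without genfromtxt by splitting each line on the literal '::' in plain Python. This is a swapped-in alternative rather than the method from the answer, and the two sample rows are illustrative lines in the style of movies.dat, not data quoted in the post. It shows that a single ':' inside a title survives intact.

```python
import io

import pandas as pd

# Illustrative rows in the style of movies.dat; the second title
# contains a single ':' that must not be treated as a separator.
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "260::Star Wars: Episode IV - A New Hope (1977)::Action|Adventure\n"
)

m_cols = ['movieId', 'title']

# Step 1: split every line on the literal two-character separator
# and keep only the first two fields.
rows = [line.rstrip('\n').split('::')[:2] for line in sample]

# Step 2: build the DataFrame; both columns stay as strings.
movies = pd.DataFrame(rows, columns=m_cols)
print(movies)
```

For a file on disk, the list comprehension would iterate over an open file object instead of the StringIO buffer.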