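I have two CSV files (about 4 GB each), and I want to find the differences between their entries.
Suppose row 1 of 1.csv does not match row 1 of 2.csv but is identical to row 100 of 2.csv; in that case no difference should be reported.
A difference should be reported only when an entry in one CSV file has no identical entry anywhere in the other. Constraint: I cannot use any database.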
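To make the requirement concrete, here is a minimal sketch of the comparison I mean, on tiny made-up data. The two CSV strings and the helper name rows_as_set are hypothetical and only for illustration. It uses plain Python sets, so it ignores duplicate rows and would not scale to 4 GB files; it only shows the expected result.

```python
import csv
import io

# Hypothetical miniature stand-ins for 1.csv and 2.csv (made-up data).
CSV_1 = "id,name\n1,alice\n2,bob\n3,carol\n"
CSV_2 = "id,name\n3,carol\n1,alice\n4,dave\n"


def rows_as_set(text):
    """Parse CSV text and return its data rows as a set of tuples."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return {tuple(row) for row in reader}


rows_1 = rows_as_set(CSV_1)
rows_2 = rows_as_set(CSV_2)

# Row position is irrelevant: a row counts as a difference only if the
# other file contains no identical row anywhere.
only_in_1 = rows_1 - rows_2
only_in_2 = rows_2 - rows_1

print("only in 1.csv:", sorted(only_in_1))
print("only in 2.csv:", sorted(only_in_2))
```

Here the rows for alice and carol appear in both files at different positions, so they are not reported; only the bob row (missing from the second file) and the dave row (missing from the first) are.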
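I am loading these files with dask.dataframe, but I could not find any API or function in the Dask documentation for finding differences.
I also cannot convert the Dask DataFrames to pandas DataFrames, nor can I write them out to a text or CSV file.
Is there any way to compare files this large and find the differences?
Here is the sample code I tried.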
import dask.dataframe as dd
import numpy.testing as npt
import pandas as pd
filename1 = '/Users/saikatbhattacharjee/Downloads/2008.csv'
df1 = dd.read_csv(filename1, assume_missing=True)
filename2 = '/Users/saikatbhattacharjee/Downloads/2009.csv'
df2 = dd.read_csv(filename2, assume_missing=True)
def assert_frames_equal(actual, expected, use_close=False):
    """
    Compare DataFrame items by index and column and
    report each item that is not equal.
    Raises AssertionError if a row or column label is missing.
    Ordering is unimportant, items are compared only by label.
    NaN and infinite values are supported.
    Parameters
    ----------
    actual : pandas.DataFrame
    expected : pandas.DataFrame
    use_close : bool, optional
        If True, use numpy.testing.assert_allclose instead of
        numpy.testing.assert_equal.
    """
    if use_close:
        comp = npt.assert_allclose
    else:
        comp = npt.assert_equal
    assert (isinstance(actual, pd.DataFrame) and
            isinstance(expected, pd.DataFrame)), \
        'Inputs must both be pandas DataFrames.'
    identical = True
    for i, exp_row in expected.iterrows():
        assert i in actual.index, 'Expected row {!r} not found.'.format(i)
        act_row = actual.loc[i]
        for j, exp_item in exp_row.items():
            assert j in act_row.index, \
                'Expected column {!r} not found.'.format(j)
            act_item = act_row[j]
            # comp returns None on success and raises AssertionError on a
            # mismatch, so it cannot be used directly as an if-condition.
            try:
                comp(act_item, exp_item)
            except AssertionError:
                identical = False
                print('Difference at column {!r}, row {!r}'.format(j, i))
    if identical:
        print("CSV files are identical")
# df1 and df2 are Dask DataFrames; the next line raises the ValueError shown below.
actual = pd.DataFrame(df1)
expected = pd.DataFrame(df2)
assert_frames_equal(actual, expected)
I get this error:
File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/saikatbhattacharjee/.spyder-py3/temp.py", line 52, in <module>
actual = pd.DataFrame(df1)
File "/Users/saikatbhattacharjee/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 354, in __init__
raise ValueError('DataFrame constructor not properly called!')
ValueError: DataFrame constructor not properly called!