I have large CSV files, each over 10 MB, and about 50 of them. Each input has 25+ columns and 50K+ rows.
All of them have the same header, and I am trying to merge them into a single CSV with the header appearing only once.
Option one. Code: works for small CSVs (25+ columns, but file sizes only in KBs).
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
But the above code does not work for the larger files and throws an error.
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
all_files = glob.glob("*.csv", encoding='utf8', engine='python')
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py
Traceback (most recent call last):
File "merge_large.py", line 10, in <module>
df = pd.read_csv(file_,index_col=None, header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Option four. Code: 25+ columns, but file size over 10 MB.
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'
I have searched extensively, but I cannot find a way to concatenate large CSV files with the same header into a single file.
Edit:
Code:
import dask.dataframe as dd
ddf = dd.read_csv('*.csv')
ddf.to_csv('master.csv',index=False)
Error:
Traceback (most recent call last):
File "merge_csv_dask.py", line 5, in <module>
ddf.to_csv('master.csv',index=False)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
compute(*values)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
coerce_dtypes(df, dtypes)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
raise_on_error=raise_on_error, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
values=values, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
return arr.astype(dtype)
Answer (score: 3)
If I understand your problem correctly, you have large csv files with the same structure that you want to merge into one big CSV file.
My suggestion is to use dask from Continuum Analytics to handle this job. You can merge the files, but you can also perform out-of-core computation and analysis of the data, just like pandas.
### make sure you include the [complete] tag
pip install dask[complete]
First, check the version of dask. For me, dask = 0.11.0 and pandas = 0.18.1.
import dask
import pandas as pd
print (dask.__version__)
print (pd.__version__)
Here is the code that reads in all of your csvs. I got no errors when using the DropBox sample data.
import dask.dataframe as dd
from dask.delayed import delayed
import dask.bag as db
import glob
filenames = glob.glob('/Users/linwood/Downloads/stack_bundle/rio*.csv')
'''
The key to getting around the CParse error was using sep=None
Came from this post
http://stackoverflow.com/questions/37505577/cparsererror-error-tokenizing-data
'''
# custom reader function; sep=None lets the parser sniff each file's delimiter
def reader(filename):
    return pd.read_csv(filename, sep=None)
# build list of delayed pandas csv reads; then read in as dask dataframe
dfs = [delayed(reader)(fn) for fn in filenames]
df = dd.from_delayed(dfs)
# print the count of values in each column; perfect data would have the same count
# you have dirty data as the counts will show
print (df.count().compute())
The next step is to do some pandas-like analysis. Below is my code that first "cleans" the data in the 'tweetFavoriteCt' column. Not all of the values are integers, so I replace the strings with "0" and convert everything else to an integer. Once I have the integer conversion, I show a simple analysis where I filter the entire dataframe down to the rows where favoriteCt is greater than 3.
# function to convert numbers to integer and replace string with 0; sample analytics in dask dataframe
# you can come up with your own..this is just for an example
def conversion(value):
    try:
        return int(value)
    except:
        return int(0)
# apply the function to the column, create a new column of cleaned data
clean = df['tweetFavoriteCt'].apply(conversion, meta=('tweetFavoriteCt', int))
# set new column equal to our cleaning code above; your data is dirty :-(
df['cleanedFavoriteCt'] = clean
This last piece of code shows the dask analytics and how to load the merged file into pandas and also write the merged file to disk. Be warned: if you have tons of CSVs, the .compute() call below loads the merged csv into memory.
# retrieve the 50 tweets with the highest favorite count
print(df.nlargest(50,['cleanedFavoriteCt']).compute())
# only show me the tweets that have been favorited at least 3 times
# TweetID 763525237166268416, is VERRRRY popular....7000+ favorites
print((df[df.cleanedFavoriteCt.apply(lambda x: x > 3, meta=('cleanedFavoriteCt', bool))]).compute())
'''
This is the final step. The .compute() code below turns the
dask dataframe into a single pandas dataframe with all your
files merged. If you don't need to write the merged file to
disk, I'd skip this step and do all the analysis in
dask. Get a subset of the data you want and save that.
'''
df = df.reset_index().compute()
df.to_csv('./test.csv')
Now, if you want to switch to pandas for the merged csv file:
import pandas as pd
dff = pd.read_csv('./test.csv')
Let me know if this works.
Stop here
The first step is to make sure dask is installed. There are install instructions for dask in the documentation page, but this should work:
pip install dask[complete]
Once dask is installed, it's easy to read in the files.
Some housekeeping first. Assume we have a directory full of csvs where the filenames are my18.csv, my19.csv, my20.csv, etc. Name standardization and a single directory location are key: this approach works if you put your csv files in one directory and serialize the names in some way.
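As a quick sanity check before running the steps below (this is my own sketch; it assumes the ./daskTest/ directory used in the code that follows), print what the wildcard actually matches:
import glob
# list the files the wildcard matches; the same './daskTest/my*.csv' pattern is handed to dd.read_csv below
print(sorted(glob.glob('./daskTest/my*.csv')))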
Step by step:
1. Read all of the csv files into one dask.dataframe object. You can perform pandas-like operations immediately after this step if you want.
import dask.dataframe as dd
ddf = dd.read_csv('./daskTest/my*.csv')
ddf.describe().compute()
2. Write the merged dataframe to disk as master.csv (here in the same ./daskTest/ directory):
ddf.to_csv('./daskTest/master.csv', index=False)
3. Read master.csv back into a dask.dataframe object for computations. This can also be done right after step one above; dask can perform pandas-like operations on the staged files... this is one way to do "big data" in Python.
# reads in the merged file as one BIG out-of-core dataframe; can perform functions like pandas
newddf = dd.read_csv('./daskTest/master.csv')
# check the length; this is now the length of all merged files. In this example, 50,000 rows times 11 files = 550,000 rows.
len(newddf)
# perform pandas-like summary stats on the entire dataframe
newddf.describe().compute()
Hopefully this helps answer your question. In three steps you read in all the files, merge them into a single dataframe, and write that massive dataframe to disk with one header and all of your rows.
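Finally, if all you actually need is the concatenation itself, written with the header once and without loading everything into memory, a plain-Python streaming copy works too. This is only a rough sketch on top of the answer above: the output name merged_output.csv is a placeholder, and it assumes every input really does share an identical header line.
import glob

# build the input list first, so the output file created below cannot sneak into it
filenames = sorted(glob.glob('*.csv'))

with open('merged_output.csv', 'w') as out:  # placeholder output name
    for i, filename in enumerate(filenames):
        with open(filename) as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # write the shared header exactly once
            for line in f:  # stream the remaining rows; only one line is in memory at a time
                out.write(line)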