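Split columns and write them to separate output files

Time: 2016-05-17 23:43:03

Tags: python python-2.7 file pandas dataset

I have a dataset with 8 columns and about 5 million rows. The file is larger than 400 MB. I want to split the columns into separate files. The file extension is .dat, and the columns are separated by a single space.

Input: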

00022d3f5b17 00022d9064bc 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022dba8f51 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00022de1c6c1 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 003065f30f37 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b48a3b6 1073260801 1073260803 819251 440006 819251 440006
00022d9064bc 00904b83a0ea 1073260803 1073260810 819213 439954 819213 439954
00904b4557d3 00904b85d3cf 1073260803 1073261920 817526 439458 817526 439458
00022de73863 00904b14b494 1073260804 1073265410 817558 439525 817558 439525

Code:

import pandas as pd 

df = pd.read_csv('sorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])

#print df

df.to_csv('output_1.csv', columns = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2'])

df.to_csv('output_2.csv', columns = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']) 

The expected output is one file containing col[1], col[3], col[4], col[5], col[6] and another file containing col[2], col[3], col[4], col[7], col[8].

I get this error:

Traceback (most recent call last):
  File "split_col_pandas.py", line 3, in <module>
    df = pd.read_csv('dartmouthsorted.dat', sep=' ', header=None, names=['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4'])
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 823, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 224, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 360, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5241, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3999, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4076, in form_blocks
    int_blocks = _multi_blockify(int_items)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4145, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4188, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError

2 answers:

Answer 0 (score: 1)

Try this:

columns = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']
df[columns].to_csv('output_1.csv')

columns = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']
df[columns].to_csv('output_2.csv')

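Also, take a look at this post on Python memory errors: Memory errors and list limits?

Update (edit)

The original poster also asked how to recombine output_1.csv and output_2.csv after saving them, so that id_1 and id_2 end up in the same column, gps_1 and gps_3 become one column, and gps_2 and gps_4 become one column.

There are many ways to do this; here is one (favoring readability over efficiency):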
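
The traceback in the question is raised inside read_csv, so the whole file may not fit in memory as a single DataFrame. One possible workaround, sketched here rather than taken from this answer, is to read the file in chunks with the chunksize parameter of pd.read_csv and append each chunk to the two outputs. The function name split_in_chunks, the chunk size of 100000, the string dtype for the id columns, and index=False are all assumptions made for illustration.

```python
import pandas as pd

NAMES = ['id_1', 'id_2', 'time_1', 'time_2', 'gps_1', 'gps_2', 'gps_3', 'gps_4']
COLS_1 = ['id_1', 'time_1', 'time_2', 'gps_1', 'gps_2']
COLS_2 = ['id_2', 'time_1', 'time_2', 'gps_3', 'gps_4']


def split_in_chunks(src, out_1, out_2, chunksize=100000):
    """Read src in chunks and append each chunk's columns to the two outputs.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    reader = pd.read_csv(
        src, sep=' ', header=None, names=NAMES,
        dtype={'id_1': str, 'id_2': str},  # keep ids as text (assumption)
        chunksize=chunksize,
    )
    for i, chunk in enumerate(reader):
        first = (i == 0)
        mode = 'w' if first else 'a'  # overwrite once, then append
        chunk.to_csv(out_1, columns=COLS_1, index=False, header=first, mode=mode)
        chunk.to_csv(out_2, columns=COLS_2, index=False, header=first, mode=mode)
```

A call such as split_in_chunks('sorted.dat', 'output_1.csv', 'output_2.csv') would write both files incrementally. Note that index=False leaves out the row-index column that to_csv writes by default.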

columns = ['id_merged', 'time_1', 'time_2', 'gps_1or3', 'gps_2or4']
df1 = pd.read_csv('output_1.csv', names=columns, skiprows=1)
df2 = pd.read_csv('output_2.csv', names=columns, skiprows=1)

df = pd.concat([df1, df2])  # your final dataframe

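One potential problem is that you may end up with null values in some places, which need to be handled appropriately or they will cause errors later. Another risk is that the new id_merged column will contain duplicate keys, but that is a matter for a separate question.

For more details on the update, see the documentation on merge, join, and concatenate: http://pandas.pydata.org/pandas-docs/stable/merging.html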
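
As a rough illustration of how the null and duplicate concerns could be checked after concatenating, here is a minimal sketch. The two small frames are made-up stand-ins for the data read from output_1.csv and output_2.csv, and the variable names merged, null_counts, dup_ids, and deduped are assumptions. Whether dropping duplicates is appropriate depends on the data.

```python
import pandas as pd

columns = ['id_merged', 'time_1', 'time_2', 'gps_1or3', 'gps_2or4']

# Tiny stand-ins for the frames read from output_1.csv and output_2.csv.
df1 = pd.DataFrame(
    [['00022d3f5b17', 1073260801, 1073260803, 819251, 440006],
     ['00022d9064bc', 1073260801, 1073260803, 819251, 440006]],
    columns=columns)
df2 = pd.DataFrame(
    [['00022d9064bc', 1073260801, 1073260803, 819251, 440006],
     ['00022dba8f51', 1073260801, 1073260803, 819251, 440006]],
    columns=columns)

# ignore_index=True renumbers the rows instead of repeating 0..n-1 twice.
merged = pd.concat([df1, df2], ignore_index=True)

null_counts = merged.isnull().sum()         # number of nulls per column
dup_ids = merged['id_merged'].duplicated()  # True where an id repeats
deduped = merged.drop_duplicates()          # drops fully identical rows only
```

Inspecting null_counts and dup_ids.sum() before further processing gives a quick view of how many values would need handling.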

Answer 1 (score: 1)

This approach is very memory-efficient because it processes only one line at a time. It also does not require Pandas.

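
A minimal sketch of a line-at-a-time split in plain Python, offered as an illustration of the approach this answer describes rather than as the answerer's own code. The function name split_columns, the comma-separated output, and the choice to skip lines that do not have exactly eight fields are assumptions.

```python
def split_columns(src, out_1, out_2):
    """Stream src line by line, writing selected fields to two files.

    Hypothetical helper: only the current line is held in memory.
    """
    with open(src) as fin, open(out_1, 'w') as f1, open(out_2, 'w') as f2:
        for line in fin:
            fields = line.split()
            if len(fields) != 8:
                continue  # skip blank or malformed lines (an assumption)
            id_1, id_2, time_1, time_2, gps_1, gps_2, gps_3, gps_4 = fields
            f1.write(','.join([id_1, time_1, time_2, gps_1, gps_2]) + '\n')
            f2.write(','.join([id_2, time_1, time_2, gps_3, gps_4]) + '\n')
```

For example, split_columns('sorted.dat', 'output_1.csv', 'output_2.csv') would produce the two files without a header row or index column.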