Reading and computing from a .dat file in Python

Date: 2016-06-21 23:47:19

Tags: python csv

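I need to read a .dat file in Python. The file has 12 columns in total and millions of rows. I need to divide columns 2, 3, and 4 by column 1 for my calculation. Before loading the .dat file, do I need to delete all the other columns that I don't need? If not, how do I select specific columns and have Python do the math on them?

An example of the .dat file: data.dat

I am new to Python, so some pointers on opening, reading, and computing would be much appreciated.

Based on your suggestions, I have added the code I am using: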

from sys import argv

import pandas as pd



script, filename = argv

txt = open(filename)

print "Here's your file %r:" % filename
print txt.read()

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

And the error I get from it:

Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file

5 answers:

Answer 0 (score: 3)

After looking at your flash.dat file, it is clear that it needs some cleanup before you process it. The following code converts it to a CSV file:

import csv

# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]

# write it as a new CSV file
with open("./flash.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

Now use Pandas to compute the new column.

import pandas as pd

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

print dataframe

Answer 1 (score: 2)

Try something like:

datContent = [i.strip().split() for i in open("filename.dat").readlines()]

You will then have the data in a list of lists, one inner list of string fields per line.
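With millions of rows, it may be preferable to iterate over the file instead of building the whole list in memory. The following is only a minimal sketch: the function name `column_ratios` and the in-memory sample are hypothetical, and it assumes that header lines start with `#` and that the first four fields of every data row are numeric.

```python
import io


def column_ratios(lines):
    """Yield (col2/col1, col3/col1, col4/col1) for each data row."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header (assumed to start with '#').
        if not fields or fields[0].startswith("#"):
            continue
        c1, c2, c3, c4 = (float(x) for x in fields[:4])
        if c1 == 0:
            # Avoid ZeroDivisionError; mark the row as undefined instead.
            yield (float("nan"),) * 3
        else:
            yield (c2 / c1, c3 / c1, c4 / c1)


# Hypothetical stand-in for open("filename.dat").
sample = io.StringIO(
    "#time x-momentum y-momentum mass\n"
    "2.0 4.0 6.0 8.0\n"
    "0.0 1.0 1.0 1.0\n"
)
rows = list(column_ratios(sample))
```

Passing an open file object instead of `sample` processes one line at a time, so only the results need to fit in memory.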

If you want something more sophisticated, you can use Pandas; see the linked cookbook.

Answer 2 (score: 2)

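Consider using the general read_table() function (of which read_csv() is a special case). With it, pandas can easily import your particular .dat file if you specify a whitespace separator, sep='\s+'. Also, column-by-column calculations do not need a function applied through apply().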

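numpy is used below to handle division by zero. Note that the first column of the sample .dat file is #time, and columns 2, 3, and 4 are x-momentum, y-momentum, and mass (the expressions differ from those in your code, so modify them as needed).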

import pandas as pd
import numpy as np

columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep=r"\s+", usecols=columns_to_keep)

df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
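The same idea can be tried without the real file. This is a minimal sketch on a hypothetical in-memory sample whose header names contain no spaces (the actual flash.dat may differ); the extra `energy` column is invented only to show that `usecols` drops unneeded columns at load time, so no columns have to be deleted beforehand.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample; the real file has more columns and rows.
sample = io.StringIO(
    "#time x-momentum y-momentum mass energy\n"
    "0.0 1.0 2.0 4.0 9.9\n"
    "2.0 4.0 6.0 8.0 9.9\n"
    "4.0 2.0 1.0 8.0 9.9\n"
)

columns_to_keep = ["#time", "x-momentum", "y-momentum", "mass"]
df = pd.read_csv(sample, sep=r"\s+", usecols=columns_to_keep)

# Series.where keeps positive times and replaces the rest with NaN,
# so the division below yields NaN instead of inf for those rows.
time = df["#time"].where(df["#time"] > 0)
for name in ["x-momentum", "y-momentum", "mass"]:
    df[name + "_per_time"] = df[name] / time
```

The divisions operate on whole columns at once, which is generally much faster on millions of rows than calling a Python function per row through `apply()`.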

Answer 3 (score: 2)

train=pd.read_csv("Path",sep=" ::",header=None)

Now you can access the dat file.

train.columns = ["A", "B", "C"]  # one name for each column you see in the dat file

You can then work with it as you would with a csv file.
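As a rough illustration of this approach, here is a sketch on a hypothetical headerless sample that uses `::` as its delimiter (the file in the question is whitespace-delimited, so the separator would need to change). Multi-character separators are handled by the Python parsing engine, which is requested explicitly here.

```python
import io

import pandas as pd

# Hypothetical headerless data with "::" between fields.
sample = io.StringIO("1::10::0.5\n2::20::1.5\n")

train = pd.read_csv(sample, sep="::", header=None, engine="python")
train.columns = ["A", "B", "C"]  # one name per column

# Columns can now be used by name, as with any DataFrame.
train["B_over_A"] = train["B"] / train["A"]
```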

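Answer 4 (score: 1)

The problem you are running into is that the column header names contain spaces. You need to fix or ignore them for pandas.read_csv to behave. The following reads the column header names into a list, relying on the fixed width of the field name strings: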

import pandas

with open('flash.dat') as f:
    header = f.readline()[2:-1]
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:] # remove '#' from time

    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep=r'\s+', names=header_fixed, na_values="Infinity")
    df.fillna(float('inf'), inplace=True)

# processing
df['new_column'] = df['x-momentum'] / df['mass']
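The fixed-width slicing can be checked in isolation. This is a minimal sketch with a synthetic header: the helper `split_fixed_width`, the sample names (including the made-up `E total`), and the assumption that every name is padded to 23 characters are illustrative and not taken from the real file.

```python
import io

import pandas as pd

WIDTH = 23  # assumed width of each header field


def split_fixed_width(header, width):
    """Cut a header line into `width`-character fields and strip the padding."""
    return [header[i:i + width].strip() for i in range(0, len(header), width)]


# Synthetic header: each name padded to WIDTH, one name containing a space.
names = ["#time", "x-momentum", "mass", "E total"]
header = "".join(n.ljust(WIDTH) for n in names)

cols = split_fixed_width(header, WIDTH)
cols[0] = cols[0].lstrip("#")  # drop the leading '#' from the first name

# Hypothetical data rows, whitespace-delimited like the real file.
data = io.StringIO("1.0 2.0 4.0 7.0\n2.0 3.0 6.0 5.0\n")
df = pd.read_csv(data, sep=r"\s+", names=cols)
df["new_column"] = df["x-momentum"] / df["mass"]
```

Splitting the header by position rather than by whitespace is what keeps a name such as `E total` in one piece.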