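I need to read a .dat file in Python. The file has 12 columns in total and millions of rows. I need to separate columns 2, 3 and 4 from column 1 in order to do calculations. So, before loading the .dat file, do I need to delete all the other unwanted columns? If not, how do I selectively declare the columns and have Python do the math on them?
An example of the .dat file: data.dat
I am new to Python, so some pointers on opening, reading and calculating would be much appreciated.
I have added the code I am using, based on your suggestions: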
from sys import argv
import pandas as pd
script, filename = argv
txt = open(filename)
print "Here's your file %r:" % filename
print txt.read()
def your_func(row):
    return row['x-momentum'] / row['mass']
columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
And the error I get from it:
Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file
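Answer 0 (score: 3)
After looking at your flash.dat file, it is clear that you need to do some cleanup before processing it. The following code converts it to a CSV file: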
import csv
# read flash.dat to a list of lists
with open("./flash.dat") as f:
    datContent = [i.strip().split() for i in f]
# write it as a new CSV file
# ("wb" is the Python 2 mode; on Python 3 use open("./flash.csv", "w", newline=""))
with open("./flash.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
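For a file with millions of rows, the same conversion can also be done one line at a time instead of holding the whole file in a list first. This is only a minimal sketch, assuming Python 3 and whitespace-separated fields; the function name `dat_to_csv`, the file names and the sample content are made up for illustration and are not taken from the real flash.dat:

```python
import csv
import os
import tempfile


def dat_to_csv(dat_path, csv_path):
    """Copy a whitespace-delimited file to CSV one line at a time."""
    rows_written = 0
    with open(dat_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.split()
            if fields:  # skip blank lines
                writer.writerow(fields)
                rows_written += 1
    return rows_written


# Hypothetical sample standing in for flash.dat
tmp_dir = tempfile.mkdtemp()
dat_path = os.path.join(tmp_dir, "sample.dat")
csv_path = os.path.join(tmp_dir, "sample.csv")
with open(dat_path, "w") as f:
    f.write("time  mass  x-momentum\n0.5   2.0   4.0\n1.0   2.0   6.0\n")

n_rows = dat_to_csv(dat_path, csv_path)

# Read the result back to check it
with open(csv_path, newline="") as f:
    csv_rows = list(csv.reader(f))
```

Like the list-based version, this splits on any run of whitespace, so it only gives a usable header if the column names themselves contain no spaces.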
Now use Pandas to compute the new column.
import pandas as pd
def your_func(row):
    return row['x-momentum'] / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print dataframe
Answer 1 (score: 2)
Try something like:
datContent = [i.strip().split() for i in open("filename.dat").readlines()]
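You will then have the data in a list of lists, one inner list of strings per line.
If you want something more sophisticated, you can use Pandas; see the linked cookbook.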
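To get from that list of lists to numbers, the wanted columns can be picked by position and converted to float. The following is a minimal sketch, assuming a single header line followed by purely numeric rows; the sample text and the variable names `header`, `rows` and `ratios` are hypothetical, and dividing columns 2 to 4 by column 1 is just one possible calculation:

```python
import io

# Hypothetical stand-in for open("filename.dat")
sample = io.StringIO(
    "t  px  py  m\n"
    "1.0  2.0  3.0  4.0\n"
    "2.0  6.0  8.0  4.0\n"
)

datContent = [line.strip().split() for line in sample]
header, rows = datContent[0], datContent[1:]

# Columns 1-4 of the file sit at zero-based indices 0-3.
# Divide columns 2, 3 and 4 by column 1, row by row.
ratios = [[float(r[j]) / float(r[0]) for j in (1, 2, 3)] for r in rows]
```

With millions of rows, plain lists of floats use a lot of memory, which is one reason the Pandas route may be preferable.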
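Answer 2 (score: 2)
Consider using the general read_table() function (of which read_csv() is a special case). With it, pandas can easily import this particular .dat file by specifying a whitespace separator, sep='\s+'. Also, column-wise calculations do not need apply() with a user-defined function.
numpy is used below to guard against division by zero. Also, in the example .dat file the first column is #time, and columns 2, 3 and 4 are x-momentum, y-momentum and mass (the expressions in the code differ from yours; modify them as needed).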
import pandas as pd
import numpy as np
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep=r"\s+", usecols=columns_to_keep)
df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
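The same idea can be tried without the real file. This is a sketch only: the in-memory sample, its column names (including the `extra` column) and the `safe_time` variable are assumptions made for illustration, and `Series.where` is swapped in for `np.where` as an alternative way to mask the zero rows:

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited sample; the real flash.dat has more columns
sample = io.StringIO(
    "time  x-momentum  y-momentum  mass  extra\n"
    "0.0   1.0         2.0         4.0   9.0\n"
    "2.0   4.0         6.0         8.0   9.0\n"
)

# Only the listed columns are parsed; "extra" is never loaded
columns_to_keep = ["time", "x-momentum", "y-momentum", "mass"]
df = pd.read_csv(sample, sep=r"\s+", usecols=columns_to_keep)

# Whole-column division, no apply(); rows where time is not > 0 become NaN
safe_time = df["time"].where(df["time"] > 0)
df["mass_per_time"] = df["mass"] / safe_time
```

Because `usecols` drops the unwanted columns at parse time, there is no need to delete them from the file beforehand.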
Answer 3 (score: 2)
train = pd.read_csv("Path", sep=" ::", header=None)
Now you can access the dat file.
train.columns = ["A", "B", "C"]  # one name per column in the dat file
You can then use it like a csv file.
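Answer 4 (score: 1)
The problem you are running into here is that the column header names contain spaces. You need to fix or ignore that to make pandas.read_csv behave. The following reads the column header names into a list, based on the fixed length of the field-name strings: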
import numpy as np
import pandas
with open('flash.dat') as f:
    # drop the first two characters and the trailing newline of the header line
    header = f.readline()[2:-1]
    # each field name occupies a fixed width of 23 characters
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:] # remove '#' from time
    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep=r'\s+', names=header_fixed, na_values="Infinity")
# pandas.np was removed in later pandas versions, so use numpy directly
df.fillna(np.inf, inplace=True)
# processing
df['new_column'] = df['x-momentum'] / df['mass']
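The fixed-width slicing can be tried on a small made-up header to see why a plain split() is not enough when names contain spaces. This sketch assumes a field width of 8 characters and three columns, chosen only for illustration (the code above uses 23 and 26 for the real file), and the names `WIDTH`, `N_FIELDS`, `naive` and `fixed` are hypothetical:

```python
WIDTH = 8      # hypothetical field width
N_FIELDS = 3   # hypothetical number of columns

# A header whose second name contains a space
header = "time".ljust(WIDTH) + "x mom".ljust(WIDTH) + "mass".ljust(WIDTH)

# Splitting on whitespace breaks "x mom" into two names
naive = header.split()

# Slicing by fixed width keeps each name whole
fixed = [header[i * WIDTH:(i + 1) * WIDTH].strip() for i in range(N_FIELDS)]
```

Here `naive` ends up with four entries for three columns, which is the kind of mismatch that confuses the parser, while `fixed` has exactly one name per column.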