I have an unusual file produced by a Linux program; the first lines of a sample are:
1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000
2 1011.720000 1830.340000 0 0 0 725670 ? 2.000000
3 1011.720000 1830.340000 0 0 0 1.4378e+06 ? 3.000000
4 1011.720000 1830.340000 0 0 0 2.178e+06 ? 4.000000
5 1011.720000 1830.340000 0 0 0 2.8806e+06 ? 5.000000
6 1011.720000 1830.340000 0 0 0 3.5353e+06 ? 6.000000
7 1011.720000 1830.340000 0 0 0 4.1598e+06 ? 7.000000
8 1011.720000 1830.340000 0 0 0 4.7729e+06 ? 8.000000
9 1011.720000 1830.340000 0 0 0 5.3924e+06 ? 9.000000
10 1011.720000 1830.340000 0 0 0 6.0281e+06 ? 10.000000
I only need to extract two values from each line:
191340
725670
1.4378e+06
2.178e+06
.... etc
1.00000
2.00000
3.00000
4.00000
.... etc
This code:
import csv
with open('NGC1365GaiaPhotomLogTestTenLines.dat', newline='') as infile:
    read = csv.reader(infile)
    for row in read:
        print(row)
produces:
[' 1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000']
[' 2 1011.720000 1830.340000 0 0 0 725670 ? 2.000000']
[' 3 1011.720000 1830.340000 0 0 0 1.4378e+06 ? 3.000000']
[' 4 1011.720000 1830.340000 0 0 0 2.178e+06 ? 4.000000']
[' 5 1011.720000 1830.340000 0 0 0 2.8806e+06 ? 5.000000']
[' 6 1011.720000 1830.340000 0 0 0 3.5353e+06 ? 6.000000']
[' 7 1011.720000 1830.340000 0 0 0 4.1598e+06 ? 7.000000']
[' 8 1011.720000 1830.340000 0 0 0 4.7729e+06 ? 8.000000']
[' 9 1011.720000 1830.340000 0 0 0 5.3924e+06 ? 9.000000']
[' 10 1011.720000 1830.340000 0 0 0 6.0281e+06 ? 10.000000']
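The problem is that each resulting list holds a single string rather than nicely separated items. The items in the input file are separated by spaces, and the number of spaces can vary because the width of the values in the first column can vary too.
I did not expect this to be difficult, but I have consulted many threads without finding a solution.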
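Answer 0: (score: 3)
Contrary to the other answers here, I think you should use the csv module. If the file turns out to contain headers or quoted fields, you will be much happier than if you try to retrofit a custom solution after the fact: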
import csv
with open('filename', newline='') as infile:
    r = csv.reader(infile, delimiter=' ', skipinitialspace=True)
    for row in r:
        print(row)
On my machine your file appears to be tab-delimited. If that is the case, change delimiter=' ' above to delimiter='\t'.
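As a rough sketch of how the csv approach above could pull out just the two wanted columns, the example below reads from an in-memory sample instead of the real file. The names `extract_columns`, `sample`, `col7` and `col9` are made up for illustration, and stripping each line before parsing is an added precaution (a trailing space would otherwise yield an empty last field), not something the answer requires.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the real file;
# note the varying number of leading spaces.
sample = io.StringIO(
    "  1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000\n"
    " 10 1011.720000 1830.340000 0 0 0 6.0281e+06 ? 10.000000\n"
)

def extract_columns(lines):
    """Return two lists of floats: the 7th and the 9th field of every line."""
    # Strip each line first so trailing spaces do not create an empty field.
    stripped = (line.strip() for line in lines)
    reader = csv.reader(stripped, delimiter=' ', skipinitialspace=True)
    col7, col9 = [], []
    for row in reader:
        if len(row) < 9:  # skip blank or short lines
            continue
        col7.append(float(row[6]))  # float() also accepts 6.0281e+06
        col9.append(float(row[8]))
    return col7, col9

col7, col9 = extract_columns(sample)
print(col7)
print(col9)
```

With a real file, passing the object returned by `open(..., newline='')` in place of `sample` should behave the same way.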
You can also use pandas, which has a more general whitespace mode:
import pandas as pd
df = pd.read_csv("filename", header=None, sep=r"\s+")  # same effect as delim_whitespace=True
Answer 1: (score: 2)
Simplified code, following @Eugen Constantin Dinca and @tobias_k:
with open('csv.dat') as infile:
    for row in infile:
        print(row.split())
Output:
['1', '1011.720000', '1830.340000', '0', '0', '0', '191340', '?', '1.000000']
['2', '1011.720000', '1830.340000', '0', '0', '0', '725670', '?', '2.000000']
['3', '1011.720000', '1830.340000', '0', '0', '0', '1.4378e+06', '?', '3.000000']
['4', '1011.720000', '1830.340000', '0', '0', '0', '2.178e+06', '?', '4.000000']
['5', '1011.720000', '1830.340000', '0', '0', '0', '2.8806e+06', '?', '5.000000']
['6', '1011.720000', '1830.340000', '0', '0', '0', '3.5353e+06', '?', '6.000000']
['7', '1011.720000', '1830.340000', '0', '0', '0', '4.1598e+06', '?', '7.000000']
['8', '1011.720000', '1830.340000', '0', '0', '0', '4.7729e+06', '?', '8.000000']
['9', '1011.720000', '1830.340000', '0', '0', '0', '5.3924e+06', '?', '9.000000']
['10', '1011.720000', '1830.340000', '0', '0', '0', '6.0281e+06', '?', '10.000000']
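Since the question only asks for two of the nine fields, a minimal sketch of going one step further with `split()` might look like this. The helper name `parse_line` and the two sample lines are illustrative assumptions; the column positions (7th and 9th) are taken from the sample data in the question.

```python
def parse_line(line):
    """Split one whitespace-separated line and return (7th, 9th) fields as floats."""
    fields = line.split()  # no argument: splits on any run of whitespace
    return float(fields[6]), float(fields[8])

# Two lines copied from the question's sample, with a leading space as in the csv output.
sample_lines = [
    " 1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000",
    " 3 1011.720000 1830.340000 0 0 0 1.4378e+06 ? 3.000000",
]

pairs = [parse_line(line) for line in sample_lines]
print(pairs)
```

Converting with `float()` handles both the plain and the exponent notation; if the text form must be preserved exactly, the strings can be kept instead.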
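Answer 2: (score: 0)
Here is code you can use.
A couple of points about your code: csv.reader is overkill here. Everything below is done with simple built-ins, so no external dependencies are needed.
Also, avoid variable names like read.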
lines = """1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000
2 1011.720000 1830.340000 0 0 0 725670 ? 2.000000
3 1011.720000 1830.340000 0 0 0 1.4378e+06 ? 3.000000
4 1011.720000 1830.340000 0 0 0 2.178e+06 ? 4.000000
5 1011.720000 1830.340000 0 0 0 2.8806e+06 ? 5.000000
6 1011.720000 1830.340000 0 0 0 3.5353e+06 ? 6.000000
7 1011.720000 1830.340000 0 0 0 4.1598e+06 ? 7.000000
8 1011.720000 1830.340000 0 0 0 4.7729e+06 ? 8.000000
9 1011.720000 1830.340000 0 0 0 5.3924e+06 ? 9.000000
10 1011.720000 1830.340000 0 0 0 6.0281e+06 ? 10.000000"""
for line in lines.split("\n"):
    toks = line.split()  # This should split the line into tokens separated by one or more white space characters.
    if len(toks) == 9:  # Just to make sure there are enough tokens.
        # do whatever you want
        print(toks[6])
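If the goal is a properly comma-separated result, one possible follow-up is to write the two wanted tokens back out with `csv.writer`. This is only a sketch: the two-line `lines` string, the `out` buffer and the choice of `lineterminator="\n"` are assumptions made for the example, and an open file could replace the `io.StringIO` buffer.

```python
import csv
import io

# Shortened stand-in for the multi-line string used above.
lines = """1 1011.720000 1830.340000 0 0 0 191340 ? 1.000000
2 1011.720000 1830.340000 0 0 0 725670 ? 2.000000"""

out = io.StringIO()  # in-memory target; a real file object works the same way
writer = csv.writer(out, lineterminator="\n")

for line in lines.split("\n"):
    toks = line.split()
    if len(toks) == 9:  # only keep lines with the expected number of tokens
        writer.writerow([toks[6], toks[8]])  # 7th and 9th fields, kept as text

print(out.getvalue())
```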