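Reading a delimited file with Python loadtxt

Date: 2016-01-21 14:13:36

Tags: python numpy pandas split delimited-text

I have just switched from Matlab to Python and would like to learn how to read this file in Python using loadtxt from the numpy package. (In Matlab I read it with textscan.)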

"07220S006","14/01/12 01:59:50",10,"0"

"07220S006","14/01/12 02:00:00",10,"0"

"07220S006","14/01/12 02:00:10",10,"0"

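I can read this file with the split function from Python's regular expression package, but if my data contains many lines like these, applying split to every line would take a significant amount of parsing time. So I think loadtxt would do better in this case. I have found many solutions for reading similar files, but this file is more complicated than those examples and I don't know how to read it.

Any help or suggestions are appreciated.

2 Answers:

Answer 0: (score: 1)

You can do this easily with pandas, and then access values if you need a numpy array: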

import pandas as pd
from io import StringIO

data = """
"07220S006","14/01/12 01:59:50",10,"0"
"07220S006","14/01/12 02:00:00",10,"0"
"07220S006","14/01/12 02:00:10",10,"0"
"""

df = pd.read_csv(StringIO(data), header=None)

print(df)
           0                  1   2  3
0  07220S006  14/01/12 01:59:50  10  0
1  07220S006  14/01/12 02:00:00  10  0
2  07220S006  14/01/12 02:00:10  10  0


print(df.values)
array([['07220S006', '14/01/12 01:59:50', 10, 0],
       ['07220S006', '14/01/12 02:00:00', 10, 0],
       ['07220S006', '14/01/12 02:00:10', 10, 0]], dtype=object)
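
As a follow-up sketch, the same read_csv call can also name the columns and keep the quoted fields as text, so the quoted "0" in the last column stays a string rather than becoming an integer. The column names below (station, timestamp, value, flag) are hypothetical, since the file has no header row, and the in-memory string stands in for a real file path.

```python
import pandas as pd
from io import StringIO

raw = (
    '"07220S006","14/01/12 01:59:50",10,"0"\n'
    '"07220S006","14/01/12 02:00:00",10,"0"\n'
    '"07220S006","14/01/12 02:00:10",10,"0"\n'
)

# Hypothetical column names: the original file has no header row.
columns = ["station", "timestamp", "value", "flag"]

df = pd.read_csv(
    StringIO(raw),                        # pass a file path here for a real file
    header=None,                          # first line is data, not a header
    names=columns,
    dtype={"station": str, "flag": str},  # keep the quoted fields as text
)

# Mixed column types, so this is an object array with one row per input line.
records = df.to_numpy()
```

Here df["value"] is still parsed as an integer column, while df["flag"] holds the string "0".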

Edit

IIUC, you want to split the date column into date and time (or year, month, etc.). You can first convert that column to datetime objects with pd.to_datetime and then read the datetime properties through the dt accessor:

date_col = pd.to_datetime(df[1])

print(date_col.dt.year)
0    2012
1    2012
2    2012
Name: 1, dtype: int64

Or you can convert it to a string in whatever format you want with dt.strftime, for example:

print(date_col.dt.strftime("%Y/%m %H:%M"))
0    2012/01 01:59
1    2012/01 02:00
2    2012/01 02:00
Name: 1, dtype: object

And write it to a new column:

df['year'] = date_col.dt.year

print(df)
           0                  1   2  3  year
0  07220S006  14/01/12 01:59:50  10  0  2012
1  07220S006  14/01/12 02:00:00  10  0  2012
2  07220S006  14/01/12 02:00:10  10  0  2012
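
Because a value like 14/01/12 is ambiguous, it may be safer to pass the format explicitly instead of letting pandas guess. The sketch below assumes the timestamps are day/month/two-digit-year, which matches the 2012/01 output above; the variable names are only illustrative.

```python
import pandas as pd

stamps = pd.Series([
    "14/01/12 01:59:50",
    "14/01/12 02:00:00",
    "14/01/12 02:00:10",
])

# Assumption: day/month/two-digit year, then a 24-hour time.
parsed = pd.to_datetime(stamps, format="%d/%m/%y %H:%M:%S")

# Split into separate date and time parts via the dt accessor.
dates = parsed.dt.date
times = parsed.dt.time
```

With an explicit format, a row that does not match raises an error instead of being silently parsed with a different day/month order.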


Answer 1: (score: 0)

Treat any value in quotes as a string, and use numpy.genfromtxt instead (it handles missing values better):

import numpy as np
from io import StringIO

example_data = '"07220S006","14/01/12 01:59:50",10,"0"\n"07220S006","14/01/12 02:00:00",10,"0"\n"07220S006","14/01/12 02:00:10",10,"0"'
# approximation of your input data

data = np.genfromtxt(StringIO(example_data), delimiter=',', dtype='U16,U24,i4,U3')
# dtypes: Ux - Unicode string of up to x characters, i4 - 32 bit integer
# the second field needs more than 16 characters because the quotes are kept
# more here: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

print(data)
[('"07220S006"', '"14/01/12 01:59:50"', 10, '"0"')
 ('"07220S006"', '"14/01/12 02:00:00"', 10, '"0"')
 ('"07220S006"', '"14/01/12 02:00:10"', 10, '"0"')]

I can't think of a simple way to remove the quotes using numpy. I think using pandas as in the answer above, or Python's CSV reader, is probably a better solution.
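
For the CSV reader route, a minimal sketch using the standard library csv module might look like this. csv.reader treats the double quote as its default quote character, so the quotes are stripped while parsing. The in-memory string stands in for an open file, and converting the third column to int is an assumption about how the data will be used.

```python
import csv
from io import StringIO

raw = (
    '"07220S006","14/01/12 01:59:50",10,"0"\n'
    '"07220S006","14/01/12 02:00:00",10,"0"\n'
    '"07220S006","14/01/12 02:00:10",10,"0"\n'
)

# csv.reader yields each line as a list of strings with the quotes removed.
reader = csv.reader(StringIO(raw), delimiter=",")

# Every field arrives as str; convert the numeric third column explicitly.
parsed = [
    (station, stamp, int(value), flag)
    for station, stamp, value, flag in reader
]
```

For a real file, open it with open(path, newline="") and pass the file object to csv.reader in place of StringIO(raw).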