Question

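I have a dataset with many columns, and I'm only interested in analyzing the data in six of them. It is in a txt file; I'd like to load the file and pull out the following columns (0, 1, 2, 4, 6, 7) with the headers (time, mode, event, xcoord, ycoord, phi). There are ten columns in total; here is a sample of the data: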

1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000

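When I parse the data into columns with the following code, it seems to only count the data, but I want to be able to list the data for further analysis. This is the code I am using, from @alko: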

import pandas as pd
df = pd.read_csv('filtered.txt', header=None, false_values=None, sep=r'\s+')[[0, 1, 2, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print(df)

Here is what that code returns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 115534 entries, 0 to 115533
Data columns (total 6 columns): 
time      115534  non-null values
mode      115534  non-null values
event     115534  non-null values
xcoord    115534  non-null values
ycoord    115534  non-null values
phi       115534  non-null values
dtypes: float64(3), int64(2), object(1)

So the goal is to pull these 6 columns out of the original 10, label them, and list them.

Answer 1

You can use pandas' read_csv parser:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

s = """1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076339   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076342   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076346   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076350   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076353   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000
1385940076356   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000"""

# Parse the whitespace-separated columns, then keep only the six of interest.
df = pd.read_csv(StringIO(s), header=None, sep=r'\s+')[[0, 2, 3, 4, 6, 7]]
df.columns = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']
print(df)
# Output (float formatting may differ slightly between pandas versions):
#             time mode           event  xcoord     ycoord  phi
# 0  1385940076332    M  subject_avatar     -30 -59.028107  180
# 1  1385940076336    M  subject_avatar     -30 -59.028107  180
# 2  1385940076339    M  subject_avatar     -30 -59.028107  180
# 3  1385940076342    M  subject_avatar     -30 -59.028107  180
# 4  1385940076346    M  subject_avatar     -30 -59.028107  180
# 5  1385940076350    M  subject_avatar     -30 -59.028107  180
# 6  1385940076353    M  subject_avatar     -30 -59.028107  180
# 7  1385940076356    M  subject_avatar     -30 -59.028107  180

Note that I corrected the column indices, because the ones you gave in the question appear to be incorrect.
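
The summary in the question looks like what older pandas releases printed when a frame was too large for the display limits, so the data was most likely parsed correctly and only the printing was abbreviated. As a minimal sketch, not the original answerer's method, the columns can be selected while reading with `usecols`, and the whole frame rendered with `to_string()`. The names `sample`, `wanted`, `labels` and `listing` are made up for illustration, and the indices are the ones used in the answer above.

```python
import pandas as pd
from io import StringIO

# Two rows copied from the question's sample; a real run would pass the
# file path ('filtered.txt' in the question) instead of StringIO(sample).
sample = (
    "1385940076332   3   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000\n"
    "1385940076336   2   M   subject_avatar  -30.000000  1.000000    -59.028107  180.000000  0.000000    0.000000\n"
)

# Assumption: these are the indices from the answer above; swap in
# [0, 1, 2, 4, 6, 7] if the question's original mapping is the intended one.
wanted = [0, 2, 3, 4, 6, 7]
labels = ['time', 'mode', 'event', 'xcoord', 'ycoord', 'phi']

# usecols makes the parser keep only the listed columns (in file order).
df = pd.read_csv(StringIO(sample), header=None, sep=r'\s+', usecols=wanted)
df.columns = labels

# to_string() renders every row and column regardless of display limits,
# so even a large frame is listed in full rather than summarized.
listing = df.to_string()
print(listing)
```

For a frame with over a hundred thousand rows, printing `df.head()` or writing the result out with `df.to_csv(...)` is usually more practical than dumping the full listing to the console.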

Parsing specific columns from a dataset in Python
