Question

Suppose I have a CSV file containing the following data:

'time'  'speed'
0       2.3
0       3.4
0       4.1
0       2.1
1       1.3
1       3.5
1       5.1
1       1.1
2       2.3
2       2.4
2       4.4
2       3.9

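I would like to process this file so that, for each distinct value under the 'time' header, I find the maximum value in the 'speed' column and return that speed next to its time value in an array. The actual CSV file I am working with is much larger, so I want to iterate over a large amount of data rather than handle only the cases where 'time' is 0, 1, or 2.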

So basically I would like this returned:

array([[0,4.1], [1,5.1],[2,4.4]])

Specifically using numpy.

Answer 1

Doing this in a fully vectorized way in NumPy is a bit tricky. Here is one option:

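import numpy

# Load the file into a 1-D structured array with fields "time" and "speed".
a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
# Sort in place: by time first, then by speed within each time.
a.sort()
unique_times = numpy.unique(a["time"])
# Index of the last (largest-speed) row for each time value.
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]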

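This loads the data into a one-dimensional array with two fields and sorts it first. Sorting a structured array compares the fields in order, so the rows end up ordered by time and then by speed. The result is an array whose data is grouped by time, with the maximum speed value always last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.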
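To make the steps above easy to try without a file on disk, here is a minimal sketch that runs them on the sample data from the question. The in-memory array `data`, the explicit `order=` argument, and the final conversion to a plain 2-D array (`result_2d`) are assumptions added for illustration, not part of the original answer.

```python
import numpy as np

# Sample data from the question, held in a structured array
# (an in-memory stand-in for the CSV file; this setup is an assumption).
data = np.array(
    [(0, 2.3), (0, 3.4), (0, 4.1), (0, 2.1),
     (1, 1.3), (1, 3.5), (1, 5.1), (1, 1.1),
     (2, 2.3), (2, 2.4), (2, 4.4), (2, 3.9)],
    dtype=[("time", float), ("speed", float)],
)

# Sort in place by time, then by speed, so the largest speed
# is the last row of each time group.
data.sort(order=["time", "speed"])

# For each distinct time, locate the last row that has that time.
unique_times = np.unique(data["time"])
last = data["time"].searchsorted(unique_times, side="right") - 1
result = data[last]

# Convert the structured result to a plain 2-D array of [time, max speed].
result_2d = np.column_stack([result["time"], result["speed"]])
print(result_2d)  # rows: [0, 4.1], [1, 5.1], [2, 4.4]
```

The sort dominates the cost at O(n log n); the lookup itself is a binary search per distinct time value, so no Python-level loop over the rows is needed.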

Answer 2

pandas is well suited to this kind of thing:

>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time  speed
... 0       2.3
... 0       3.4
... 0       4.1
... 0       2.1
... 1       1.3
... 1       3.5
... 1       5.1
... 1       1.1
... 2       2.3
... 2       2.4
... 2       4.4
... 2       3.9
... """), sep=r"\s+")
>>> df
    time  speed
0      0    2.3
1      0    3.4
2      0    4.1
3      0    2.1
4      1    1.3
5      1    3.5
6      1    5.1
7      1    1.1
8      2    2.3
9      2    2.4
10     2    4.4
11     2    3.9

[12 rows x 2 columns]

Once you have the DataFrame, all you need to do is groupby time and aggregate the maximum speed:

>>> df.groupby('time')['speed'].aggregate('max')
time
0       4.1
1       5.1
2       4.4
Name: speed, dtype: float64
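Since the question asks for a NumPy array of [time, max speed] pairs, the grouped result can be converted back. The following is a small sketch, assuming the same sample data built directly as a DataFrame; the variable names `max_speed` and `result` are illustrative only.

```python
import pandas as pd

# Same sample data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "time":  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "speed": [2.3, 3.4, 4.1, 2.1, 1.3, 3.5, 5.1, 1.1, 2.3, 2.4, 4.4, 3.9],
})

# Maximum speed per time value; the result is a Series indexed by time.
max_speed = df.groupby("time")["speed"].max()

# Move the time index back into a column and convert to a 2-D NumPy array.
result = max_speed.reset_index().to_numpy()
print(result)  # rows: [0, 4.1], [1, 5.1], [2, 4.4]
```

Because the columns mix integers and floats, the resulting array is upcast to floats, so the time values appear as 0.0, 1.0, and 2.0.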

CSV data - maximum value of column segments using numpy
