Question

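I'm trying to find an efficient way to pair together rows of data containing integer points and store them as Python objects. The data is made up of X and Y coordinate points, represented as comma-separated strings. The points have to be paired, as in (x_1, y_1), (x_2, y_2), ... and so on, and then stored as a list of objects, where each point is an object. The get_data function below generates this example data: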

def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 10)) for x in range(M)],
                [str(random.randint(1, 10)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

My current parsing code is:

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

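Currently, this takes nearly 7 seconds for 100,000 points, which seems very inefficient. Part of the inefficiency seems to stem from the calculation of first_points, second_points and paired_points, and from the conversion of these into objects.

Another part of the inefficiency seems to be the building up of all_point_sets. Taking out the all_point_sets.append(...) line seems to make the code go from ~7 seconds to 2 seconds!

How can this be sped up? Thanks.

FOLLOW UP: Thanks for everyone's great suggestions; they were all helpful. But even with all the improvements, it still takes about 3 seconds to process 100,000 entries. I'm not sure why in this case it isn't just instant, or whether there is an alternative representation that would make it instant. Would coding this in Cython change things? Could someone offer an example of that? Thanks again.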

Answer 1

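When dealing with the creation of large numbers of objects, often the single biggest performance enhancement you can use is to turn the garbage collector off. Every "generation" of objects, the garbage collector traverses all the live objects in memory, looking for objects that are part of cycles but are not pointed at by live objects, and are thus eligible for memory reclamation. See Doug Hellmann's PyMOTW GC article for some information (more can perhaps be found with Google and some determination). The garbage collector runs by default every 700 or so objects created but not reclaimed, with subsequent generations running somewhat less often (I forget the exact details).

Using a standard tuple instead of a Point class can save you some time (using a namedtuple is somewhere in between), and clever unpacking can save you some time, but the largest gain comes from turning the gc off before you create lots of objects that you know don't need to be gc'd, and then turning it back on afterwards.

Some code: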

def orig_test_gc_off():
    import time
    data = get_data()
    all_point_sets = []
    import gc
    gc.disable()
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "gc off total time: ", (time_end - time_start)

def test1():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    gc.disable()
    for index, row in enumerate(data):
        first_points, second_points = row
        curr_points = map(
            Point,
            [int(i) for i in first_points.split(",")],
            [int(i) for i in second_points.split(",")])
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 1 total time: ", (time_end - time_start)

def test2():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    gc.disable()
    time_start = time.time()
    for index, row in enumerate(data):
        first_points, second_points = row
        first_points = [int(i) for i in first_points.split(",")]
        second_points = [int(i) for i in second_points.split(",")]
        curr_points = [(x, y) for x, y in zip(first_points, second_points)]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 2 total time: ", (time_end - time_start)

orig_test()  # the question's test(), renamed
orig_test_gc_off()
test1()
test2()

Some results:

>>> %run /tmp/flup.py
total time:  6.90738511086
gc off total time:  4.94075202942
variant 1 total time:  4.41632509232
variant 2 total time:  3.23905301094
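The disable/enable pair in the functions above leaves the collector off if parsing raises an exception. A minimal sketch of one way to guard against that, written for Python 3, is shown below; `gc_paused` and `parse_rows` are hypothetical names that do not appear in the original answer, and the sketch builds plain tuples rather than Point instances.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable the cyclic garbage collector."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state even if the body raises.
        if was_enabled:
            gc.enable()


def parse_rows(data):
    """Pair up the x and y strings of each row into a list of (x, y) tuples."""
    with gc_paused():
        return [
            list(zip(map(int, xs.split(",")), map(int, ys.split(","))))
            for xs, ys in data
        ]


rows = parse_rows([["1,2,3", "4,5,6"], ["7,8", "9,10"]])
print(rows)
```

Only the cycle detector is paused here; reference counting still frees objects as usual, which is why this is safe for data that contains no reference cycles.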

Answer 2

Running with pypy makes a big difference

$ python pairing_strings.py 
total time:  2.09194397926
$ pypy pairing_strings.py 
total time:  0.764246940613

Disabling gc didn't help on pypy

$ pypy pairing_strings.py 
total time:  0.763386964798

Using a namedtuple for Point makes it worse

$ pypy pairing_strings.py 
total time:  0.888827085495

Using itertools.imap and itertools.izip

$ pypy pairing_strings.py 
total time:  0.615751981735

Using a memoized version of int and an iterator to avoid the zip

$ pypy pairing_strings.py 
total time:  0.423738002777

Here is the code I finished with.

def test():
    import time
    def m_int(s, memo={}):
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        point_set = []
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
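The m_int helper above keeps its cache in a mutable default argument. As a hedged alternative, the same memoization can be sketched with a dict subclass whose `__missing__` hook performs the conversion on first lookup; `IntCache` and `parse_row` are hypothetical names, and the sketch returns tuples rather than Point instances.

```python
class IntCache(dict):
    """Maps a numeric string to its int value, converting on first use only."""

    def __missing__(self, key):
        value = self[key] = int(key)
        return value


def parse_row(xs, ys, cache):
    # Indexing the cache replaces the int() call for strings seen before.
    return [(cache[x], cache[y]) for x, y in zip(xs.split(","), ys.split(","))]


cache = IntCache()
points = parse_row("1,10,1", "3,3,10", cache)
print(points, sorted(cache))
```

Memoization pays off here because get_data draws every value from a small range, so the same few strings recur constantly; with widely varying values the cache would mostly grow without being hit.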

Answer 3

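I would

use numpy arrays for this problem (Cython would be an option, if this is still not fast enough).
store the points as a vector, not as single Point instances.
rely on existing parsers
(if possible) parse the data once and then store it in a binary format like hdf5 for further calculations, which will be the fastest option (see below)

Numpy has built-in functions to read text files, for instance loadtxt. If you have the data stored in a structured array, you do not necessarily need to convert it to another data type. I will use Pandas, which is a library built on top of numpy. It is a bit more convenient for handling and processing structured data. Pandas has its own file parser, read_csv.

To time it, I wrote the data to a file, as in your original problem (it is based on your get_data):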

import numpy as np
import pandas as pd

def create_example_file(n=100000, m=20):
    # size must be a tuple of ints; use n rather than the float literal 10e4
    ex1 = pd.DataFrame(np.random.randint(1, 10, size=(n, m)),
                       columns=(['x_%d' % x for x in range(m // 2)] +
                                ['y_%d' % y for y in range(m // 2)]))
    ex1.to_csv('example.csv', index=False, header=False)
    return

This is the code I used to read the data into a pandas.DataFrame:

def with_read_csv(csv_file):
    df = pd.read_csv(csv_file, header=None,
                     names=(['x_%d' % x for x in range(10)] +
                            ['y_%d' % y for y in range(10)]))
    return df

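(Note that I assumed there is no header in your file, so I had to create the column names.)

Reading the data is fast and should be more memory efficient (see this question), and the data is stored in a data structure you can work with further in a fast, vectorized way: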

In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop

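There is a new C based parser in the development branch, which takes 414 ms on my system. Your test takes 2.29 s on my system, but it is not really comparable, because the data is not read from a file and you create the Point instances.

Once you have read the data in, you can store it in an hdf5 file: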

In [19]: store = pd.HDFStore('example.h5')

In [20]: store['data'] = df

In [21]: store.close()

The next time you need the data you can read it from this file, which is really fast:

In [1]: store = pd.HDFStore('example.h5')

In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop

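It will only pay off, however, if you need the same data more than once.

Using numpy-based arrays with large data sets has advantages when you do further calculations. Cython wouldn't necessarily be faster if you can use vectorized numpy functions and indexing; it will be faster if you really need iteration (see also this answer).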

Answer 4

A faster method, using Numpy (a speedup of about 7x):

import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
arr = arr.reshape(100000, 2, 10).transpose((0, 2, 1))

Performance comparison:

import gc

def load_1(data):
    all_point_sets = []
    gc.disable()
    for xs, ys in data:
        all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
    gc.enable()
    return all_point_sets

def load_2(data):
    txt = ','.join(','.join(row) for row in data)
    arr = np.fromstring(txt, dtype=int, sep=',')
    return arr.reshape(100000, 2, 10).transpose((0,2,1))

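load_1 runs in 1.52 seconds on my machine; load_2 runs in 0.20 seconds, a 7-fold improvement. The big caveat here is that it requires that (1) you know the lengths of everything in advance, and (2) every row contains exactly the same number of points. This is true for your get_data output, but may not be true for your real dataset.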
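Both caveats can be checked before attempting the reshape. The following is a minimal stdlib sketch, assuming every field is a non-empty comma-separated string; `uniform_point_count` is a hypothetical helper that is not part of the original answer.

```python
def uniform_point_count(data):
    """Return the per-row point count, or raise ValueError if rows differ."""
    counts = set()
    for xs, ys in data:
        # Counting separators avoids splitting the strings a second time.
        counts.add(xs.count(",") + 1)
        counts.add(ys.count(",") + 1)
    if len(counts) != 1:
        raise ValueError("rows differ in length: %s" % sorted(counts))
    return counts.pop()


m = uniform_point_count([["1,2,3", "4,5,6"], ["7,8,9", "1,2,3"]])
print(m)
```

With a uniform count m in hand, the hard-coded shape in load_2 could be written as (len(data), 2, m) instead of (100000, 2, 10).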

Answer 5

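I got a 50% improvement by using arrays and a holder object that lazily constructs Point objects when they are accessed. I also "slotted" the Point object for better storage efficiency. However, a tuple would probably be better.

Changing the data structure may also help, if that is possible. But this will never be instantaneous.

import time
from array import array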

class Point(object):
    __slots__ = ["a", "b"]
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __repr__(self):
        return "Point(%d, %d)" % (self.a, self.b)

class Points(object):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def __getitem__(self, i):
        return Point(self.xs[i], self.ys[i])

def test3():
    xs = array("i")
    ys = array("i")
    time_start = time.time()
    for row in data:
        xs.extend([int(val) for val in row[0].split(",")])
        ys.extend([int(val) for val in row[1].split(",")])
    print ("total time: ", (time.time() - time_start))
    return Points(xs, ys)

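But when dealing with large amounts of data I would usually use numpy N-dimensional arrays (ndarray). If the original data structure could be altered, that would likely be the fastest of all: for example, if it could be structured so that the x,y pairs are read in linearly and the ndarray is then reshaped.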
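The Points holder above flattens every row into two long arrays, so row boundaries are lost. A hedged sketch of a variant that keeps them is shown below, assuming every row holds the same number of points; `PointRows`, `add_row` and the `m` parameter are hypothetical and are not taken from the answer.

```python
from array import array


class Point(object):
    __slots__ = ("a", "b")

    def __init__(self, a, b):
        self.a = a
        self.b = b


class PointRows(object):
    """Flat x/y storage; Point objects are built only when a row is read."""

    def __init__(self, m):
        self.m = m                # points per row (assumed constant)
        self.xs = array("i")
        self.ys = array("i")

    def add_row(self, xs, ys):
        self.xs.extend(map(int, xs.split(",")))
        self.ys.extend(map(int, ys.split(",")))

    def __len__(self):
        return len(self.xs) // self.m

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, stop = i * self.m, (i + 1) * self.m
        return [Point(x, y)
                for x, y in zip(self.xs[start:stop], self.ys[start:stop])]


rows = PointRows(m=3)
rows.add_row("1,2,3", "4,5,6")
rows.add_row("7,8,9", "10,11,12")
second = rows[1]
print([(p.a, p.b) for p in second])
```

The cost of creating Point instances is deferred rather than removed: it is paid each time a row is indexed, so this helps most when only part of the data is ever read.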

Answer 6

Make Point a namedtuple (~10% speedup):

from collections import namedtuple
Point = namedtuple('Point', 'a b')

Unpack during iteration (~2-4% speedup):
```
for xs, ys in data:
```

Use the n-argument form of map to avoid the zip (~10% speedup):

curr_points = map(Point,
    map(int, xs.split(',')),
    map(int, ys.split(',')),
)

Given that the point sets are short, generators are probably overkill, as they have a higher fixed overhead.
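Put together, the three tips might look like the sketch below. It is written for Python 3, where map returns a lazy iterator, so the result is wrapped in list() to get the same list the Python 2 code produces; the `parse` function name is an assumption.

```python
from collections import namedtuple

Point = namedtuple("Point", "a b")


def parse(data):
    all_point_sets = []
    for xs, ys in data:                  # unpack while iterating
        curr_points = list(map(          # list() because Python 3 map is lazy
            Point,
            map(int, xs.split(",")),
            map(int, ys.split(",")),
        ))
        all_point_sets.append(curr_points)
    return all_point_sets


result = parse([["1,2", "3,4"]])
print(result)
```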

Answer 7

cython is able to speed things up by a factor of 5.5

$ python split.py
total time:  2.16252303123
total time:  0.393486022949

Here is the code I used

split.py

import time
import pyximport; pyximport.install()
from split_ import test_


def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 100)) for x in range(M)],
                [str(random.randint(1, 100)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data

class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test(data):
    all_point_sets = []
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    return all_point_sets

data = get_data()
for func in test, test_:
    time_start = time.time()
    res = func(data)
    time_end = time.time()
    print "total time: ", (time_end - time_start)

split_.pyx

from libc.string cimport strsep
from libc.stdlib cimport atoi

cdef class Point:
    cdef public int a,b

    def __cinit__(self, a, b):
        self.a = a
        self.b = b

def test_(data):
    cdef char *xc, *yc, *xt, *yt
    cdef char **xcp, **ycp
    all_point_sets = []
    for xs, ys in data:
        xc = xs
        xcp = &xc
        yc = ys
        ycp = &yc
        point_set = []
        while True:
            xt = strsep(xcp, ',')
            if xt is NULL:
                break
            yt = strsep(ycp, ",")
            point_set.append(Point(atoi(xt), atoi(yt)))
        all_point_sets.append(point_set)
    return all_point_sets

Poking around further, I can approximately break down some of the cpu resources

         5% strsep()
         9% atoi()
        23% creating Point instances
        35% all_point_sets.append(point_set)

I would expect some improvement if cython were able to read from a csv (or other) file directly instead of having to trawl through Python objects.

Answer 8

You can shave off a few seconds:

class Point2(object):
    __slots__ = ['a','b']
    def __init__(self, a, b):
        self.a = a
        self.b = b

def test_new(data):
    all_point_sets = []
    for row in data:
        first_points, second_points = row
        r0 = map(int, first_points.split(","))
        r1 = map(int, second_points.split(","))
        cp = map(Point2, r0, r1)
        all_point_sets.append(cp)

which gave me

In [24]: %timeit test(d)
1 loops, best of 3: 5.07 s per loop

In [25]: %timeit test_new(d)
1 loops, best of 3: 3.29 s per loop

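I intermittently shaved off another 0.3 seconds by preallocating space in all_point_sets, but that could just be noise. And of course there is the old-fashioned way of making things faster: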

localhost-2:coding $ pypy pointexam.py
1.58351397514
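The preallocation mentioned above is not shown in the answer. One way it could look, as a sketch under that assumption, is to reserve every slot up front and assign by index instead of appending; `parse_preallocated` is a hypothetical name and the sketch stores tuples rather than Point2 instances.

```python
def parse_preallocated(data):
    # Reserve one slot per row, then fill by index instead of appending.
    all_point_sets = [None] * len(data)
    for i, (xs, ys) in enumerate(data):
        all_point_sets[i] = list(zip(map(int, xs.split(",")),
                                     map(int, ys.split(","))))
    return all_point_sets


result = parse_preallocated([["1,2", "3,4"], ["5,6", "7,8"]])
print(result)
```

As the answer itself notes, any gain from this may be within measurement noise.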

Answer 9

The data is a tab-separated file, which consists of lists of comma-separated integers.

Using the sample get_data() I made a .csv file like this:

1,6,2,8,2,3,5,9,6,6     10,4,10,5,7,9,6,1,9,5
6,2,2,5,2,2,1,7,7,9     7,6,7,1,3,7,6,2,10,5
8,8,9,2,6,10,10,7,8,9   4,2,10,3,4,4,1,2,2,9
...

Then I abused the C-optimized parsing of the json module:

def test2():
    import json
    import time
    time_start = time.time()
    with open('data.csv', 'rb') as f:
        data = f.read()
    data = '[[[' + ']],[['.join(data.splitlines()).replace('\t', '],[') + ']]]'
    all_point_sets = [Point(*xy) for row in json.loads(data) for xy in zip(*row)]
    time_end = time.time()
    print "total time: ", (time_end - time_start)

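Results on my box: your original test() takes ~8s, or ~6s with gc disabled, while my version (I/O included) gives ~6s and ~4s respectively. That is roughly a 50% speedup. But looking at the profiler data, it is obvious that the biggest bottleneck is the object instantiation itself, so Matt Anderson's answer would net you the most gain on CPython.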
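The string rewriting step is easier to follow on a small in-memory sample. The sketch below is written for Python 3 and works on a str rather than on bytes read from a file; `parse_tab_separated` is a hypothetical name, and unlike test2 above it keeps one list of pairs per row instead of flattening all points into a single list.

```python
import json


def parse_tab_separated(text):
    """Parse lines of 'x1,x2,...<TAB>y1,y2,...' into per-row (x, y) tuples."""
    # Each line "1,2\t3,4" becomes the JSON fragment "[[1,2],[3,4]]".
    body = "]],[[".join(text.splitlines()).replace("\t", "],[")
    rows = json.loads("[[[" + body + "]]]")
    return [list(zip(xs, ys)) for xs, ys in rows]


sample = "1,6,2\t10,4,10\n6,2,2\t7,6,7\n"
parsed = parse_tab_separated(sample)
print(parsed)
```

The trick depends on the fields containing nothing but integers and commas; any other content would produce invalid JSON.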

Answer 10

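How attached are you to having your coordinates accessible as .x and .y attributes? To my surprise, my tests show that the biggest single time sink was not the calls to list.append() but the construction of the Point objects. They take four times as long to construct as a tuple, and there are a lot of them. Simply replacing Point(int(x), int(y)) with the tuple (int(x), int(y)) in your code shaved over 50% off the total execution time (Python 2.6 on Win XP). Perhaps your current code still has room to optimize this?

If you are really set on accessing the coordinates with .x and .y, you can try using collections.namedtuple. It's not as fast as plain tuples, but it seems to be much faster than the Point class in your code (I'm hedging because a separate timing benchmark gave me weird results).

from collections import namedtuple

Pair = namedtuple("Pair", "x y")  # instead of the Point class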
...
curr_points = [ Pair(x, y) for x, y in paired_points ]

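If you need to go this route, it also pays off to derive a class from tuple (minimal cost over a plain tuple). I can provide details if requested.

PS I see that @MattAnderson mentioned the object vs. tuple issue long ago. But it is a major effect (on my box at least), even before disabling garbage collection.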

               Original code: total time:  15.79
      tuple instead of Point: total time:  7.328
 namedtuple instead of Point: total time:  9.140
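The answer does not spell out what deriving a class from tuple would look like. One possible reading, offered only as a sketch, is a tuple subclass with empty `__slots__` and read-only properties for the two fields; `TuplePoint` is a hypothetical name, and no timing claim is made for it.

```python
from operator import itemgetter


class TuplePoint(tuple):
    """A tuple subclass exposing its two items as .x and .y."""

    __slots__ = ()          # no per-instance __dict__

    def __new__(cls, x, y):
        return tuple.__new__(cls, (x, y))

    # Read-only attribute access implemented as plain index lookups.
    x = property(itemgetter(0))
    y = property(itemgetter(1))


p = TuplePoint(3, 4)
print(p.x, p.y, p == (3, 4))
```

This is essentially the structure that collections.namedtuple generates, so it is worth timing both on your own data before choosing.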

Answer 11

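I don't know that there's much you can do.

You can use generators to avoid the extra memory allocations. This gives me about a 5% speedup.

import itertools

first_points  = (int(p) for p in first_points .split(","))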
second_points = (int(p) for p in second_points.split(","))
paired_points = itertools.izip(first_points, second_points)
curr_points   = [Point(x, y) for x,y in paired_points]

Even collapsing the entire loop into one giant list comprehension doesn't do much.

all_point_sets = [
    [Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
    for xs, ys in data
]

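If you go on to iterate over this big list, then you could turn it into a generator. That spreads out the cost of parsing the CSV data so you don't take a big hit up front.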

all_point_sets = (
    [Point(int(x), int(y)) for x, y in itertools.izip(xs.split(','), ys.split(','))]
    for xs, ys in data
)
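For readers on Python 3, where itertools.izip no longer exists because the built-in zip is already lazy, the same generator idea can be sketched as a function; `iter_point_sets` is a hypothetical name.

```python
class Point(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b


def iter_point_sets(data):
    """Yield one parsed row at a time instead of building the full list."""
    for xs, ys in data:
        # Python 3's built-in zip is lazy, like itertools.izip in Python 2.
        yield [Point(int(x), int(y))
               for x, y in zip(xs.split(","), ys.split(","))]


data = [["1,2", "3,4"], ["5,6", "7,8"]]
total = sum(p.a + p.b for point_set in iter_point_sets(data) for p in point_set)
print(total)
```

Each row is parsed only when the consumer asks for it, so the total work is unchanged but no large list is held in memory at once.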

Answer 12

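The time taken by built-in functions such as zip(a,b) or map(int, string.split(",")) on arrays of length 2000000 is negligible, so I have to presume that the most time-consuming operation is append.

Thus the proper way to address the problem is to concatenate the strings recursively:
10 strings of 10 elements into a bigger string
10 strings of 100 elements
10 strings of 1000 elements

and finally zip(map(int,huge_string_a.split(",")),map(int,huge_string_b.split(",")));

Then it is just a matter of fine tuning to find the optimal base N for the append-and-conquer method.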
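The answer gives no code or timings for this idea. A minimal Python 3 sketch of its final step is shown below, under the assumption that a single str.join call is an acceptable stand-in for the staged concatenation, since join builds its result in one pass; `parse_joined` is a hypothetical name.

```python
def parse_joined(data):
    """Join all rows into two big strings, then split and convert once."""
    huge_string_a = ",".join(xs for xs, ys in data)
    huge_string_b = ",".join(ys for xs, ys in data)
    # The result is one flat list of pairs; row boundaries are lost.
    return list(zip(map(int, huge_string_a.split(",")),
                    map(int, huge_string_b.split(","))))


flat = parse_joined([["1,2,3", "4,5,6"], ["7,8", "9,10"]])
print(flat)
```

Note that this returns every point in a single list, so it only fits the question's requirements if the grouping into per-row point sets can be recovered separately.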

Answer 13

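There are many good answers here. One side of this problem that has not been addressed so far, however, is the difference in list-to-string time cost between the various iterator implementations in Python.

There is an essay on Python.org (essays: list2str) that tests the efficiency of different iterators for list-to-string conversion. Bear in mind that when I ran into similar optimization problems, but with different data structures and sizes, the results presented in the essay did not all scale up at the same rate, so it is worth testing the different iterator implementations for your particular use case.

Speeding up pairing of strings into objects in Python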