Question

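I have a large file with 1.9E8 lines to parse.

On each iteration I create a temporary dictionary and pass it to another method, which produces the output I want.

The file is too big to open with the readlines() method.

So my last resort for making this faster is the parsing step itself.

I already have two options for generating the dictionary. optionB performs better than optionA. I know I could try a regular expression, but I am not familiar with them. I am open to insights on better alternatives, if there are any.

Expected input: "A@1:100;2:240;...:.." The input can be longer, with more groups and their frequencies.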

def optionA(line):
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    return dict(map(lambda i: i.split(":"), info.split(";")))

Expected output: {'1': '100', '2': '240'}

I am open to any recommendations!
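On the readlines() constraint: a file object can be iterated one line at a time, so the whole file never has to sit in memory. The sketch below is only an illustration under assumptions: `parse_line` repeats the optionA logic, `iter_dicts` is a hypothetical helper name, and an in-memory `io.StringIO` stands in for the real file. Note that lines read from a file keep their trailing newline, which has to be stripped or it ends up in the last value.

```python
import io

def parse_line(line):
    # Same logic as optionA, with the trailing newline stripped first.
    _id, info = line.rstrip("\n").split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def iter_dicts(lines):
    # `lines` is any iterable of strings. An open file object yields one
    # line at a time, so only the current line is held in memory.
    for line in lines:
        if line.strip():  # skip blank lines
            yield parse_line(line)

# Stand-in for an open file (the real file name is not given in the question).
sample = io.StringIO("A@1:100;2:240\nB@3:250\n")
results = list(iter_dicts(sample))
print(results)
```

With a real file, the same generator would be fed the object returned by `open(...)` inside a `with` block, and each dictionary would be handed to the downstream method as it is produced.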

Answer 1

A quick example of a regular expression that parses the line:

>>> import re
>>> line = 'A@1:100;2:240'
>>> data = re.search(r'@(\d+):(\d+);(\d+):(\d+)',line).groups()
>>> D = {data[0]:data[1],data[2]:data[3]}
>>> D
{'1': '100', '2': '240'}

Some timings. The functions under test, saved as x.py:

import re
regex = re.compile(r'@(\d+):(\d+);(\d+):(\d+)')

def optionA(line):
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    return dict(map(lambda i: i.split(":"), info.split(";")))

def optionC(line):
    data = regex.search(line).groups()
    return {data[0]:data[1],data[2]:data[3]}

line = 'A@1:100;2:240'

Results:

C:\>py -m timeit -s "import x" "x.optionA(x.line)"
100000 loops, best of 3: 3.01 usec per loop
C:\>py -m timeit -s "import x" "x.optionB(x.line)"
100000 loops, best of 3: 5.15 usec per loop
C:\>py -m timeit -s "import x" "x.optionC(x.line)"
100000 loops, best of 3: 2.88 usec per loop

Edit: since the requirements changed slightly, I tried findall for optionC, along with a slightly different version of optionA (optionAA):

import re
regex = re.compile(r'(\d+):(\d+)')

def optionA(line):
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionAA(line):
    data = {}
    for g_info in line[2:].split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    return dict(map(lambda i: i.split(":"), info.split(";")))

def optionC(line):
    return dict(regex.findall(line))

line = 'A@1:100;2:240;3:250;4:260;5:100;6:100;7:100;8:100;9:100;10:100'

Timings:

C:\>py -m timeit -s "import x" "x.optionA(x.line)"
100000 loops, best of 3: 8.35 usec per loop
C:\>py -m timeit -s "import x" "x.optionAA(x.line)"
100000 loops, best of 3: 8.17 usec per loop
C:\>py -m timeit -s "import x" "x.optionB(x.line)"
100000 loops, best of 3: 12.3 usec per loop
C:\>py -m timeit -s "import x" "x.optionC(x.line)"
100000 loops, best of 3: 12.8 usec per loop

So it looks like the modified optionA (optionAA) wins for this particular line. I hope this shows the importance of measuring your algorithms. I was surprised that optionC was slower.
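The same kind of measurement can also be run from inside a script with the timeit module, rather than from the command line. This is a rough sketch, not the setup used for the numbers in this answer: `option_split` and `option_findall` are illustrative names for the split-based and findall-based approaches, and the loop counts are arbitrary. Results will vary by machine and Python version.

```python
import re
import timeit

PAIR = re.compile(r'(\d+):(\d+)')

def option_split(line):
    # Split-based parsing: split on "@", then on ";" and ":".
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def option_findall(line):
    # Regex-based parsing: collect every "digits:digits" pair.
    return dict(PAIR.findall(line))

line = 'A@1:100;2:240;3:250;4:260;5:100;6:100;7:100;8:100;9:100;10:100'

# Check that both approaches agree before comparing their speed.
assert option_split(line) == option_findall(line)

NUMBER = 10000  # calls per run (arbitrary)
timings = {}
for func in (option_split, option_findall):
    # repeat() returns one total per run; the minimum is the least noisy.
    runs = timeit.repeat(lambda: func(line), number=NUMBER, repeat=3)
    timings[func.__name__] = min(runs) / NUMBER * 1e6
    print("%s: %.2f usec per loop" % (func.__name__, timings[func.__name__]))
```

Timing against a line that resembles the real data matters here, since the ranking of the approaches can change with line length.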

Answer 2

Here is a simple example that uses a compiled regular expression to match your pattern.

import re

s = "A@1:100;2:240"
# Raw string so that \d reaches the regex engine unchanged.
compiledre = re.compile(r"A@(\d+):(\d+);(\d+):(\d+)$")
res = compiledre.search(s)
if res:
    print(dict([(res.group(1), res.group(2)), (res.group(3), res.group(4))]))

The output is:

{'1': '100', '2': '240'}
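A pattern with four fixed groups only handles lines with exactly two pairs, while the question says lines may contain more. One possible variation, offered as a sketch rather than a definitive solution, validates the whole line first and then collects any number of pairs with findall. It assumes that keys and values are always digits, as in the sample input, and that the id is any text before the "@". The names `PAIR`, `LINE` and `parse` are made up for this example.

```python
import re

# One "key:value" pair, and a whole line made of one or more such pairs.
PAIR = re.compile(r'(\d+):(\d+)')
LINE = re.compile(r'[^@]+@\d+:\d+(?:;\d+:\d+)*')

def parse(line):
    # Return a dict for a well-formed line, or None if it does not match.
    line = line.strip()
    if LINE.fullmatch(line) is None:
        return None
    _id, info = line.split("@", 1)
    return dict(PAIR.findall(info))

print(parse("A@1:100;2:240"))
print(parse("A@1:100;2:240;3:250"))
print(parse("A@1:100;oops"))
```

The validation step costs extra time per line, so for 1.9E8 lines it is worth measuring whether the safety is needed or whether the input can be trusted.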

What is the best way to create a Python dictionary from a string?
