Question

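I am parsing very large compressed files (5 GB to 2 TB) and storing some of the data in CSV files using the algorithm below. It works, but with three nested loops it is the opposite of efficient.

Since I am not familiar with the json and yaml libraries Python provides, I am not sure what a few of these lines cost: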

k = yaml.load(json.dumps(v))

In case you did not notice, I already call yaml.load() above that line:

header = yaml.load(json.dumps(header))

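It seems I have to call the function twice, because the inner leaves (values) of the keys in header are interpreted as strings.

When I simply print the value of v at the line for k, v in header.iteritems():, the output usually looks like one of the following lines: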

[{'value': ['4-55251088-0 0NNN RT(1535855435726 0) q(0 -1 -1 -1) r(0 -1)'], 'key': 'x_iinfo'}]
[{'value': ['timeout=60'], 'key': 'keep_alive'}, {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'}]
[{'value': ['W/"12765-1490784752000"'], 'key': 'etag'}, {'value': ['Sun, 02 Sep 2018 02:27:16 GMT'], 'key': 'date'}]
[{'value': ['Sun, 02 Sep 2018 02:30:32 GMT'], 'key': 'date'}]

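So basically, our files contain a category named "unknown", which is a JSON tree that includes everything that has no specific category.

Is there a better way to get all of these values without adding two more loops that slow the algorithm down?

Full source of the method: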

def convertJsonHeadersToCSV(jsonFilePath, CSVFilePath,portNum, protocol):
  try:
    bodyPattern = re.compile('<(html|!DOCTYPE).*$', re.IGNORECASE | re.MULTILINE)
    csvFile = open(CSVFilePath, 'w')
    print("Converting " + protocol + " file to csv, please wait...")
    spinner.start()
    csvWriter = unicodecsv.writer(csvFile)
    csvWriter.writerow(['ip', 'date', 'protocol', 'port', 'data'])
    chunk_size = 128 * 1024 * 1024
    with lz4.frame.open(jsonFilePath, 'r') as f:
      for line in f:
        try:
          text = ""
          jsonData = json.loads(line)
          ts = jsonData['timestamp'][:10]
          ip = jsonData['ip']
          data = jsonData['data']['http']
          if 'response' in data:
            if 'headers' in data['response']:
              header = jsonData['data']['http']['response']['headers']
              header = yaml.load(json.dumps(header))
              for k, v in header.iteritems():
                if 'unknown' in k:
                  #print(v)
                  k = yaml.load(json.dumps(v))
                  for i in k:
                    #print(str(i['key']) + ": "+str(i['value']) + "\r\n")
                    text = text + str(str(i['key']) + ": "+str(i['value']) + "\r\n")
                else:
                  text = text + str(str(k) + ": "+str(v) + "\r\n")
              #csvWriter.writerow([ip, ts, protocol, portNum, text])

        except:#sometimes will run into a unicode error, still working on handling this exception.
          pass
    csvFile.close()
    spinner.stop()
    print("Completed conversion of " + protocol + " file.")
  except Exception as ex:
    spinner.stop()
    traceback.print_exc()
    print("An error occurred while converting the file, moving on to the next task...")

Answer 1

One thing that will certainly speed this up a great deal is to stop building text as a string, because these lines:

    text = text + str(str(i['key']) + ": "+str(i['value']) + "\r\n")
else:
  text = text + str(str(k) + ": "+str(v) + "\r\n")

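perform string concatenation. Since strings are immutable, a new copy has to be made each time (using text += instead of text = text + does not help either), and the larger the string to copy, the slower it gets (quadratic complexity).

It would be better to:

define text as an empty list
append to the list
use "".join at the end

like so: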
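To see how much this matters on a given interpreter, the two approaches can be timed side by side. The following is only a rough sketch: `build_with_concat`, `build_with_join`, and the sample line content are hypothetical names and data made up for illustration, not part of the original code. CPython can sometimes optimize repeated concatenation in place, so the measured gap may be smaller than the worst case suggests; it is worth measuring on real data.

```python
import timeit


def build_with_concat(n):
    """Build the text by repeated string concatenation."""
    text = ""
    for i in range(n):
        text = text + str(i) + ": value\r\n"
    return text


def build_with_join(n):
    """Build the text by collecting pieces in a list and joining once."""
    parts = []
    for i in range(n):
        parts.append(str(i) + ": value\r\n")
    return "".join(parts)


if __name__ == "__main__":
    n = 20000  # arbitrary size chosen for the illustration
    print("concat:", timeit.timeit(lambda: build_with_concat(n), number=5))
    print("join:  ", timeit.timeit(lambda: build_with_join(n), number=5))
```

Both functions return the same string; only the way the pieces are accumulated differs.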

for line in f:
  try:
    text = []   # define an empty list at start
    jsonData = json.loads(line)

Then append instead of concatenating (using str.format here would also be an improvement, but that is minor):

       text.append(str(str(i['key']) + ": "+str(i['value']) + "\r\n"))
    else:
      text.append(str(str(k) + ": "+str(v) + "\r\n"))

Finally, "mutate" text into a string like this:

text = "".join(text)

or just

csvWriter.writerow([ip, ts, protocol, portNum, "".join(text)])
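Putting the pieces together, the header loop could look roughly like the sketch below. It rests on several assumptions that are not in the original: `headers_to_text` is a hypothetical helper name, the header is assumed to already be a parsed dict shaped like the printed output in the question, the `server` entry and the IP address are invented sample data, and the standard-library csv module with an in-memory buffer stands in for unicodecsv and the real output file. It also uses str.format and `.items()` so that it runs on Python 3.

```python
import csv
import io


def headers_to_text(header):
    """Flatten a parsed header dict into "key: value" lines using a list and one join."""
    parts = []
    for name, value in header.items():
        if 'unknown' in name:
            # The "unknown" category holds a list of {'key': ..., 'value': ...} dicts.
            for entry in value:
                parts.append("{}: {}\r\n".format(entry['key'], entry['value']))
        else:
            parts.append("{}: {}\r\n".format(name, value))
    return "".join(parts)


# Invented sample data, shaped like the output printed in the question.
sample_header = {
    'server': ['nginx'],
    'unknown': [
        {'value': ['timeout=60'], 'key': 'keep_alive'},
        {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'},
    ],
}

out = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(out)
writer.writerow(['ip', 'date', 'protocol', 'port', 'data'])
writer.writerow(['192.0.2.1', '2018-09-02', 'http', 80, headers_to_text(sample_header)])
```

Keeping the flattening in its own function also makes it easy to time or test separately from the decompression and JSON parsing.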
