I'm trying to speed up my current implementation of a function that converts a [UInt32] into a [UInt8], which in turn gets split into a [[UInt8]] holding 6 bytes at each index.
My implementation:
extension Array {
    func splitBy(subSize: Int) -> [[Element]] {
        return 0.stride(to: self.count, by: subSize).map { startIndex in
            let endIndex = startIndex.advancedBy(subSize, limit: self.count)
            return Array(self[startIndex ..< endIndex])
        }
    }
}
func convertWordToBytes(fullW: [UInt32]) -> [[UInt8]] {
    var combined8 = [UInt8]()

    // Convert 17 [UInt32] to 68 [UInt8]
    for i in 0...16 {
        _ = 24.stride(through: 0, by: -8).map {
            combined8.append(UInt8(truncatingBitPattern: fullW[i] >> UInt32($0)))
        }
    }

    // Split [UInt8] to [[UInt8]] with 6 values at each index.
    let combined48 = combined8.splitBy(6)
    return combined48
}
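To make the expected shape concrete (just an illustration, the real input is arbitrary 32-bit words): 17 input words give 68 bytes, split into 11 sub-arrays of 6 bytes plus a final one of 2.

let words = [UInt32](count: 17, repeatedValue: 0xAABBCCDD)
let bytes = convertWordToBytes(words)
// bytes.count == 12
// bytes[0]  == [0xAA, 0xBB, 0xCC, 0xDD, 0xAA, 0xBB]
// bytes[11] == [0xCC, 0xDD]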
This function will be called millions of times in my program, and its speed is a huge burden.
Does anyone have any ideas? Thanks.
Answer (score: 1)
If you profile the code (Cmd + I), you will see that most of the time is spent in various "copy to buffer" functions. That happens when you append a new element to an array that has run out of its initially allocated space, so it has to be moved to a spot on the heap with more room. The moral of the story: heap allocations are slow but unavoidable with arrays. Try to do as few of them as possible.
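One quick, partial mitigation that keeps the original structure is to reserve the final capacity before the append loop, so the backing buffer is allocated once instead of growing repeatedly. A minimal sketch of that idea (reserveCapacity only, nothing else changed):

var combined8 = [UInt8]()
combined8.reserveCapacity(fullW.count * 4)   // 17 words -> 68 bytes, one allocation up front

The bigger win, though, is to allocate the final nested result once and only assign into it.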
Try this:
func convertWordToBytes2(fullW: [UInt32]) -> [[UInt8]] {
    let subSize = 6

    // We allocate the array only once per run since allocation is so slow.
    // There will only be assignment to it after.
    var combined48 = [UInt8](count: fullW.count * 4, repeatedValue: 0).splitBy(subSize)

    var row = 0
    var col = 0
    for i in 0...16 {
        for j in 24.stride(through: 0, by: -8) {
            let value = UInt8(truncatingBitPattern: fullW[i] >> UInt32(j))
            combined48[row][col] = value
            col += 1
            if col >= subSize {
                row += 1
                col = 0
            }
        }
    }
    return combined48
}
Benchmark code:
let testCases = (0..<1_000_000).map { _ in
    (0..<17).map { _ in arc4random() }
}

testCases.forEach {
    convertWordToBytes($0)
    convertWordToBytes2($0)
}
Results (on my 2012 iMac):
Weight Self Weight Symbol Name
9.35 s 53.2% 412.00 ms specialized convertWordToBytes([UInt32]) -> [[UInt8]]
3.28 s 18.6% 344.00 ms specialized convertWordToBytes2([UInt32]) -> [[UInt8]]
By eliminating the repeated allocations we have already cut the running time by 60%. But each test case is independent, which makes this a perfect fit for parallel processing on today's multi-core CPUs. The modified loop...:
dispatch_apply(testCases.count, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0)) { i in
    convertWordToBytes2(testCases[i])
}
... shaves off about another second when executed on a quad-core i7 using 8 threads:
Weight Self Weight Symbol Name
2.28 s 6.4% 0 s _dispatch_worker_thread3 0x58467
2.24 s 6.3% 0 s _dispatch_worker_thread3 0x58463
2.22 s 6.2% 0 s _dispatch_worker_thread3 0x58464
2.21 s 6.2% 0 s _dispatch_worker_thread3 0x58466
2.21 s 6.2% 0 s _dispatch_worker_thread3 0x58465
2.21 s 6.2% 0 s _dispatch_worker_thread3 0x58461
2.18 s 6.1% 0 s _dispatch_worker_thread3 0x58462
The savings are not as big as I had hoped. Apparently there is some contention when accessing the heap memory. For anything faster, you should explore a C-based solution.
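Short of dropping into C, one direction worth sketching is to stay in Swift but eliminate the nested arrays entirely: write the bytes into a single flat [UInt8] that the caller allocates once and reuses. The function name and the reusable scratch parameter below are only illustrative (same Swift 2 syntax as above), and this sketch is unbenchmarked:

// Writes the big-endian bytes of each word into a caller-supplied flat buffer.
// `scratch` must hold at least fullW.count * 4 bytes; reusing it across calls
// keeps the hot path free of heap allocations.
func convertWordToBytesFlat(fullW: [UInt32], inout scratch: [UInt8]) {
    var k = 0
    for word in fullW {
        scratch[k]     = UInt8(truncatingBitPattern: word >> 24)
        scratch[k + 1] = UInt8(truncatingBitPattern: word >> 16)
        scratch[k + 2] = UInt8(truncatingBitPattern: word >> 8)
        scratch[k + 3] = UInt8(truncatingBitPattern: word)
        k += 4
    }
}

// Usage: allocate the buffer once, outside the million-iteration loop.
var scratch = [UInt8](count: 17 * 4, repeatedValue: 0)
convertWordToBytesFlat(testCases[0], scratch: &scratch)

The groups of 6 then become plain index ranges into scratch rather than separate heap-allocated sub-arrays.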