我正在尝试与金属swift并行计算大数组的总和。
有没有神的方法呢?
我的飞机是我将数组划分为子数组,并行计算一个子数组的总和,然后当并行计算完成时计算子总和的总和。
例如,如果我有
array = [a0,....an]
我在子数组中划分数组:
array_1 = [a_0,...a_i],
array_2 = [a_i+1,...a_2i],
....
array_n/i = [a_n-1, ... a_n]
这个数组的总和是并行计算的,我得到了
sum_1, sum_2, sum_3, ... sum_n/1
最后只计算子和的总和。
我创建运行我的金属着色器的应用程序,但有些事情我不太了解。
var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]
// get device
let device: MTLDevice! = MTLCreateSystemDefaultDevice()
// get library
let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()
// queue
let commandQueue:MTLCommandQueue! = device.newCommandQueue()
// function
let kernerFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")
// pipeline with function
let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernerFunction)
// buffer for function
let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()
// encode function
let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()
// add function to encode
commandEncoder.setComputePipelineState(pipelineState)
// options
let resourceOption = MTLResourceOptions()
let arrayBiteLength = array.count * array[0].count * sizeofValue(array[0][0])
let arrayBuffer = device.newBufferWithBytes(&array, length: arrayBiteLength, options: resourceOption)
commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)
var result:[Float] = [0,0,0]
let resultBiteLenght = sizeofValue(result[0])
let resultBuffer = device.newBufferWithBytes(&result, length: resultBiteLenght, options: resourceOption)
commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)
let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)
let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let data = NSData(bytesNoCopy: resultBuffer.contents(), length: sizeof(Float), freeWhenDone: false)
data.getBytes(&result, length: result.count * sizeof(Float))
print(result)
是我的Swift代码,
我的着色器是:
kernel void calculateSum(const device float *inFloat [[buffer(0)]],
device float *result [[buffer(1)]],
uint id [[ thread_position_in_grid ]]) {
float * f = inFloat[id];
float sum = 0;
for (int i = 0 ; i < 3 ; ++i) {
sum = sum + f[i];
}
result = sum;
}
我不知道如何定义inFloat是数组数组。 我不确切知道什么是threadGroupSize和threadGroups。 我不知道着色器属性中的设备和uint是什么。
这是正确的方法吗?
答案 0 :(得分:17)
我花时间用Metal创建了这个问题的完整工作示例。解释在评论中:
let count = 10_000_000
let elementsPerSum = 10_000
// Data type, has to be the same as in the shader
typealias DataType = CInt
let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
var start, end : UInt64
var result : DataType = 0
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
result += elem
}
end = mach_absolute_time()
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
for elem in buffer {
result += elem
}
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
我用我的Mac测试它,但它在iOS上应该可以正常工作。
输出:
Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018
Metal版本的速度提高了约7倍。我确定你可以通过截断或其他任何方式实施分而治之的方式来获得更快的速度。
答案 1 :(得分:1)
可接受的答案令人讨厌地缺少为其编写的内核。 The source is here,但这是可以作为快捷命令行应用程序运行的完整程序和着色器。
/*
* Command line Metal Compute Shader for data processing
*/
import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
result += elem
}
end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
for elem in buffer {
result += elem
}
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
#include <metal_stdlib>
using namespace metal;
typedef unsigned int uint;
typedef int DataType;
kernel void parsum(const device DataType* data [[ buffer(0) ]],
const device uint& dataLength [[ buffer(1) ]],
device DataType* sums [[ buffer(2) ]],
const device uint& elementsPerSum [[ buffer(3) ]],
const uint tgPos [[ threadgroup_position_in_grid ]],
const uint tPerTg [[ threads_per_threadgroup ]],
const uint tPos [[ thread_position_in_threadgroup ]]) {
uint resultIndex = tgPos * tPerTg + tPos;
uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end
for (; dataIndex < endIndex; dataIndex++)
sums[resultIndex] += data[dataIndex];
}
答案 2 :(得分:0)
我一直在运行该应用程序。在gt 740(384核)与采用多线程矢量和实现的i7-4790上,这是我的数据:
Metal lap time: 19.959092
cpu MT lap time: 4.353881
这是cpu的5/1比率, 因此,除非您拥有使用着色器的强大GPU,否则不值得。
我一直在使用igpu intel hd 4000的i7-3610qm中测试相同的代码,令人惊讶的是,对于金属,结果要好得多:2/1
编辑:在调整了线程参数之后,我终于提高了GPU性能,现在达到了16xcpu