I have a series of signals, each of length n = 36,000, that I need to cross-correlate. My current CPU implementation in numpy is a bit slow. I've heard that PyTorch can greatly speed up tensor operations and provides a way to run computations in parallel on the GPU. I'd like to explore that option, but I'm not quite sure how to accomplish this with the framework.
Because of the length of these signals, I'd rather perform the cross-correlation in the frequency domain.
I normally do the following with numpy:
import numpy as np
from scipy.fft import next_fast_len

signal_length = 36000

# make the signals
signal_1 = np.random.uniform(-1, 1, signal_length)
signal_2 = np.random.uniform(-1, 1, signal_length)

# output target length of crosscorrelation
x_cor_sig_length = signal_length * 2 - 1

# get optimized array length for fft computation
fast_length = next_fast_len(x_cor_sig_length)

# move data into the frequency domain. axis=-1 to perform
# along the last dimension
fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
fft_2 = np.fft.rfft(signal_2, fast_length, axis=-1)

# take the complex conjugate of one of the spectrums. Which one you
# choose depends on domain-specific conventions
fft_1 = np.conj(fft_1)
fft_multiplied = fft_1 * fft_2

# back to time domain.
prelim_correlation = np.fft.irfft(fft_multiplied, x_cor_sig_length, axis=-1)

# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = np.real(np.fft.fftshift(prelim_correlation, axes=-1)).astype(np.float64)
Looking at the PyTorch documentation, there doesn't seem to be an equivalent of numpy.conj(). I'm also not sure if/how I need to perform the fftshift after the irfft operation.
So how would you write a 1D cross-correlation in PyTorch using the Fourier method?
Answer 1 (score: 0)
A few things to consider.
The Python interpreter is very slow; what those vectorized libraries do is move the workload to a native implementation. To make any difference you need to be able to perform many operations in a single Python instruction. Evaluating things on the GPU follows the same principle: while the GPU has more compute power, copying data to and from the GPU is slow.
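As a minimal, hypothetical sketch of that principle (the timings will vary by machine, but the single vectorized call amortizes the interpreter overhead over the whole batch):

import time
import numpy as np

x = np.random.uniform(-1, 1, (100, 36000))

t0 = time.time()
sums_loop = [np.sum(row) for row in x]   # one interpreter round-trip per row
t1 = time.time()
sums_vec = np.sum(x, axis=-1)            # a single native call for all rows
t2 = time.time()
print(f'loop: {t1 - t0:.4f}s  vectorized: {t2 - t1:.4f}s')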
I adapted your example to process multiple signals at the same time.
import numpy as np

def numpy_xcorr(BATCH=1, signal_length=36000, factors=[2, 3, 5, 7], dtype=np.float64):
    # make the signals
    signal_1 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)
    signal_2 = np.random.uniform(-1, 1, (BATCH, signal_length)).astype(dtype)

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length * 2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # move data into the frequency domain. axis=-1 to perform
    # along the last dimension
    fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
    # mix in a bit of signal_1 just to make the cross correlation more interesting
    fft_2 = np.fft.rfft(signal_2 + 0.1 * signal_1, fast_length, axis=-1)

    # take the complex conjugate of one of the spectrums.
    fft_1 = np.conj(fft_1)
    fft_multiplied = fft_1 * fft_2

    # back to time domain.
    prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)

    # shift the signal to make it look like a proper crosscorrelation,
    # and transform the output to be purely real
    final_result = np.fft.fftshift(np.real(prelim_correlation), axes=-1)
    return final_result, np.sum(final_result)
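For example, once the next_fast_len helper defined below is in scope, a quick shape check (72000 is the smallest 7-smooth length not below 2*36000 - 1 = 71999):

out, total = numpy_xcorr(BATCH=4)
print(out.shape)  # (4, 72000) with the default factors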
Starting with Torch 1.7 we have the torch.fft module, which provides an interface similar to numpy.fft. fftshift is missing, but the same result can be obtained with torch.roll. Another point is that numpy uses 64-bit precision by default, while Torch defaults to 32-bit.
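As a minimal sketch of the torch.roll trick: for an even length n, fftshift is just a circular shift by n//2 along the transformed axis.

import torch

x = torch.arange(6.)
shifted = torch.roll(x, x.shape[-1] // 2, dims=-1)
print(shifted)  # tensor([3., 4., 5., 0., 1., 2.]), same as np.fft.fftshift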
Picking a fast length comes down to choosing smooth numbers (numbers that factor into small primes; I assume you are familiar with the topic).
def next_fast_len(n, factors=[2, 3, 5, 7]):
    '''
    Returns the minimum integer not smaller than n that can
    be written as a product (possibly with repetitions) of
    the given factors.
    '''
    best = float('inf')
    stack = [1]
    while len(stack):
        a = stack.pop()
        if a >= n:
            if a < best:
                best = a
            if best == n:
                break  # no reason to keep searching
        else:
            for p in factors:
                b = a * p
                if b < best:
                    stack.append(b)
    return best
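A quick sanity check of the helper (the values follow from the definition: 72000 = 2^6 · 3^2 · 5^3 and 73728 = 2^13 · 3^2):

print(next_fast_len(71999))          # 72000, 7-smooth
print(next_fast_len(71999, [2, 3]))  # 73728, 3-smooth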
Then the torch implementation:
import torch
import torch.fft

def torch_xcorr(BATCH=1, signal_length=36000, device='cpu', factors=[2, 3, 5], dtype=torch.float):
    # torch.rand is random in the range (0, 1)
    signal_1 = 1 - 2 * torch.rand((BATCH, signal_length), device=device, dtype=dtype)
    signal_2 = 1 - 2 * torch.rand((BATCH, signal_length), device=device, dtype=dtype)

    # just make the cross correlation more interesting
    signal_2 += 0.1 * signal_1

    # output target length of crosscorrelation
    x_cor_sig_length = signal_length * 2 - 1

    # get optimized array length for fft computation
    fast_length = next_fast_len(x_cor_sig_length, factors)

    # n=fast_length zero-pads the signals; dim=-1 transforms the last axis
    fft_1 = torch.fft.rfft(signal_1, fast_length, dim=-1)
    fft_2 = torch.fft.rfft(signal_2, fast_length, dim=-1)

    # take the complex conjugate of one of the spectrums. Which one you
    # choose depends on domain-specific conventions
    fft_multiplied = torch.conj(fft_1) * fft_2

    # back to time domain.
    prelim_correlation = torch.fft.irfft(fft_multiplied, fast_length, dim=-1)

    # shift the signal along the last axis to make it look like a proper
    # crosscorrelation (torch.roll stands in for the missing fftshift)
    final_result = torch.roll(prelim_correlation, fast_length // 2, dims=-1)
    return final_result, torch.sum(final_result)
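To check that the two pipelines agree, here is a minimal cross-check; it mirrors the functions above rather than calling them, since each one generates its own random data:

import numpy as np
import torch

s1 = np.random.uniform(-1, 1, (1, 1000))
s2 = 0.5 * s1

n = next_fast_len(2 * 1000 - 1, [2, 3, 5])  # 2000
ref = np.fft.fftshift(
    np.fft.irfft(np.conj(np.fft.rfft(s1, n)) * np.fft.rfft(s2, n), n), axes=-1)

t1, t2 = torch.from_numpy(s1), torch.from_numpy(s2)
out = torch.roll(
    torch.fft.irfft(torch.conj(torch.fft.rfft(t1, n)) * torch.fft.rfft(t2, n), n),
    n // 2, dims=-1)

print(np.allclose(ref, out.numpy()))  # expect True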
Here is the code to test the results:
import time

funcs = {'numpy-f64': lambda b: numpy_xcorr(b, factors=[2, 3, 5], dtype=np.float64),
         'numpy-f32': lambda b: numpy_xcorr(b, factors=[2, 3, 5], dtype=np.float32),
         'torch-cpu-f64': lambda b: torch_xcorr(b, device='cpu', factors=[2, 3, 5], dtype=torch.float64),
         'torch-cpu': lambda b: torch_xcorr(b, device='cpu', factors=[2, 3, 5], dtype=torch.float32),
         'torch-gpu-f64': lambda b: torch_xcorr(b, device='cuda', factors=[2, 3, 5], dtype=torch.float64),
         'torch-gpu': lambda b: torch_xcorr(b, device='cuda', factors=[2, 3, 5], dtype=torch.float32),
        }

times = {}
for batch in [1, 10, 100]:
    times[batch] = {}
    for l, f in funcs.items():
        t0 = time.time()
        t1, t2 = f(batch)
        tf = time.time()
        del t1
        del t2
        times[batch][l] = 1000 * (tf - t0) / batch
I got the following results.
What surprised me was the result when the numbers are not so smooth: with a 17-smooth length, for example, the torch implementation is so much better that I used a logarithmic scale here (torch on the GPU with batch size 100 was 10000 times faster than numpy with batch size 1).
Remember that these functions generate the data on the GPU in general, and we want to copy the final results to the CPU. If you take into account the time needed to copy the final result to the CPU, I observed times up to 10x higher than the cross-correlation computation itself (random data generation + three FFTs).
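As a hedged sketch of how one could fold that copy (and the CUDA synchronization that the timing loop above omits) into the measurement:

import time
import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()               # make sure no GPU work is pending
    t0 = time.time()
    result, total = torch_xcorr(BATCH=100, device='cuda')
    result_cpu = result.cpu()              # copy the final result to the host
    torch.cuda.synchronize()
    t1 = time.time()
    print(f'{1000 * (t1 - t0) / 100:.3f} ms per signal, copy included')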