Question

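In a university course on configurable embedded systems (on a ZYNQ-7010), we recently implemented a (naive) low-pass image filter that applies a 1-dimensional Gaussian kernel (0.25 * [1 2 1]) to data coming from block RAM.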

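We decided to cache (i.e. queue) three pixels and then operate on them on-line in the data output process. Our first approach was to use three process variables and roll them over like this:

pixel[k-2] := pixel[k-1];
pixel[k-1] := pixel[k];
pixel[k]   := RAM(address);

The full process is: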

process (clk25)
    -- queue
    variable pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');

    -- temporaries
    variable r : unsigned(3 downto 0);
    variable g : unsigned(3 downto 0);
    variable b : unsigned(3 downto 0);
begin
    if clk25'event and clk25 = '1' then
        pixelMinus2  := pixelMinus1;
        pixelMinus1  := pixelCurrent;
        pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

        IF slv_reg0(3) = '0' THEN 
            -- bypass filter for debugging
            dob <= pixelCurrent;
        ELSE
            -- colors are 4 bit each in a 12 bit vector
            -- division by 4 is done by right shifting by 2
            r := (
                          ("00" & unsigned(pixelMinus2(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelCurrent(11 downto 10)))
                    );

            g :=  (
                          ("00" & unsigned(pixelMinus2(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelCurrent(7 downto 6)))
                    );

            b :=  (
                          ("00" & unsigned(pixelMinus2(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelCurrent(3 downto 2)))
                    );

            dob <= std_logic_vector(r) & std_logic_vector(g) & std_logic_vector(b);
        END IF;
    end if;
end process;

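This, however, turned out to be horribly wrong: synthesis took ages and resulted in an estimated LUT usage of about 130% of the device's capacity.

We later changed the implementation to use signals instead of variables, which resolved all the problems: the hardware behaved as expected and LUT usage dropped to a few percent.

My question is what causes the problem when using variables, since, as far as we understand, it should work that way.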

Answer 1

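Signals, being a means of inter-process communication, have carefully crafted assignment semantics that avoid race conditions and hazards. For details, see this Q&A and this link to "VHDL's crown jewel".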

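So when you assign to pixelCurrent (as a signal)

pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));

the assignment does not take effect until the process suspends (in RTL code, usually when the process reaches its end and waits on the sensitivity list), and the result is not visible inside this process until its next wakeup at if rising_edge(clk25). So this creates a pipeline register.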

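Variables in a VHDL process behave like variables in any other imperative language (C etc.): as soon as they are updated, their new value is available.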

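Thus the following:

pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

IF slv_reg0(3) = '0' THEN 
    -- bypass filter for debugging
    dob <= pixelCurrent;

propagates the NEW value of pixelCurrent into the remainder of the process, generating a huge design that tries to get everything done in a single clock cycle.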
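To make the difference concrete, here is a minimal sketch that is not taken from the question: the entity sig_vs_var and all of its port and object names are made up for illustration. It pushes the same input through one intermediate signal and one intermediate variable inside the same clocked process.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical demo entity; every name here is illustrative only.
entity sig_vs_var is
    port (
        clk   : in  std_logic;
        d     : in  std_logic_vector(11 downto 0);
        q_sig : out std_logic_vector(11 downto 0);
        q_var : out std_logic_vector(11 downto 0)
    );
end entity sig_vs_var;

architecture rtl of sig_vs_var is
    signal stage_sig : std_logic_vector(11 downto 0) := (others => '0');
begin
    process (clk)
        variable stage_var : std_logic_vector(11 downto 0) := (others => '0');
    begin
        if rising_edge(clk) then
            -- Signal: the update is only scheduled here and takes effect
            -- when the process suspends.
            stage_sig <= d;
            -- This read still sees the value from the previous clock edge,
            -- so q_sig follows d two clock edges later (two registers).
            q_sig <= stage_sig;

            -- Variable: the update is visible immediately.
            stage_var := d;
            -- This read sees the value just assigned, so q_var follows d
            -- one clock edge later; stage_var itself needs no register.
            q_var <= stage_var;
        end if;
    end process;
end architecture rtl;
```

In simulation, q_var should change one clock edge before q_sig for the same input sequence, which is the extra pipeline stage that the signal provides and the variable, written before it is read, does not.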

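There are two solutions. My preferred one is to use signals for the pipeline registers, because you can then describe the pipeline in the most natural way (with the first stage first).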

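The second solution is to use variables as pipeline registers, which, ironically, you have already partly adopted:

pixelMinus2  := pixelMinus1;
pixelMinus1  := pixelCurrent;
pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

The idea is to describe the pipeline BACKWARDS, so that each variable is assigned only after the last use of its value.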

Simply move the three queue assignments (to pixelMinus2, pixelMinus1 and pixelCurrent) to after the big IF slv_reg0(3) = '0' THEN ... END IF; block, and your variable version should work.
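As a sketch of what that reordering might look like, here is a condensed version of the question's process wrapped in a self-contained entity. Several things are assumptions made for illustration only: the entity name lowpass_var, the filter_en port standing in for slv_reg0(3), the 10-bit address and 1024-word RAM, and the blur4 helper, which merely folds the three repeated sums of the question into one function. The write side of the RAM is left out for brevity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper; only clk25, addrb, dob and the pixel* names
-- come from the question.
entity lowpass_var is
    port (
        clk25     : in  std_logic;
        filter_en : in  std_logic;                     -- stands in for slv_reg0(3)
        addrb     : in  std_logic_vector(9 downto 0);  -- assumed address width
        dob       : out std_logic_vector(11 downto 0)
    );
end entity lowpass_var;

architecture rtl of lowpass_var is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal RAM : ram_t := (others => (others => '0'));  -- write port omitted

    -- One 4-bit channel: each term is shifted right by 2 before adding,
    -- as in the question; the middle pixel is counted twice.
    function blur4 (a, b, c : std_logic_vector(3 downto 0))
        return std_logic_vector is
        variable sum : unsigned(3 downto 0);
    begin
        sum := ("00" & unsigned(a(3 downto 2)))
             + ("00" & unsigned(b(3 downto 2)))
             + ("00" & unsigned(b(3 downto 2)))
             + ("00" & unsigned(c(3 downto 2)));
        return std_logic_vector(sum);
    end function blur4;
begin
    process (clk25)
        variable pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
        variable pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
        variable pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');
    begin
        if rising_edge(clk25) then
            -- 1. Use the values stored at earlier clock edges first.
            if filter_en = '0' then
                dob <= pixelCurrent;  -- bypass filter for debugging
            else
                dob <= blur4(pixelMinus2(11 downto 8), pixelMinus1(11 downto 8), pixelCurrent(11 downto 8))
                     & blur4(pixelMinus2(7 downto 4),  pixelMinus1(7 downto 4),  pixelCurrent(7 downto 4))
                     & blur4(pixelMinus2(3 downto 0),  pixelMinus1(3 downto 0),  pixelCurrent(3 downto 0));
            end if;

            -- 2. Only then shift the queue, oldest stage first, so every
            --    variable is read before it is overwritten.
            pixelMinus2  := pixelMinus1;
            pixelMinus1  := pixelCurrent;
            pixelCurrent := RAM(to_integer(unsigned(addrb)));
        end if;
    end process;
end architecture rtl;
```

Because each variable is now read before it is written within a clock edge, its value has to be remembered from one edge to the next, so it should be inferred as a register rather than as a combinational path from the RAM to dob.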

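Once you have confirmed that both approaches generate the same hardware, pick the one that you feel gives the simplest (easiest to understand) design.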

Answer 2

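When a variable is used for pixelCurrent in the process, its value is updated immediately and is available at once, whereas a signal does not have its new value ready until the next cycle.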

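So when a variable is used, this line implements a RAM with asynchronous read based on addrb:

pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

whereas an assignment to a signal would implement a RAM with synchronous read, where the value read from the RAM is not available until the next cycle.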

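Typical FPGA technology provides dedicated hardware for RAM with synchronous read, but RAM with asynchronous read is built from combinational logic (look-up tables, LUTs).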

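So the huge number of LUTs when pixelCurrent is a variable arises because the synthesis tool tries to map the RAM with asynchronous read into LUTs, which usually requires a very large number of them and makes the resulting RAM very slow.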

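In this pipelined design it sounds as though asynchronous RAM read is not required, so synchronous RAM can be used if pixelCurrent is a signal, and the synthesis tool will then map the RAM to an internal RAM hardware block, with code like: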

pixelMinus2  := pixelMinus1;
pixelMinus1  := pixelCurrent;
pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));
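For comparison, here is a hedged sketch of a synchronous-read RAM feeding the pixel queue. Unlike the snippet above, it makes all three queue stages signals, as the question describes doing, so that every read in the process sees the value from the previous clock edge regardless of statement order. The entity name pixel_queue_sig, the write port (wea, addra, dia), the tap outputs and the 1024-word depth are assumptions made for illustration; whether a particular tool actually infers block RAM from this depends on the tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: only clk25, addrb, RAM and the pixel* names
-- come from the thread.
entity pixel_queue_sig is
    port (
        clk25 : in  std_logic;
        wea   : in  std_logic;                       -- assumed write enable
        addra : in  std_logic_vector(9 downto 0);    -- assumed write address
        dia   : in  std_logic_vector(11 downto 0);   -- assumed write data
        addrb : in  std_logic_vector(9 downto 0);    -- read address
        tap0  : out std_logic_vector(11 downto 0);   -- pixel[k]
        tap1  : out std_logic_vector(11 downto 0);   -- pixel[k-1]
        tap2  : out std_logic_vector(11 downto 0)    -- pixel[k-2]
    );
end entity pixel_queue_sig;

architecture rtl of pixel_queue_sig is
    type ram_t is array (0 to 1023) of std_logic_vector(11 downto 0);
    signal RAM : ram_t := (others => (others => '0'));

    signal pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    signal pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');
begin
    process (clk25)
    begin
        if rising_edge(clk25) then
            if wea = '1' then
                RAM(to_integer(unsigned(addra))) <= dia;
            end if;

            -- Registered read: the word at addrb becomes visible on
            -- pixelCurrent only after this clock edge.
            pixelCurrent <= RAM(to_integer(unsigned(addrb)));

            -- With signals the order of these lines does not matter:
            -- each right-hand side is the value from before this edge.
            pixelMinus1 <= pixelCurrent;
            pixelMinus2 <= pixelMinus1;
        end if;
    end process;

    tap0 <= pixelCurrent;
    tap1 <= pixelMinus1;
    tap2 <= pixelMinus2;
end architecture rtl;
```

The filter arithmetic from the question could then read the three signals in the same clocked process or in a following pipeline stage.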

VHDL - variable vs. signal behavior in a queue
