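This post records how I approached wrapping a customer's PE unit in an AXI interface. Writing it up also helped me understand the AXI protocol and its verification methods more deeply.
I. Pre-design observations
The customer's PE unit implements the SPWV algorithm and exposes nine interfaces (a_ptr, a_col, a_val, b_ptr...). I was asked to wrap them in three AXI interfaces, one each for matrices A, B and C. This post covers the first one: the matrix A interface.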
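First, check which PE signals have to connect to my AXI interface. Seen from the wrapper, these are input a_ptr_addr, a_ptr_addr_v, a_col_addr... and output a_ptr_data, a_ptr_data_v, a_col_data... Studying the waveform files gave me these conclusions: addr_v is synchronous with addr, and data_v is synchronous with data; once the PE has issued addr_v on a channel, it will not issue another addr_v on that channel until it has received data_v; and the addresses on every channel start at 0 and increase by 1 each time. That last point simplifies the design a great deal.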
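The observed PE behaviour can be captured in a small behavioural model, which is handy later as a testbench stand-in. This is only a sketch of the protocol described above, not the customer's RTL: the module name `pe_chan_model` and its ports are hypothetical, and it assumes addr_v is a one-cycle pulse.

```verilog
// Hypothetical behavioural model of ONE PE read channel (not the customer's RTL).
// Assumption: addr_v is a one-cycle pulse and addr is valid in the same cycle.
module pe_chan_model #(
    parameter ADDR_WIDTH = 32,
    parameter DATA_WIDTH = 32
)(
    input  wire                  clk,
    input  wire                  rst_n,
    output reg  [ADDR_WIDTH-1:0] addr,
    output reg                   addr_v,
    input  wire [DATA_WIDTH-1:0] data,
    input  wire                  data_v
);
    reg waiting;  // 1 while a request is outstanding on this channel

    always @(posedge clk) begin
        if (!rst_n) begin
            addr    <= {ADDR_WIDTH{1'b0}};  // addresses start at 0
            addr_v  <= 1'b0;
            waiting <= 1'b0;
        end else begin
            addr_v <= 1'b0;                 // default: pulse lasts one cycle
            if (!waiting) begin
                addr_v  <= 1'b1;            // issue a request for the current address
                waiting <= 1'b1;
            end else if (data_v) begin
                addr    <= addr + 1'b1;     // next request uses address + 1
                waiting <= 1'b0;            // no new addr_v until data_v has arrived
            end
        end
    end
endmodule
```

Because `waiting` blocks a second request, the model never has more than one outstanding addr_v per channel, which matches the waveform observation. It also accepts a data_v that arrives in the same cycle in which addr_v is high.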
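II. Design approach
The customer's unit has to sit in an AXI system, so the interface must be AXI4, and data latency should be as low as possible. Given that the addresses only ever increment, prefetching is the obvious idea. Prefetching alone is not enough, though: data_v may only be raised in response to the PE's addr_v, so prefetched data has to be buffered until the PE asks for it. Combining that with the incrementing addresses, I added three FWFT FIFOs, one per channel. To get the most out of the bus I also use burst transfers.
The data flow therefore becomes HBM (U280) <-> AXI4 slave <-> AXI4 master <-> FIFO <-> PE CHANNEL A.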
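For readers who have not used a first-word-fall-through FIFO, here is a minimal sketch of one whose port names mirror the `sync_fifo` instances in the code listing. It is not the FIFO I simulated with: the module name `fwft_fifo_sketch`, the parameters and the inclusive almost_empty comparison are assumptions made for illustration.

```verilog
// Minimal synchronous FWFT FIFO sketch (illustrative, not the FIFO used in the post).
// The head word is always visible on dout; rd_en only pops it.
module fwft_fifo_sketch #(
    parameter DATA_WIDTH          = 32,
    parameter ADDR_WIDTH          = 10,               // DEPTH = 2**ADDR_WIDTH
    parameter DEPTH               = 1024,
    parameter ALMOST_EMPTY_THRESH = 270,
    parameter ALMOST_FULL_THRESH  = DEPTH - 2
)(
    input  wire                  clk,
    input  wire                  rst_n,
    input  wire [DATA_WIDTH-1:0] din,
    input  wire                  wr_en,
    output wire                  full,
    output wire                  almost_full,
    output wire [DATA_WIDTH-1:0] dout,
    input  wire                  rd_en,
    output wire                  empty,
    output wire                  almost_empty,
    output wire [ADDR_WIDTH:0]   data_cnt
);
    reg [DATA_WIDTH-1:0] mem [0:DEPTH-1];
    reg [ADDR_WIDTH:0]   wr_ptr, rd_ptr;  // extra MSB tells full from empty

    wire do_wr = wr_en && !full;          // writes to a full FIFO are dropped
    wire do_rd = rd_en && !empty;         // reads from an empty FIFO are ignored

    assign data_cnt     = wr_ptr - rd_ptr;
    assign empty        = (data_cnt == 0);
    assign full         = (data_cnt == DEPTH);
    assign almost_empty = (data_cnt <= ALMOST_EMPTY_THRESH);
    assign almost_full  = (data_cnt >= ALMOST_FULL_THRESH);

    // FWFT: combinational read of the head entry, no rd_en needed to see it.
    assign dout = mem[rd_ptr[ADDR_WIDTH-1:0]];

    always @(posedge clk) begin
        if (!rst_n) begin
            wr_ptr <= 0;
            rd_ptr <= 0;
        end else begin
            if (do_wr) begin
                mem[wr_ptr[ADDR_WIDTH-1:0]] <= din;
                wr_ptr <= wr_ptr + 1'b1;
            end
            if (do_rd)
                rd_ptr <= rd_ptr + 1'b1;
        end
    end
endmodule
```

The combinational read keeps the sketch short, but it usually maps to distributed RAM rather than block RAM. A vendor FIFO IP in FWFT mode would be the likely choice for a real build.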
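III. Implementation
1. Writing a complete AXI interface from scratch is not realistic; the best route is to adapt existing code. I had Vivado generate its AXI4 master template. The template is very verbose, and it takes a while to strip out the unused signals and the address-read logic, but its handshake logic must not be deleted or modified. The state machine is kept for now (it needs heavy rework later).
2. With the unused signals gone, instantiate the three FIFOs (for convenience I simulated with an AI-generated FWFT FIFO) and wire them up. The FIFOs are the PE's only source of read data, so they drive the PE's data and data_v signals. This step needs care: ARID and RID must correspond to the right channel. A real FIFO IP core will also have a rst_busy signal, which my first version does not handle. The FIFO wiring is in the code listing at the end.
3. Once the AXI template is reduced to its handshake signals and the FIFOs are wired, the next question is how to connect the PE's addr_v and addr, the AXI master and the FIFOs. The PE's addr and addr_v act as the FIFO read enable. The FIFO can be empty, however, so if addr_v arrives while the FIFO is empty, a read request is sent to the bus (implemented in the state machine). At this point the most basic interface works.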
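4. One thing to consider: the three channels share a single AXI interface, so all three may need the bus at the same time. There are two ways to handle this. (a) While one channel has a burst in flight, the others may not cut in; the interface only has to remember the other channels' requests (use pending registers to hold their read requests and read addresses until the current read finishes). (b) Record the read requests, but redesign the state machine so that it only performs the address-channel handshake and keeps sending read requests to the bus, tagging each one with an ARID and decoding the RID to route the returned data to the right FIFO. For speed I chose the second. The second is in fact the proper solution: the requests can complete out of order, and address and data arbitration is left to the AXI slave and the HBM side. Note that the first-version code listed at the end already tags requests with ARID and routes by RID, but its state machine still waits in BUSY until the current burst's RLAST before issuing the next request.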
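5. Back to the prefetching mentioned earlier. Issuing a read request only when empty && addr_v wastes time. Because the addresses increment, we know in advance the next address, the one after that, and so on all the way to address 0xffffffff. So a channel's read request is issued as soon as its almost_empty flag is 1 (threshold 270, FIFO depth 1024). This means that whenever the PE requests data, the matrix A data goes back with almost zero latency (for timing reasons it is still registered for one cycle on the way back).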
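Steps 4 and 5 can be sketched as a request issuer that only handles the AR handshake, allows one outstanding burst per channel, and retires a channel when RLAST returns with its RID. This is an illustration of the approach described above, not the code in my listing: the module name `ar_issuer_sketch`, the packed `want` and `base_addr` ports and the `inflight` flags are all hypothetical. The `initial` block checks two numbers: a refill starts at or below 270 words and adds 256, so a FIFO peaks at 270 + 256 = 526 of 1024 entries; and a burst is 256 * 4 = 1024 bytes, so with 1024-byte-aligned base addresses it never crosses the 4 KB boundary that AXI forbids a burst to cross.

```verilog
// Sketch only: names, ports and parameters are hypothetical.
module ar_issuer_sketch #(
    parameter integer ADDR_WIDTH = 32,
    parameter integer ID_WIDTH   = 2,
    parameter integer BURST_LEN  = 256,   // beats per burst
    parameter integer BEAT_BYTES = 4,     // 32-bit data bus
    parameter integer FIFO_DEPTH = 1024,
    parameter integer AE_THRESH  = 270    // almost_empty threshold
)(
    input  wire                    clk,
    input  wire                    rst_n,
    input  wire [2:0]              want,       // almost_empty flags: {val, col, ptr}
    input  wire [3*ADDR_WIDTH-1:0] base_addr,  // {val_base, col_base, ptr_base}
    output reg  [ID_WIDTH-1:0]     arid,
    output reg  [ADDR_WIDTH-1:0]   araddr,
    output reg                     arvalid,
    input  wire                    arready,
    input  wire [ID_WIDTH-1:0]     rid,
    input  wire                    rvalid,
    input  wire                    rready,
    input  wire                    rlast
);
    localparam integer BURST_BYTES = BURST_LEN * BEAT_BYTES;

    reg [2:0]            inflight;      // one outstanding burst per channel
    reg [ADDR_WIDTH-1:0] offset [0:2];  // next byte offset per channel

    wire [2:0] cand = want & ~inflight;                       // channels allowed to request
    wire [1:0] ch   = cand[0] ? 2'd0 : cand[1] ? 2'd1 : 2'd2; // fixed priority: ptr, col, val
    integer i;

    always @(posedge clk) begin
        if (!rst_n) begin
            arvalid  <= 1'b0;
            arid     <= {ID_WIDTH{1'b0}};
            araddr   <= {ADDR_WIDTH{1'b0}};
            inflight <= 3'b000;
            for (i = 0; i < 3; i = i + 1)
                offset[i] <= {ADDR_WIDTH{1'b0}};
        end else begin
            // Retire a burst when its last beat is accepted; RID names the channel.
            if (rvalid && rready && rlast && (rid < 3))
                inflight[rid] <= 1'b0;

            if (arvalid && arready) begin
                arvalid <= 1'b0;                 // request accepted
            end else if (!arvalid && (cand != 3'b000)) begin
                arid         <= ch;              // tag the request with its channel
                araddr       <= base_addr[ch*ADDR_WIDTH +: ADDR_WIDTH] + offset[ch];
                offset[ch]   <= offset[ch] + BURST_BYTES;
                inflight[ch] <= 1'b1;
                arvalid      <= 1'b1;            // held until ARREADY, as AXI requires
            end
        end
    end

    // Simulation-time sanity checks on the prefetch budget.
    initial begin
        if (AE_THRESH + BURST_LEN > FIFO_DEPTH)
            $display("WARNING: a refill issued at the threshold could overflow the FIFO");
        if (4096 % BURST_BYTES != 0)
            $display("WARNING: aligned bursts could still cross a 4 KB boundary");
    end
endmodule
```

Unlike a state machine that waits for RLAST, this issuer can have bursts for all three channels outstanding at once, and the per-channel `inflight` flag is what keeps the 526-entry bound valid.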
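IV. Verification
Verification uses the AXI VIP. Be sure to download and import the AXI VIP package, otherwise you will get errors; I recommend following the AXI VIP setup tutorial on the AMD website. (I use the .v version here, so no block design is needed.)
After that, have AI generate a testbench that mimics the customer PE's inputs and outputs, while the AXI VIP acts as the memory that holds the data and checks for protocol violations (signals such as cache, region and qos are almost never needed and can be removed at first).
V. Stress test
Drive all three channels at high frequency in the same cycle and see whether anything goes wrong. Have AI generate such a testbench, then inspect the waveforms and wait for the Tcl console to print pass.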
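One way to make the stress test self-checking is to rely on the fact that every channel reads its array in order from index 0. The checker below is a sketch under a strong assumption of mine: the testbench has preloaded memory so that word i of a channel's array holds {CH_ID, i}. The module name `seq_data_checker` and that data pattern are hypothetical.

```verilog
// Hypothetical per-channel checker for the stress test.
// Assumption: memory was preloaded so that word i of this channel's array
// equals {CH_ID, i}. Any mismatch means lost, duplicated or misrouted data.
module seq_data_checker #(
    parameter       DATA_WIDTH = 32,
    parameter [1:0] CH_ID      = 2'd0
)(
    input  wire                  clk,
    input  wire                  rst_n,
    input  wire [DATA_WIDTH-1:0] data,
    input  wire                  data_v,
    output reg  [31:0]           err_cnt
);
    reg [DATA_WIDTH-3:0] expect_idx;  // index of the next word the PE should receive

    always @(posedge clk) begin
        if (!rst_n) begin
            expect_idx <= 0;
            err_cnt    <= 32'd0;
        end else if (data_v) begin
            if (data !== {CH_ID, expect_idx}) begin
                err_cnt <= err_cnt + 1'b1;
                $display("%0t: channel %0d expected %h, got %h",
                         $time, CH_ID, {CH_ID, expect_idx}, data);
            end
            expect_idx <= expect_idx + 1'b1;
        end
    end
endmodule
```

Instantiating one checker per channel, each with a different CH_ID, catches RID routing mistakes directly, because a word delivered to the wrong FIFO carries the wrong channel tag. The testbench can print pass once all three err_cnt values are still zero at the end of the run.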
VI. Code listing (source only; the testbench took no real thought, so just have AI write one)
`timescale 1 ns / 1 ps

module pe_system_v1_0_m00_axi #
(
    // Users to add parameters here
    // User parameters ends
    // Do not modify the parameters beyond this line

    // Base address of targeted slave
    parameter C_M_TARGET_SLAVE_BASE_ADDR = 32'h40000000,
    // Burst Length. Supports 1, 2, 4, 8, 16, 32, 64, 128, 256 burst lengths
    parameter integer C_M_AXI_BURST_LEN = 256,
    // Thread ID Width
    parameter integer C_M_AXI_ID_WIDTH = 2,
    // Width of Address Bus
    parameter integer C_M_AXI_ADDR_WIDTH = 32,
    // Width of Data Bus
    parameter integer C_M_AXI_DATA_WIDTH = 32,
    // Width of User Write Address Bus
    parameter integer C_M_AXI_AWUSER_WIDTH = 0,
    // Width of User Read Address Bus
    parameter integer C_M_AXI_ARUSER_WIDTH = 0,
    // Width of User Write Data Bus
    parameter integer C_M_AXI_WUSER_WIDTH = 0,
    // Width of User Read Data Bus
    parameter integer C_M_AXI_RUSER_WIDTH = 0,
    // Width of User Response Bus
    parameter integer C_M_AXI_BUSER_WIDTH = 0
)
(
    // Users to add ports here
    // This AXI interface only supports reading matrix A
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_ptr_addr,      // input readable signals
    input  wire                          a_ptr_addr_v,
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_col_addr,
    input  wire                          a_col_addr_v,
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_val_addr,
    input  wire                          a_val_addr_v,
    output wire [C_M_AXI_DATA_WIDTH-1:0] a_ptr_data,
    output wire                          a_ptr_data_v,
    output wire [C_M_AXI_DATA_WIDTH-1:0] a_col_data,
    output wire                          a_col_data_v,
    output wire [C_M_AXI_DATA_WIDTH-1:0] a_val_data,
    output wire                          a_val_data_v,
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_ptr_base_addr,
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_col_base_addr,
    input  wire [C_M_AXI_ADDR_WIDTH-1:0] a_val_base_addr,
    // User ports ends
    // Do not modify the ports beyond this line

    input  wire INIT_AXI_TXN,
    output wire TXN_DONE,    // left undriven in this first version
    output reg  ERROR,
    input  wire M_AXI_ACLK,
    input  wire M_AXI_ARESETN,
    output wire [C_M_AXI_ID_WIDTH-1 : 0]     M_AXI_ARID,
    output wire [C_M_AXI_ADDR_WIDTH-1 : 0]   M_AXI_ARADDR,
    output wire [7 : 0]                      M_AXI_ARLEN,
    output wire [2 : 0]                      M_AXI_ARSIZE,
    output wire [1 : 0]                      M_AXI_ARBURST,
    output wire                              M_AXI_ARLOCK,
    output wire [3 : 0]                      M_AXI_ARCACHE,
    output wire [2 : 0]                      M_AXI_ARPROT,
    output wire [3 : 0]                      M_AXI_ARQOS,
    output wire [C_M_AXI_ARUSER_WIDTH-1 : 0] M_AXI_ARUSER,
    output wire                              M_AXI_ARVALID,
    input  wire                              M_AXI_ARREADY,
    input  wire [C_M_AXI_ID_WIDTH-1 : 0]     M_AXI_RID,
    input  wire [C_M_AXI_DATA_WIDTH-1 : 0]   M_AXI_RDATA,
    input  wire [1 : 0]                      M_AXI_RRESP,
    input  wire                              M_AXI_RLAST,
    input  wire [C_M_AXI_RUSER_WIDTH-1 : 0]  M_AXI_RUSER,
    input  wire                              M_AXI_RVALID,
    output wire                              M_AXI_RREADY
);

    // Returns the number of bits needed to represent bit_depth
    function integer clogb2 (input integer bit_depth);
    begin
        for (clogb2 = 0; bit_depth > 0; clogb2 = clogb2 + 1)
            bit_depth = bit_depth >> 1;
    end
    endfunction

    // C_TRANSACTIONS_NUM is the width of the index counter for
    // the number of beats in a read burst.
    localparam integer C_TRANSACTIONS_NUM = clogb2(C_M_AXI_BURST_LEN-1);
    localparam integer C_MASTER_LENGTH    = 12;
    localparam integer C_NO_BURSTS_REQ    = C_MASTER_LENGTH-clogb2((C_M_AXI_BURST_LEN*C_M_AXI_DATA_WIDTH/8)-1);

    // State machine states
    parameter [1:0] IDLE = 2'b00,   // waiting for a FIFO to ask for a refill
                    BUSY = 2'b01;   // one read burst is in flight

    reg [1:0] mst_exec_state;

    // AXI4 internal temp signals
    reg [C_M_AXI_ADDR_WIDTH-1 : 0] axi_araddr;
    reg axi_arvalid;
    reg axi_rready;
    // read beat count in a burst
    reg [C_TRANSACTIONS_NUM : 0] read_index;
    // size of C_M_AXI_BURST_LEN length burst in bytes
    wire [C_TRANSACTIONS_NUM+2 : 0] burst_size_bytes;
    // The burst counters are used to track the number of burst transfers of
    // C_M_AXI_BURST_LEN burst length needed to transfer 2^C_MASTER_LENGTH bytes of data.
    reg [C_NO_BURSTS_REQ : 0] read_burst_counter;
    reg start_single_burst_read;
    reg reads_done;
    reg error_reg;
    reg burst_read_active;
    // Interface response error flags
    wire read_resp_error;
    wire rnext;
    reg init_txn_ff;
    reg init_txn_ff2;
    reg init_txn_edge;
    wire init_txn_pulse;

    // FIFO status signals and the current read ID.
    // Declared here, before their first use in the state machine.
    wire a_ptr_empty, a_col_empty, a_val_empty;
    wire a_ptr_almost_empty, a_col_almost_empty, a_val_almost_empty;
    reg  a_ptr_fifo_re_reg, a_col_fifo_re_reg, a_val_fifo_re_reg;
    reg  [C_M_AXI_ADDR_WIDTH-1 : 0] a_ptr_addr_reg, a_col_addr_reg, a_val_addr_reg;
    wire a_ptr_fifo_re, a_col_fifo_re, a_val_fifo_re;
    reg  [C_M_AXI_ID_WIDTH-1 : 0] current_arid;

    // I/O Connections assignments

    // Read Address (AR)
    //assign M_AXI_ARID = 'b0;
    assign M_AXI_ARADDR = axi_araddr;
    // Burst LENgth is number of transaction beats, minus 1
    assign M_AXI_ARLEN = C_M_AXI_BURST_LEN - 1;
    // Size should be C_M_AXI_DATA_WIDTH, in 2^n bytes, otherwise narrow bursts are used
    assign M_AXI_ARSIZE = clogb2((C_M_AXI_DATA_WIDTH/8)-1);
    // INCR burst type is usually used, except for keyhole bursts
    assign M_AXI_ARBURST = 2'b01;
    assign M_AXI_ARLOCK  = 1'b0;
    // Update value to 4'b0011 if coherent accesses to be used via the Zynq ACP port.
    // Not Allocated, Modifiable, not Bufferable.
    assign M_AXI_ARCACHE = 4'b0010;
    assign M_AXI_ARPROT  = 3'h0;
    assign M_AXI_ARQOS   = 4'h0;
    assign M_AXI_ARUSER  = 'b1;
    assign M_AXI_ARVALID = axi_arvalid;
    // Read and Read Response (R)
    assign M_AXI_RREADY = axi_rready;
    // Example design I/O
    //assign TXN_DONE = compare_done;
    // Burst size in bytes
    assign burst_size_bytes = C_M_AXI_BURST_LEN * C_M_AXI_DATA_WIDTH/8;
    // The template's start pulse is tied off; refills are triggered by the FIFOs.
    assign init_txn_pulse = 1'b0;

    // Synchronize INIT_AXI_TXN (kept from the template)
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0)
        begin
            init_txn_ff  <= 1'b0;
            init_txn_ff2 <= 1'b0;
        end
        else
        begin
            init_txn_ff  <= INIT_AXI_TXN;
            init_txn_ff2 <= init_txn_ff;
        end
    end

    //----------------------------
    // Read Address Channel
    //----------------------------
    // The Read Address Channel (AR) provides the transfer qualifiers for the burst.
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
        begin
            axi_arvalid <= 1'b0;
        end
        // If previously not valid, start next transaction
        else if (~axi_arvalid && start_single_burst_read)
        begin
            axi_arvalid <= 1'b1;
        end
        else if (M_AXI_ARREADY && axi_arvalid)
        begin
            axi_arvalid <= 1'b0;
        end
        else
            axi_arvalid <= axi_arvalid;
    end

    // The template's address increment is disabled; axi_araddr is set by the
    // state machine below.
    // Next address after ARREADY indicates previous address acceptance
    // always @(posedge M_AXI_ACLK)
    // begin
    //     if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
    //     begin
    //         axi_araddr <= 'b0;
    //     end
    //     else if (M_AXI_ARREADY && axi_arvalid)
    //     begin
    //         axi_araddr <= axi_araddr + burst_size_bytes;
    //     end
    //     else
    //         axi_araddr <= axi_araddr;
    // end

    //--------------------------------
    // Read Data (and Response) Channel
    //--------------------------------
    // Forward movement occurs when the channel is valid and ready
    assign rnext = M_AXI_RVALID && axi_rready;

    // Burst length counter. Uses extra counter register bit to indicate
    // terminal count to reduce decode logic
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1 || start_single_burst_read)
        begin
            read_index <= 0;
        end
        else if (rnext && (read_index != C_M_AXI_BURST_LEN-1))
        begin
            read_index <= read_index + 1;
        end
        else
            read_index <= read_index;
    end

    // The Read Data channel returns the results of the read request.
    // The FIFOs are assumed able to accept the data, so RREADY is not throttled.
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
        begin
            axi_rready <= 1'b0;
        end
        // accept/acknowledge rdata/rresp with axi_rready by the master
        // when M_AXI_RVALID is asserted by slave
        else if (M_AXI_RVALID)
        begin
            if (M_AXI_RLAST && axi_rready)
            begin
                axi_rready <= 1'b0;
            end
            else
            begin
                axi_rready <= 1'b1;
            end
        end
        // retain the previous value
    end

    // Flag any read response errors
    assign read_resp_error = axi_rready & M_AXI_RVALID & M_AXI_RRESP[1];

    //----------------------------------
    // Example design error register
    //----------------------------------
    // Register and hold any read interface errors
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
        begin
            error_reg <= 1'b0;
        end
        else if (read_resp_error)
        begin
            error_reg <= 1'b1;
        end
        else
            error_reg <= error_reg;
    end

    // implement master command interface state machine

    // 1. Three byte-offset registers, one per channel
    reg [C_M_AXI_ADDR_WIDTH-1:0] ptr_offset, col_offset, val_offset;

    // 2. The AR handshake advances the offset (4 bytes per beat on the 32-bit bus)
    // Ptr channel
    always @(posedge M_AXI_ACLK) begin
        if (!M_AXI_ARESETN) begin
            ptr_offset <= 0;
        end else if (M_AXI_ARVALID && M_AXI_ARREADY && (M_AXI_ARID == 2'b00)) begin
            // Move to the next block only once the Ptr address request has been accepted
            ptr_offset <= ptr_offset + (C_M_AXI_BURST_LEN * 4);
        end
    end

    // Col channel
    always @(posedge M_AXI_ACLK) begin
        if (!M_AXI_ARESETN) begin
            col_offset <= 0;
        end else if (M_AXI_ARVALID && M_AXI_ARREADY && (M_AXI_ARID == 2'b01)) begin
            col_offset <= col_offset + (C_M_AXI_BURST_LEN * 4);
        end
    end

    // Val channel
    always @(posedge M_AXI_ACLK) begin
        if (!M_AXI_ARESETN) begin
            val_offset <= 0;
        end else if (M_AXI_ARVALID && M_AXI_ARREADY && (M_AXI_ARID == 2'b10)) begin
            val_offset <= val_offset + (C_M_AXI_BURST_LEN * 4);
        end
    end

    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 1'b0)
        begin
            mst_exec_state          <= IDLE;
            start_single_burst_read <= 1'b0;
            ERROR                   <= 1'b0;
            axi_araddr              <= 'b0;
            current_arid            <= 2'b00;
        end
        else
        begin
            case (mst_exec_state)
                IDLE: begin
                    if (!burst_read_active && !start_single_burst_read) begin
                        // Fixed priority: ptr, then col, then val
                        if (a_ptr_almost_empty) begin
                            current_arid            <= 2'b00;
                            axi_araddr              <= a_ptr_base_addr + ptr_offset; // use the current offset directly
                            start_single_burst_read <= 1'b1;
                            mst_exec_state          <= BUSY;
                        end
                        else if (a_col_almost_empty) begin
                            current_arid            <= 2'b01;
                            axi_araddr              <= a_col_base_addr + col_offset;
                            start_single_burst_read <= 1'b1;
                            mst_exec_state          <= BUSY;
                        end
                        else if (a_val_almost_empty) begin
                            current_arid            <= 2'b10;
                            axi_araddr              <= a_val_base_addr + val_offset;
                            start_single_burst_read <= 1'b1;
                            mst_exec_state          <= BUSY;
                        end
                    end
                end
                BUSY: begin
                    start_single_burst_read <= 1'b0; // drop the pulse as soon as BUSY is entered
                    if (!burst_read_active) begin
                        mst_exec_state <= IDLE;
                    end
                end
                default: begin
                    mst_exec_state <= IDLE;
                end
            endcase
        end
    end // MASTER_EXECUTION_PROC

    // burst_read_active is asserted when a read burst is started by
    // start_single_burst_read and stays asserted until the last beat
    // of the burst has been accepted by the master.
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
            burst_read_active <= 1'b0;
        else if (start_single_burst_read)
            burst_read_active <= 1'b1;
        else if (M_AXI_RVALID && axi_rready && M_AXI_RLAST)
            burst_read_active <= 0;
    end

    // Check for last read completion.
    // This logic qualifies the last read count with the final read response.
    always @(posedge M_AXI_ACLK)
    begin
        if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)
            reads_done <= 1'b0;
        // reads_done should be associated with a rready response
        else if (M_AXI_RVALID && axi_rready && (read_index == C_M_AXI_BURST_LEN-1))
            reads_done <= 1'b1;
        else
            reads_done <= reads_done;
    end

    // Add user logic here

    // The re signal only has to guarantee that data is actually delivered.
    // The address-valid signal must be registered at start-up, because the PE
    // can request data before the FIFOs hold any. Two options:
    //   1. Delay the PE's reset release until all three FIFOs in m00 hold data (not recommended).
    //   2. Register the address-valid signal, at the cost of some extra logic (used here).
    always @(posedge M_AXI_ACLK) begin
        if (!M_AXI_ARESETN) begin
            a_ptr_fifo_re_reg <= 1'b0;
            a_col_fifo_re_reg <= 1'b0;
            a_val_fifo_re_reg <= 1'b0;
        end else begin
            if (a_ptr_empty & a_ptr_addr_v) begin
                a_ptr_fifo_re_reg <= 1'b1;   // held until re goes high, then cleared
            end
            else if (a_ptr_fifo_re) begin
                a_ptr_fifo_re_reg <= 1'b0;
            end

            if (a_col_empty & a_col_addr_v) begin
                a_col_fifo_re_reg <= 1'b1;   // held until re goes high, then cleared
            end
            else if (a_col_fifo_re) begin
                a_col_fifo_re_reg <= 1'b0;
            end

            if (a_val_empty & a_val_addr_v) begin
                a_val_fifo_re_reg <= 1'b1;   // held until re goes high, then cleared
            end
            else if (a_val_fifo_re) begin
                a_val_fifo_re_reg <= 1'b0;
            end
        end
    end

    // Original form:
    //assign a_ptr_fifo_re = (a_ptr_empty & a_ptr_addr_v)?1'b0:(!a_ptr_empty & a_ptr_fifo_re_reg)?1'b1:a_ptr_addr_v?1'b1:1'b0;
    // Simplified form of the expression above:
    assign a_ptr_fifo_re = (!a_ptr_empty) && (a_ptr_addr_v || a_ptr_fifo_re_reg);
    assign a_col_fifo_re = (!a_col_empty) && (a_col_addr_v || a_col_fifo_re_reg);
    assign a_val_fifo_re = (!a_val_empty) && (a_val_addr_v || a_val_fifo_re_reg);

    // The FIFOs must be FWFT here, otherwise the simulation fails.
    assign a_ptr_data_v = ~a_ptr_empty & a_ptr_fifo_re;
    assign a_col_data_v = ~a_col_empty & a_col_fifo_re;
    assign a_val_data_v = ~a_val_empty & a_val_fifo_re;

    // Returned read data is routed to a FIFO according to RID.
    sync_fifo a_ptr_fifo (
        .clk          (M_AXI_ACLK),
        .rst_n        (M_AXI_ARESETN),
        .din          (M_AXI_RDATA),
        .wr_en        (M_AXI_RVALID & M_AXI_RREADY && (M_AXI_RID == 0)),
        .full         (),
        .almost_full  (),
        .dout         (a_ptr_data),
        .rd_en        (a_ptr_fifo_re),   // maybe wrong
        .empty        (a_ptr_empty),
        .almost_empty (a_ptr_almost_empty),
        .data_cnt     ()
    );

    sync_fifo a_col_fifo (
        .clk          (M_AXI_ACLK),
        .rst_n        (M_AXI_ARESETN),
        .din          (M_AXI_RDATA),
        .wr_en        (M_AXI_RVALID & M_AXI_RREADY && (M_AXI_RID == 1)),
        .full         (),
        .almost_full  (),
        .dout         (a_col_data),
        .rd_en        (a_col_fifo_re),
        .empty        (a_col_empty),
        .almost_empty (a_col_almost_empty),
        .data_cnt     ()
    );

    sync_fifo a_val_fifo (
        .clk          (M_AXI_ACLK),
        .rst_n        (M_AXI_ARESETN),
        .din          (M_AXI_RDATA),
        .wr_en        (M_AXI_RVALID & M_AXI_RREADY && (M_AXI_RID == 'd2)),
        .full         (),
        .almost_full  (),
        .dout         (a_val_data),
        .rd_en        (a_val_fifo_re),
        .empty        (a_val_empty),
        .almost_empty (a_val_almost_empty),
        .data_cnt     ()
    );

    assign M_AXI_ARID = current_arid;

    // User logic ends

endmodule
