FX: Sum of several multiplications?
So far I can only multiply two 16-bit values by setting the 32-bit cache directly and then having VERA write the result to VRAM, where I fetch it with the CPU.
The official documentation only mentions the "accumulator" as some kind of reset operation. I need to add two multiplication results, as in (X*X)+(Y*Y), but there are no examples beyond the single multiplication I have managed to do. How can one multiplication turn into several? How can their results be added before being written to VRAM?
Re: FX: Sum of several multiplications?
I hope I get this right; this is my interpretation of the Verilog code for VERA:
1. Reset the accumulator (read from $9F29, DCSEL=6; or write bit 7 of $9F2C, DCSEL=2)
2. Perform the first multiply
3. Add that product to the accumulator (read from $9F2A, DCSEL=6; or write bit 6 of $9F2C, DCSEL=2)
4. Perform your second multiply
5. Add that product to the accumulator (read from $9F2A, DCSEL=6; or write bit 6 of $9F2C, DCSEL=2)
The accumulator should now hold the sum of your two products.
The documentation definitely needs example code for this.
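As a sanity check of these steps, here is a hypothetical C model of the arithmetic only (not VERA code). The names `fx_state`, `fx_load_cache`, `fx_reset_accum` and `fx_accumulate` are made up, and the model assumes the multiplier treats cache bytes 0-1 and 2-3 as two signed 16-bit operands with a 32-bit product, and that the values are small enough for the sum to fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical model of the FX multiply/accumulate arithmetic (not real VERA code). */
typedef struct {
    int16_t a, b;   /* cache bytes 0-1 and 2-3, read as signed 16-bit operands (assumed) */
    int32_t accum;  /* 32-bit accumulator */
} fx_state;

/* Product of the two cache operands, computed in 32 bits. */
static int32_t fx_product(const fx_state *fx) {
    return (int32_t)fx->a * (int32_t)fx->b;
}

/* Step 1: reset the accumulator to zero. */
static void fx_reset_accum(fx_state *fx) { fx->accum = 0; }

/* Steps 2 and 4: "perform a multiply" means loading new operands into the cache. */
static void fx_load_cache(fx_state *fx, int16_t a, int16_t b) {
    fx->a = a;
    fx->b = b;
}

/* Steps 3 and 5: add the current product to the accumulator
   (assumes the running sum stays within 32 bits). */
static void fx_accumulate(fx_state *fx) { fx->accum += fx_product(fx); }

/* Runs steps 1-5 for (x0*x1) + (y0*y1) and returns the accumulator. */
static int32_t fx_sum_of_two_products(int16_t x0, int16_t x1, int16_t y0, int16_t y1) {
    fx_state fx = {0, 0, 0};
    fx_reset_accum(&fx);        /* 1 */
    fx_load_cache(&fx, x0, x1); /* 2 */
    fx_accumulate(&fx);         /* 3 */
    fx_load_cache(&fx, y0, y1); /* 4 */
    fx_accumulate(&fx);         /* 5 */
    return fx.accum;
}
```

For example, `fx_sum_of_two_products(3, 4, 5, 6)` returns 42.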
Re: FX: Sum of several multiplications?
I checked as well, and I agree with Wavicle's interpretation. I also found some points to add.
1) The 32-bit cache is always fed to the multiplier. Any change to the cache (including changes due to Cache Fill Enable mode) is immediately reflected in the multiplier's output.
2) While Multiplier Enable is 1, anything that uses the 32-bit cache as "output" (such as Cache Write Enable) will instead receive the value (Multiplier_Output +/- Accumulator_Output).
3) When you "Accumulate", the new value of Accumulator_Output will be (Multiplier_Output +/- Accumulator_Output).
4) When you "Reset Accumulator", the new value of Accumulator_Output will be 0. Because of point (2), it's always a good idea to reset the accumulator at least once in your program before you use the multiplier even if you don't plan to use the accumulator, or else you may receive unexpected results.
5) The combination of points (2) and (3) means that you don't need to accumulate after your final multiplication; just output the result.
6) For the "+/-" mentioned above, Subtract Enable = 0 gives "+" and 1 gives "-". I believe this change takes effect immediately (I haven't tested it).
7) Accumulate and Reset Accumulator are triggered as a one-shot. That is, they trigger once with each read of $9F2A[dcsel=6] and $9F29[dcsel=6] respectively, or each time you write a '1' to the respective bits of $9F2C[dcsel=2] (you don't need to reset those bits to '0' in between).
And now a possibly undocumented feature, assuming I'm reading the schematic correctly: when you Accumulate with Multiplier Enable set to 0, the accumulator is initialized to the current contents of the 32-bit cache.
The FX multiplier/accumulator is a DSP block on the FPGA itself, rather than an implementation in Verilog, so it's tricky to know exactly how it works unless you have the documentation for the FPGA too.
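Points (1) to (5) and the Multiplier Enable = 0 observation can be sketched as a hypothetical C model. Everything here is an assumption built from the text above, not from VERA's source: the names are made up, the cache's low 16 bits (bytes 0-1) and high 16 bits (bytes 2-3) are taken as the two signed operands, the accumulator is a 32-bit value that wraps, and Subtract Enable (point 6) is left out to keep it short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the behaviour described above (not official behaviour). */
typedef struct {
    uint32_t cache;   /* 32-bit cache: low half = operand A, high half = operand B (assumed) */
    uint32_t accum;   /* Accumulator_Output, wraps modulo 2^32 */
    bool mult_enable; /* Multiplier Enable */
} fx_mac;

/* Interprets the low 16 bits of v as a signed value. */
static int32_t sign16(uint32_t v) {
    int32_t x = (int32_t)(v & 0xFFFFu);
    return x >= 0x8000 ? x - 0x10000 : x;
}

/* Point 1: the multiplier always sees the current cache contents. */
static uint32_t fx_mult_output(const fx_mac *m) {
    return (uint32_t)(sign16(m->cache) * sign16(m->cache >> 16));
}

/* Point 2: what a cache write to VRAM receives. */
static uint32_t fx_cache_output(const fx_mac *m) {
    return m->mult_enable ? fx_mult_output(m) + m->accum : m->cache;
}

/* Point 4: Reset Accumulator sets the accumulator to 0. */
static void fx_reset_accum(fx_mac *m) { m->accum = 0; }

/* Point 3 plus the last observation: with the multiplier enabled the
   accumulator becomes product + accumulator; with it disabled it is
   loaded with the raw cache. Both equal what a cache write would output. */
static void fx_accumulate(fx_mac *m) { m->accum = fx_cache_output(m); }

/* Packs two signed 16-bit operands into the 32-bit cache layout. */
static uint32_t fx_pack(int16_t a, int16_t b) {
    return (uint32_t)(uint16_t)a | ((uint32_t)(uint16_t)b << 16);
}
```

In this model Accumulate simply latches whatever a cache write would output at that moment, which is one compact way to read points (2) and (3) together with the Multiplier Enable = 0 observation; whether the hardware is literally built that way is not established. It also shows point (4): a stale accumulator value leaks into every output until it is reset.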
Re: FX: Sum of several multiplications?
Thank you Wavicle and DragWx! I now have a working dot product subroutine. It takes two vectors in 1.7.8 fixed-point format and returns a scalar in 1.7.8 fixed point, in 161 cycles (not counting the call overhead or setting up the values):
ZP_MUL_X0 = $22 ; Zero Page
ZP_MUL_X1 = $24
ZP_MUL_Y0 = $26
ZP_MUL_Y1 = $28
VRAM_MULT_X = $1F9B8 ; Must be 4-byte aligned in VRAM!
; A test call to the subroutine below...
jsr _init_mul_once
lda #>$500 ; vector0 x = 5.0
sta ZP_MUL_X0+1
lda #>$500 ; vector1 x = 5.0
sta ZP_MUL_X1+1
lda #>$C00 ; vector0 y = 12.0
sta ZP_MUL_Y0+1
lda #>$100 ; vector1 y = 1.0
sta ZP_MUL_Y1+1
stz ZP_MUL_X0 ; Fractions stay zero
stz ZP_MUL_X1
stz ZP_MUL_Y0
stz ZP_MUL_Y1
jsr __dot
; ... check the result in ZP_MUL_X0 and ZP_MUL_X0+1; it should be 37.0 ($2500)
;
; Computes the dot product ZP_MUL_X0*ZP_MUL_X1 + ZP_MUL_Y0*ZP_MUL_Y1 and returns the result >> 8 (1.7.8 fixed point) in ZP_MUL_X0 and ZP_MUL_X0+1:
;
__dot:
stz $9F25 ; ADDRSEL 0
lda #<VRAM_MULT_X
sta $9F20 ; Start at VRAM_MULT_X
lda #>VRAM_MULT_X
sta $9F21
lda #^VRAM_MULT_X ; No increment (overwrite the first product with the second in VRAM)
sta $9F22
; Access cache directly
lda #(6<<1) ; DCSEL=6 ADDRSEL=0
sta $9F25
; Zero the accumulator
lda $9F29 ; DCSEL=6: a read of $9F29 resets the accumulator
; First multiply
lda ZP_MUL_X0
ldx ZP_MUL_X0+1
ldy ZP_MUL_X1
sta $9F29 ; Cache byte 0 = X0 low
stx $9F2A ; Cache byte 1 = X0 high
sty $9F2B ; Cache byte 2 = X1 low
lda ZP_MUL_X1+1
sta $9F2C ; Cache byte 3 = X1 high
lda #(2<<1) ; DCSEL=2 ADDRSEL=0
sta $9F25
lda #%01000000 ; Cache Write Enable (writes cache to any of the address ports) Addr1=0
sta $9F29 ;FX_CTRL
stz $9F23 ; Write X0*X1 (plus the accumulator, currently 0) as 4 bytes to VRAM_MULT_X
; Access cache directly
lda #(6<<1) ; DCSEL=6 ADDRSEL=0
sta $9F25
; Add product to the accumulator
lda $9F2A
; Second multiply
lda ZP_MUL_Y0
ldx ZP_MUL_Y0+1
ldy ZP_MUL_Y1
sta $9F29 ; Cache byte 0 = Y0 low
stx $9F2A ; Cache byte 1 = Y0 high
sty $9F2B ; Cache byte 2 = Y1 low
lda ZP_MUL_Y1+1
sta $9F2C ; Cache byte 3 = Y1 high
lda #(2<<1) ; DCSEL=2 ADDRSEL=0
sta $9F25
stz $9F23 ; Write Y0*Y1 + accumulator (= X0*X1 + Y0*Y1) as 4 bytes to VRAM_MULT_X
stz $9F29 ;FX_CTRL ; No more cache writes (normal access to VRAM)
stz $9F25 ; DCSEL 0 (End of multiply FX)
lda #<(VRAM_MULT_X+1) ; Return the dot product as a 1.7.8 fixed point value!
sta $9F20 ; Read from VRAM_MULT_X+1 to skip the low byte (>> 8); $9F21 keeps the same middle byte
lda #($10|^VRAM_MULT_X) ; Increment 1
sta $9F22
lda $9F23
sta ZP_MUL_X0
lda $9F23
sta ZP_MUL_X0+1
rts
; Once on init, enable multiplier:
_init_mul_once:
lda #(2<<1) ; DCSEL 2
sta $9F25
lda #%00010000 ; Multiplier Enable (cache writes to VRAM then output the multiplier result instead of the raw cache contents)
sta $9F2C
stz $9F29 ; FX_CTRL
rts
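For checking results on the PC side, here is a hypothetical C reference for what __dot computes. It assumes 1.7.8 means a signed 16-bit value scaled by 2^8 and that the FX multiplier is a signed 16x16 to 32-bit multiply. A product of two 1.7.8 values X/2^8 and Y/2^8 is XY/2^16, so taking bytes 1 and 2 of the 32-bit sum (a shift right by 8, which is what reading from VRAM_MULT_X+1 does) gives a 1.7.8 result again, valid only while it stays within the 1.7.8 range (-128.0 to just under 128.0). The names `dot_1_7_8` and `to_1_7_8` are made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical PC-side reference for __dot (not part of the X16 code).
   Inputs are 1.7.8 fixed point: signed 16-bit integers scaled by 256. */
static uint16_t dot_1_7_8(int16_t x0, int16_t x1, int16_t y0, int16_t y1) {
    /* Signed 16x16 -> 32-bit products, summed with 32-bit wraparound. */
    uint32_t sum = (uint32_t)((int32_t)x0 * x1) + (uint32_t)((int32_t)y0 * y1);
    /* Bytes 1 and 2 of the 32-bit sum, i.e. the sum >> 8 as raw 1.7.8 bits. */
    return (uint16_t)((sum >> 8) & 0xFFFFu);
}

/* Converts a real number in the 1.7.8 range to 1.7.8 (truncating toward zero). */
static int16_t to_1_7_8(double v) { return (int16_t)(v * 256.0); }
```

With the values from the test call, `dot_1_7_8(0x0500, 0x0500, 0x0C00, 0x0100)` returns `0x2500`, i.e. 37.0.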
Last edited by doslogo on Tue Jan 07, 2025 9:52 pm, edited 2 times in total.
Re: FX: Sum of several multiplications?
I found a couple errors in your code:
- It never actually touches Multiplier_Enable ($9F2C.4, DCSEL=2).
- It mistakenly reads from $9F25 instead of $9F29 when trying to reset the accumulator, although it sets DCSEL correctly right beforehand.
- It mistakenly writes to $9F25 instead of $9F29 when trying to enable (and later disable) Cache_Write_Enable, although it sets DCSEL correctly right beforehand.
Re: FX: Sum of several multiplications?
DragWx wrote (Tue Jan 07, 2025 1:37 am): "I found a couple errors in your code: [...] And then a note: There's no need to write the product to VRAM before accumulating; simply accumulating is enough."
Yes, I messed up my own defines when I manually wrote the code for the forum. I have edited the code in my previous post, and I added an init_once subroutine similar to how my game handles initialization.
Apart from the "stz VERA_FX_CTRL ; $9F29 (mainly to reset Addr1 Mode to 0)" line from the documentation's example, which I had left out of the init subroutine I just wrote, I think it is functional.
Do I need to set bit 6 (accumulate) in $9F2C (DCSEL=2) during initialization? I don't think so, since things are working.
Yes, staying in DCSEL=6 instead of going back to DCSEL=2, and simply adding the product to the accumulator, did the trick! After switching back to DCSEL=2, Cache Write Enable must be turned on so that the sum of the two products is stored to VRAM. That saves a few cycles, so awesome! Thank you DragWx!
Re: FX: Sum of several multiplications?
You're welcome!
You don't need to: writing a "1" to bit 6 of $9F2C[DCSEL=2] does exactly the same thing as reading $9F2A[DCSEL=6], so just use whichever is more convenient for your code.
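Since both triggers lead to the same one-shot action, a hypothetical C model needs only one function per action and two entry points. The names are made up and the model covers only the behaviour described in this thread. It also makes one caveat visible: with DCSEL=2, $9F2C holds Multiplier Enable in bit 4 next to the two trigger bits, so a write meant only to accumulate rewrites the stored bits too. Writing your usual $9F2C value with bit 6 added (%01010000 with the init routine above) keeps the multiplier on, while writing just %01000000 would, as far as I can tell, also clear bit 4. Other stored fields such as Subtract Enable are omitted, and the ordering inside a single write is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two trigger paths for Reset Accumulator and
   Accumulate; names and details are illustrative, not VERA's source. */
typedef struct {
    uint32_t cache;   /* bytes 0-1 and 2-3 hold the two signed operands (assumed) */
    uint32_t accum;   /* accumulator, wraps modulo 2^32 */
    bool mult_enable; /* $9F2C bit 4 with DCSEL=2 */
} fx_regs;

/* Interprets the low 16 bits of v as a signed value. */
static int32_t sign16(uint32_t v) {
    int32_t x = (int32_t)(v & 0xFFFFu);
    return x >= 0x8000 ? x - 0x10000 : x;
}

/* One-shot actions shared by both trigger paths. */
static void do_reset(fx_regs *r) { r->accum = 0; }

static void do_accumulate(fx_regs *r) {
    if (r->mult_enable)
        r->accum += (uint32_t)(sign16(r->cache) * sign16(r->cache >> 16));
    else
        r->accum = r->cache; /* DragWx's reading: load the raw cache */
}

/* Path A: reads with DCSEL=6. The value returned by the read is not modelled. */
static void read_dcsel6(fx_regs *r, uint16_t addr) {
    if (addr == 0x9F29)
        do_reset(r);
    else if (addr == 0x9F2A)
        do_accumulate(r);
}

/* Path B: a write to $9F2C with DCSEL=2. Bit 4 is stored; bits 7 and 6 fire once.
   Assumed: the stored bit takes effect first, and reset runs before accumulate. */
static void write_9f2c_dcsel2(fx_regs *r, uint8_t value) {
    r->mult_enable = (value & 0x10) != 0;
    if (value & 0x80)
        do_reset(r);
    if (value & 0x40)
        do_accumulate(r);
}
```

Driving one copy of the state through `read_dcsel6(&a, 0x9F2A)` and another through `write_9f2c_dcsel2(&b, 0x50)` leaves both accumulators with the same value.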