VERA DMA

doslogo · Post by **doslogo** » Sun Dec 22, 2024 7:26 pm

I have been programming the Commander X16 for a while now, and it feels odd that it doesn't have a way to upload a full palette and all the sprites to VRAM like most game consoles of the past did thanks to DMA. The issue is, there are 256 palette entries (which is 512 bytes, so two loops basically) and 128 sprite attributes (a lot more bytes to loop through), but not enough time in vblank to update them all in a single frame. This causes problems when a game needs to fade in or fade out the screen, or when a lot of sprites need to move at 60 Hz. Game consoles like the NES made sure that you could use the hardware to its fullest. And when there wasn't time to change all the palette entries to do fade ins and fade outs on the Gameboy Advance, a special register was invented that could be used instead, to update all the visual colors at once. Not a bad idea for VERA, to have a fade in/fade out register, or maybe even a turn-off display register? I do need the screen off (as in, not turning off the TV, but not showing garbage on the screen) while loading stuff, and games usually do this by fading out first.

Even a pseudo-DMA would help a lot. If VERA could keep track of source, destination and count (with the count being wider than 8 bits), and the CPU could just wait in a busy loop, there would be speed-ups because the CPU would no longer have to run indirect indexed loops whose high byte has to be carried manually every 256 bytes! Everyone would benefit from an update like this, if it is possible. I know this is a computer and not a game console, but where do the priorities go?

I am not asking for HDMA here. Just a way to skip having the 65C02 deal with slow indirect indexed loops, and let VERA take the bytes from RAM and place it in VRAM without the CPU.

desertfish · Post by **desertfish** » Mon Dec 23, 2024 8:11 am

Are you using VERA's auto increment mode? Because updating the full palette during the vblank period is perfectly possible.
It's just 512 LDA's and STA's to VERA_DATA0 if set up correctly. Don't forget to unroll the loop for maximum throughput.
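A minimal sketch of what such a palette upload could look like, in ca65 syntax. It assumes the palette sits at VRAM $1FA00 and that a 512-byte, page-aligned shadow copy lives in RAM at a hypothetical address (PaletteShadow); the routine and label names are made up for illustration. The loop is only unrolled four times here to keep it short.

```asm
.setcpu "65C02"

VERA_ADDR_L   = $9F20
VERA_ADDR_M   = $9F21
VERA_ADDR_H   = $9F22
VERA_DATA0    = $9F23
VERA_CTRL     = $9F25
PaletteShadow = $0400       ; hypothetical page-aligned 512-byte RAM copy

upload_palette:
    STZ VERA_CTRL           ; ADDRSEL = 0 (this also sets DCSEL to 0)
    STZ VERA_ADDR_L         ; palette starts at VRAM $1FA00
    LDA #$FA
    STA VERA_ADDR_M
    LDA #$11                ; increment +1 per access, address bit 16 = 1
    STA VERA_ADDR_H
    LDX #0
pal_lo:                     ; first 256 bytes, 4 bytes per pass
    .repeat 4, I
        LDA PaletteShadow + I, X
        STA VERA_DATA0      ; VERA advances the VRAM address by itself
    .endrepeat
    INX
    INX
    INX
    INX
    BNE pal_lo              ; X wraps to 0 after 64 passes
pal_hi:                     ; second 256 bytes
    .repeat 4, I
        LDA PaletteShadow + $100 + I, X
        STA VERA_DATA0
    .endrepeat
    INX
    INX
    INX
    INX
    BNE pal_hi
    RTS
```

By a rough count this is about 43 cycles per 4 bytes, so around 5,500 cycles for all 512 bytes; unrolling further trims the INX/BNE overhead.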

The sprites are more data yes, but still. You won't be changing ALL sprite attributes every frame I would think?
I imagine updating the sprite positions is what is done most frequently. And again, if set up correctly using VERA's auto increment and, in this case, both data channels, you can use 2 tight loops once again to update all 128 X and Y position attributes more or less inside the vblank period, I would think.

I've attached a quick test program I whipped up that moves all 128 sprites across the screen.

The red bar at the top is where the vsync update is "slightly too slow" and continues in the visible screen. It's setting all 128 sprite positions there. (Notice that it's just 8 or 9 scanlines too slow)
The green bar is where it is incrementing the sprite x and y in the position arrays for the next frame. The code is fairly optimized but can still be improved. I'm pretty confident that with some more hand tuning of the loops, all 128 sprite positions can be updated in the vsync area before the raster line hits the first visible line again.

ahenry3068 · Post by **ahenry3068** » Mon Dec 23, 2024 12:22 pm

doslogo wrote: ↑Sun Dec 22, 2024 7:26 pm The issue is, there are 256 palette entries (which is 512 bytes, so two loops basically) and 128 sprite attributes (a lot more bytes to loop through), but not enough time in vblank to update them all in a single frame. This causes problems when a game needs to fade in or fade out the screen, or when a lot of sprites need to move at 60 Hz. Game consoles like the NES made sure that you could use the hardware to its fullest. And when there wasn't time to change all the palette entries to do fade ins and fade outs on the Gameboy Advance, a special register was invented that could be used instead, to update all the visual colors at once. Not a bad idea for VERA

All those things can easily be accomplished inside vblank with the existing architecture. (and there is a screen off register) My video playing code actually flips visibility on 70 sprites and copies all 512 bytes of the palette comfortably during Vblank. I also have a VERA decrement function that works to do a fade.

At work right now but I would be happy to provide some code for you later. My 2024 XMAS Demo does have the sprite flipping and palette copying code and it's not even fully optimized.

desertfish · Post by **desertfish** » Mon Dec 23, 2024 3:03 pm

If you need insight into some of the techniques required for performant 6502 and VERA programming, just ask away; people will be happy to help. The key is usually ending up with something that can utilize Vera's auto increment/decrement mode and possibly using both data ports at the same time, so that you are able to use a simple unrolled copy loop on the 6502 side. It won't get faster than that where raw data transfer is concerned.

Regarding DMA in particular: the only case where I have missed DMA is PCM sample playback. This is very CPU intensive right now (it requires streaming 160 KB/sec from disk and copying all of it into the VERA PCM buffer). But an 80s-inspired 8-bit system should perhaps have no business playing CD-quality stereo music (even though the X16 can do it, but barely).

Guybrush · Post by **Guybrush** » Mon Dec 23, 2024 5:04 pm

DMA is not possible with VERA simply because there aren't enough address lines connecting VERA to the address bus. There are only 5 lines which gives you access to 32 registers and that's also why VERA FX requires you to use the DCSEL bits to access all of its functionality.
DMA functionality which would allow VERA to access RAM (even if it's not a full DMA implementation) would require all 16 address lines to be connected to VERA, plus some extra lines connected directly to the CPU (BE, RDY...). More address lines would require a larger FPGA which would obviously be more expensive.

Wavicle · Post by **Wavicle** » Mon Dec 23, 2024 6:48 pm

As I recall the DMA on NES lived in the 2A03 CPU.

It would be possible to have hardware external to VERA do DMA writes. The current VERA firmware could not keep up with writes every cycle so a delay would be necessary.

doslogo · Post by **doslogo** » Mon Dec 23, 2024 7:16 pm

desertfish wrote: ↑Mon Dec 23, 2024 8:11 am You won't be changing ALL sprite attributes every frame I would think?

First off, thank you for the test program!

I am very used to programming on the Sega Genesis and Nintendo DS, where you keep a local copy in RAM of all the sprites and palette, and then just copy them to OAM or VRAM with DMA during vblank. If one were to allocate VRAM for every sprite and every palette entry, and then only update a few things every now and then, it would be slower than just having a local copy in RAM, do all the work there, and copy it all in one fast operation over to VRAM at vblank time.

On the Commander X16, I am forced to only use 32 sprites (a balance of game frame time, not vblank time), and 32 duplicated sprites for effects (all constructed onto a heap during active display, to be uploaded to VRAM as fast as possible by the CPU during vblank), since I need vblank to also upload palette and tilemap and hopefully like 1 or 2 8x8 4 bpp tiles (no way to do dynamic animations, the Sega Genesis can upload 40 tiles with DMA easily and keep the framerate at 60 Hz). We are talking about a standard game here. It feels weird to leave so many hardware sprites for nothing, but I will be fine with what I currently have.

ahenry3068 wrote: ↑Mon Dec 23, 2024 12:22 pm All those things can easily be accomplished inside vblank with the existing architecture. (and there is a screen off register) My video playing code actually flips visibility on 70 sprites and copies all 512 bytes of the palette comfortably during Vblank. I also have a VERA decrement function that works to do a fade.

At work right now but I would be happy to provide some code for you later. My 2024 XMAS Demo does have the sprite flipping and palette copying code and it's not even fully optimized.

I was looking at that Second Reality demo and kept an eye on the palette during fade ins and fade outs, and there were no places where the entire palette was updated per frame. I took that as a sign that it would be impossible. I mean, why wouldn't you have 3D polygons flying around and also fade in at the same time using every single palette entry?

desertfish wrote: ↑Mon Dec 23, 2024 3:03 pm Vera's auto increment/decrement mode and possibly using both data ports at the same time

Oh, I have already tried that. That was my first setup. It became very complex in the end to have two data ports, having to increment the source addresses twice per write and keep track of their carries. The CPU was just faster at moving data through one port. I do use it for vertical tilemap updates where I need 2 neighbor bytes to be written, 128 bytes apart.
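A rough sketch of that kind of two-port column write, in ca65 syntax. It assumes a hypothetical 64x32 tilemap at VRAM $04000 with 2-byte entries (so vertically adjacent entries are 128 bytes apart), plus two hypothetical 32-byte RAM buffers, ColumnTiles and ColumnAttrs; none of these names or addresses come from the posts above. DATA0 walks down the first byte of each entry and DATA1 walks down its neighbor, both with a +128 increment.

```asm
.setcpu "65C02"

VERA_ADDR_L = $9F20
VERA_ADDR_M = $9F21
VERA_ADDR_H = $9F22
VERA_DATA0  = $9F23
VERA_DATA1  = $9F24
VERA_CTRL   = $9F25
ColumnTiles = $0400         ; hypothetical: 32 first bytes (tile index)
ColumnAttrs = $0420         ; hypothetical: 32 second bytes (attributes)

; Input: A = column number (0-63)
write_column:
    ASL A                   ; 2 bytes per map entry
    TAY                     ; Y = byte offset of the column within row 0
    LDA #1
    STA VERA_CTRL           ; ADDRSEL = 1: address registers configure DATA1
    INY
    STY VERA_ADDR_L         ; DATA1 starts on the second byte of the entry
    DEY
    LDA #$40
    STA VERA_ADDR_M         ; map base $04000
    LDA #$80                ; increment +128, address bit 16 = 0
    STA VERA_ADDR_H
    STZ VERA_CTRL           ; ADDRSEL = 0: address registers configure DATA0
    STY VERA_ADDR_L         ; DATA0 starts on the first byte of the entry
    LDA #$40
    STA VERA_ADDR_M
    LDA #$80
    STA VERA_ADDR_H
    LDX #0
col_loop:
    LDA ColumnTiles,X
    STA VERA_DATA0          ; steps to the same column in the next row
    LDA ColumnAttrs,X
    STA VERA_DATA1
    INX
    CPX #32                 ; 32 rows in this hypothetical map
    BNE col_loop
    RTS
```

Keeping the two bytes in separate RAM buffers means a single index register serves both ports, which avoids the double source increment described above.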

Guybrush wrote: ↑Mon Dec 23, 2024 5:04 pm DMA is not possible with VERA simply because there aren't enough address lines connecting VERA to the address bus. There are only 5 lines which gives you access to 32 registers and that's also why VERA FX requires you to use the DCSEL bits to access all of its functionality.
DMA functionality which would allow VERA to access RAM (even if it's not a full DMA implementation) would require all 16 address lines to be connected to VERA, plus some extra lines connected directly to the CPU (BE, RDY...). More address lines would require a larger FPGA which would obviously be more expensive.

Yep, as much as I would like to blame the price of FPGAs, I still think the potential IS there. I was surprised that a 16 bit CPU was being considered to be supported, when the bottleneck between the CPU and VERA is the actual problem.

I am currently living with the reality that the Commander X16 can do about 32 dynamically updating sprites per frame (sprites that change all their attributes in a frame). I am happy with 32 sprites. It just sucks that VERA has 128 for some reason (most likely it is cheaper to have more sprites and fewer address lines than the opposite).

Wavicle wrote: ↑Mon Dec 23, 2024 6:48 pm As I recall the DMA on NES lived in the 2A03 CPU.

It would be possible to have hardware external to VERA do DMA writes. The current VERA firmware could not keep up with writes every cycle so a delay would be necessary.

The NES is the prime example of why DMA is needed for the CX16. Make a game for the NES that uses only the number of sprites that you would be able to copy using the CPU alone (no DMA, though you can't write sprites without DMA so my example lacks merit). Sure, you could get all the sprites to be visible at some point, but not all of them at the same time on each new frame. Nintendo could have saved money by forcing games on the NES to have only a few active sprites moving, and designing all the games to work with that. Games like Super Mario Bros. could have used those extra sprites as the status bar, since they don't need to move or update or anything. But for some reason, Nintendo decided to make all the sprites available through DMA. The question remains: why was Nintendo so stupid to do so when they should have gone the Commander X16 way and saved the money instead?

ahenry3068 · Post by **ahenry3068** » Mon Dec 23, 2024 7:36 pm

Another trick to remember is AUTOINCREMENT can be other values besides 1 and can also be an autodecrement as well.

Again using my video playing code as an example.

The video "frame" is a block of 35 sprites. I double buffer this in Sprites 1-35 & 36-70.

One set of sprites is visible. The other set is being loaded from the "video" file.

When it's time to display the next frame, my "Sprite Swap loop" simply loops 35 times storing %00001100 to VERA_DATA0, then another 35 times doing an STZ to VERA_DATA0. It does the SAME thing regardless of which set I'm activating. VERA_ADDR and the increment are set prior to the call. To turn 1-35 ON and 36-70 OFF, the VERA address is set to point at the visibility byte of Sprite 1 and AUTOINCREMENT is set to 8 (the size of an attribute block), so each STA VERA_DATA0 sets visibility for 1 to 35 in turn, and the STZs then turn off 36-70.

To do the opposite, the starting address is set to Sprite 70 and AUTODECREMENT is set to 8, so they turn on 70-36, then turn off 35-1.
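A sketch of how that swap could be written, in ca65 syntax. This is not the actual video player code; it assumes "Sprite 1" to "Sprite 70" are hardware sprite indices, that the attribute table is at VRAM $1FC00 with 8 bytes per sprite, and that the "visibility byte" is attribute byte 6, where %00001100 selects Z-depth 3 and 0 disables the sprite. The routine names are hypothetical.

```asm
.setcpu "65C02"

VERA_ADDR_L = $9F20
VERA_ADDR_M = $9F21
VERA_ADDR_H = $9F22
VERA_DATA0  = $9F23
VERA_CTRL   = $9F25

; Show sprites 1-35, hide 36-70.
show_first_set:
    STZ VERA_CTRL           ; ADDRSEL = 0
    LDA #$0E                ; $1FC00 + 1*8 + 6 = $1FC0E (sprite 1, byte 6)
    STA VERA_ADDR_L
    LDA #$FC
    STA VERA_ADDR_M
    LDA #$41                ; increment +8, address bit 16 = 1
    STA VERA_ADDR_H
    BRA swap

; Show sprites 70-36, hide 35-1.
show_second_set:
    STZ VERA_CTRL
    LDA #$36                ; $1FC00 + 70*8 + 6 = $1FE36 (sprite 70, byte 6)
    STA VERA_ADDR_L
    LDA #$FE
    STA VERA_ADDR_M
    LDA #$49                ; same step of 8, but with the DECR bit set
    STA VERA_ADDR_H

; Identical loop for both directions: 35 sprites on, then 35 off.
swap:
    LDA #%00001100          ; Z-depth 3, no flips, collision mask 0
    LDX #35
turn_on:
    STA VERA_DATA0
    DEX
    BNE turn_on
    LDX #35
turn_off:
    STZ VERA_DATA0          ; Z-depth 0 disables the sprite
    DEX
    BNE turn_off
    RTS
```

Because only the start address and the direction differ, the same 70-write loop serves both buffers. Note that writing the whole byte also clears the flip bits and collision mask, which is fine when those are unused.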

DragWx · Post by **DragWx** » Tue Dec 24, 2024 5:22 am

; Initialize
 LDX #$1F
 LDY #$61
 STZ $9F25; for ADDRSEL
; =8 cycles

next:
; Set DATA0 address, set increment to +32.
 STY $9F22
 LDA #$FC
 STA $9F21
 STX $9F20
; =14 Cycles

; 32 pairs of LDA/STA, with 32-byte offset.
 LDA SpriteTable,X
 STA $9F23
 LDA SpriteTable+$20,X
 STA $9F23
 LDA SpriteTable+$40,X
 STA $9F23
; ...
 LDA SpriteTable+$3E0,X
 STA $9F23
; =256 Cycles

 DEX
 BPL next
; =5 Cycles, -1 on final loop

(EDIT 2: This code doesn't assemble; check my later post for a fix, or see if you can catch what the mistake is.)

Check my work, but this should transfer a full 1024-byte sprite table to the VERA in 8807 CPU cycles. Align SpriteTable to a 32-byte boundary. Make sure "next:" and "BPL next" are on the same 256-byte page together, or else add +32 cycles to that figure.

For reference, vblank is 11520 CPU cycles in VGA mode, while in composite and RGB modes it's 10671.36 (even fields) or 10925.44 (odd fields).

This partially-unrolled loop is a variant of Duff's Device, which I learned from StarTropics on NES, which uses something like this to transfer data to the PPU faster than a plain loop.

Edit: Fixed the initial LDX value (mistakenly wrote it in decimal), and made all the constants hexadecimal so now everything matches.
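For anyone checking the work: the 8807 figure comes out as 8 + 32 x (14 + 256 + 5) - 1. One thing that will stop the listing from assembling as posted is branch range: the loop body is 11 bytes of address setup plus 32 x 6 bytes of LDA/STA pairs, so BPL next would have to reach back about 206 bytes, while 65C02 relative branches only reach 128 bytes backward. Below is a sketch of one possible workaround (not necessarily the fix from the later post), in ca65 syntax, using .repeat to generate the 32 pairs. The SpriteTable address and the labels upload_sprites and done are assumptions for illustration.

```asm
.setcpu "65C02"

VERA_ADDR_L = $9F20
VERA_ADDR_M = $9F21
VERA_ADDR_H = $9F22
VERA_DATA0  = $9F23
VERA_CTRL   = $9F25
SpriteTable = $0400         ; hypothetical 1024-byte RAM copy, 32-byte aligned

upload_sprites:
    LDX #$1F
    LDY #$61                ; increment +32, address bit 16 = 1
    STZ VERA_CTRL           ; ADDRSEL = 0
next:
    STY VERA_ADDR_H         ; rewritten every pass: the address wraps past $1FFFF
    LDA #$FC
    STA VERA_ADDR_M
    STX VERA_ADDR_L         ; start at $1FC00 + X
    .repeat 32, I           ; 32 LDA/STA pairs, source offset I*32
        LDA SpriteTable + I*32, X
        STA VERA_DATA0
    .endrepeat
    DEX
    BMI done                ; inverted test: leave once X drops below 0
    JMP next                ; JMP has no range limit
done:
    RTS
```

The BMI/JMP pair costs 2 cycles more per pass than a taken BPL, so by my count this adds roughly 63 cycles to the total (not counting RTS), which still fits in the vblank figures quoted above.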

doslogo · Post by **doslogo** » Fri Jan 03, 2025 6:28 pm

DragWx wrote: ↑Tue Dec 24, 2024 5:22 am
; Initialize
 LDX #$1F
 LDY #$61
 STZ $9F25; for ADDRSEL
; =8 cycles

next:
; Set DATA0 address, set increment to +32.
 STY $9F22
 LDA #$FC
 STA $9F21
 STX $9F20
; =14 Cycles

; 32 pairs of LDA/STA, with 32-byte offset.
 LDA SpriteTable,X
 STA $9F23
 LDA SpriteTable+$20,X
 STA $9F23
 LDA SpriteTable+$40,X
 STA $9F23
; ...
 LDA SpriteTable+$3E0,X
 STA $9F23
; =256 Cycles

 DEX
 BPL next
; =5 Cycles, -1 on final loop
Check my work, but this should transfer a full 1024-byte sprite table to the VERA in 8807 CPU cycles. Align SpriteTable to a 32-byte boundary. Make sure "next:" and "BPL next" are on the same 256-byte page together, or else add +32 cycles to that figure.

For reference, vblank is 11520 CPU cycles in VGA mode, while in composite and RGB modes it's 10671.36 (even fields) or 10925.44 (odd fields).

This partially-unrolled loop is a variant of Duff's Device, which I learned from StarTropics on NES, which uses something like this to transfer data to the PPU faster than a plain loop.

Edit: Fixed the initial LDX value (mistakenly wrote it in decimal), and made all the constants hexadecimal so now everything matches.

I'm impressed!
I am already thinking about how to do this with tile art: at 32 bytes per 4bpp tile, that is 32 tiles using this technique.
