CommandTracker: Mostly ideas for a music tracker sparse file format

kliepatsch · Post by **kliepatsch** » Thu Jan 21, 2021 4:36 pm

Here's what Concerto does with the voicing.

First off, terminology inside Concerto:

Oscillators correspond to the 16 voices of the VERA PSG. Every sound can use one or more oscillators.

Channels are the way Concerto organizes playback internally. There are 16 Channels. Each channel is monophonic, i.e. it hosts a single voice that is either active or inactive.

Voices refer to the notes that are played. Because each channel can host only a single voice, the two terms can often be used interchangeably. If you play a note on channel X, you automatically address voice X.

Each note-on event needs to specify

Channel

Timbre (which of the 32 synth patches to use)

Pitch

Volume

When a note-on event is issued, the engine first checks whether there is already an active voice in the specified channel. If the new note uses a different timbre than the one being played, the old voice is stopped and the new voice started. If the new voice has the same timbre as the old one, the voice is updated according to the "retrigger" and "portamento" settings.

Every time a voice is started, the engine checks whether enough oscillators are available. If yes, they are taken off the list of available oscillators (and returned to it once the voice is stopped). If there aren't enough oscillators available, the note is simply not played, so Concerto can run out of oscillators.

There are two types of note-off events: hard and soft. Hard note-offs (in the source code "stop_note") immediately deactivate the voice and return the oscillators to the free oscillator list. Soft note-offs ("release_note" in the source) trigger the release phase of all envelopes. As soon as the master envelope (the first one of the three) hits level 0, the voice is turned off automatically and the oscillators are returned to the free oscillator list.

Note-off events are simply communicated via the channel number.
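To make the bookkeeping above concrete, here is a minimal Python sketch of the channel/oscillator logic. It is not Concerto's actual code: the class name, the `oscillators_per_timbre` table and the `master_envelope_finished` hook are hypothetical, and the retrigger/portamento behaviour is not modelled.

```python
class VoiceAllocator:
    """Toy model of the channel/oscillator bookkeeping described in the post."""

    NUM_OSCILLATORS = 16  # the 16 voices of the VERA PSG
    NUM_CHANNELS = 16     # one monophonic voice per channel

    def __init__(self, oscillators_per_timbre):
        # Hypothetical table: timbre index -> number of oscillators it needs.
        self.oscillators_per_timbre = oscillators_per_timbre
        self.free = list(range(self.NUM_OSCILLATORS))
        self.channels = [None] * self.NUM_CHANNELS  # None = inactive voice

    def note_on(self, channel, timbre, pitch, volume):
        """Return True if the note sounds, False if oscillators ran out."""
        voice = self.channels[channel]
        if voice is not None and voice["timbre"] == timbre:
            # Same timbre: update the running voice in place.
            # (The real retrigger/portamento handling is not modelled here.)
            voice["pitch"], voice["volume"] = pitch, volume
            return True
        if voice is not None:
            self.stop_note(channel)  # different timbre: stop the old voice first
        needed = self.oscillators_per_timbre[timbre]
        if needed > len(self.free):
            return False  # not enough oscillators: the note is not played
        taken = [self.free.pop(0) for _ in range(needed)]
        self.channels[channel] = {
            "timbre": timbre, "pitch": pitch, "volume": volume,
            "oscillators": taken, "releasing": False,
        }
        return True

    def stop_note(self, channel):
        """Hard note-off: free the oscillators immediately."""
        voice = self.channels[channel]
        if voice is not None:
            self.free.extend(voice["oscillators"])
            self.channels[channel] = None

    def release_note(self, channel):
        """Soft note-off: enter the release phase, oscillators stay in use."""
        voice = self.channels[channel]
        if voice is not None:
            voice["releasing"] = True

    def master_envelope_finished(self, channel):
        """Hypothetical hook: called once the master envelope reaches level 0."""
        self.stop_note(channel)
```

For example, with a three-oscillator timbre on channel 0, a later note-on for a timbre that needs 14 oscillators returns `False`, because only 13 are left.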

Now I'd like to comment on a few things.

22 hours ago, m00dawg said:

Normally in a tracker, the channels and voices are 1:1 but when supporting multi-voice instruments, things are now not strongly coupled. Which I think is worth it, but means you could "run out" of voices. So yep having a way to see which voices are in use makes sense. Many trackers have this sort of feedback, though since voices == channels in those cases, it is a bit different than here. Something as simple as the equivalent of LEDs on a real synth showing which voices are active I think would be sufficient.

Yes, Concerto can run out of oscillators. And I agree some kind of visualization of how many are used seems reasonable.

I could imagine something in the direction of how Synth1 does it. See for reference

https://youtu.be/__2AFeG4xII?t=187

In the bottom right of Synth1 you see the 32 voices of the synth and which voices are currently active. Since the sound is making use of unison, it uses several detuned voices per note.

In our case, we would be displaying individual oscillators instead of voices (in the sense that a voice can employ several oscillators).

But your suggestion

On 1/17/2021 at 3:25 AM, m00dawg said:
So let's say I have something like:
ROW| VERA01 | VERA02 | VERA03 | ...
## | -------+--------+--------+
00 | C-4 01 | --- -- | --- -- |
01 | ... .. | ... .. | ... .. |
02 | C-4 02 | C-4 03 | ... .. |
	...
	
Sort of hard to represent in text, but basically if instrument 01 allocates 3 voices, we disable channels 2 and 3 (represented by the dashes, though in a multi-color tracker, we would probably just dim or highlight them to indicate they are in use). Then when we use instrument 02 on channel 1, if it's a single voice instrument, we know channels 2 and 3 are available and can be used for other things (as in the example).

seems possible, too. This would suggest that the user directly chooses which PSG oscillators to use instead of letting the synth engine manage that. I'm not a big fan of it, but it would provide more control, especially if you combine it with the user being able to stop individual oscillators and reuse them while other oscillators of a previous note are still going. This is what you said, if I understood you correctly:

On 1/17/2021 at 3:25 AM, m00dawg said:

It could even be possible to override one of the voices (say by placing a note on VERA02 at row 01 in the above example where instrument 01 might have a long tail), then their new note would take precedence. And actually likewise, if someone wanted to use a multi-voice instrument on channel 01 at row 00 and then put another instrument on channel 02 on the same row, channel 02 should override the voice.

So in other words, the UI could provide cues to channel usage when using multi-timbral voices, but the composer can still do as they please if they want to override voices.

If you did this, you would need some restructuring of the current voicing system. You need to allow for the extra freedom to terminate parts of a multi-oscillator sound. Currently, all oscillators are released at the same time. This would change if you allowed for partial releases.

It's good to see that you are already on your way with 6502 machine language.

m00dawg · Post by **m00dawg** » Thu Jan 21, 2021 5:36 pm

All great points!

Oh boy the voice vs channel thing might get confusing haha. But yeah sounds like your engine isn't far off from how a tracker would work, though you lost me a bit on your channel vs voice explanation. You mentioned:

Quote

Each note-on event needs to specify

Channel

Timbre (which of the 32 synth patches to use)

Pitch

Volume

But you mentioned each channel is monophonic? But the timbre (synth patch) could itself use multiple voices?

Your note-on looks very similar to my pattern note-on as well. For example:


	PSG NOTES [CHANNEL 0-15, VeraSound 1-16]
2-X bytes
                      | If Note Flag Set       | If Vol Flag Set | If Effect Flag Set                        |
Channel : N/V/E Flags | Note   : Octave : Inst | Pan    : Vol    | Next Effect Flag : Effect # : Effect Data |
5-bits  : 3-bits      | Nibble : Nibble : Byte | 2-bits : 6-bits | 1-bit            : 7-bits   : Byte        |

The NVE flags indicate what comes after the Channel/NVE byte. You can have note, volume, or effects (or all) in pattern data, so this helps keep things smaller (in the sparse format). You can see there's a "next effect" concept here, where there can be, in this case, N-number of effects. In reality, the main issue here is how to represent a 'non-sparse' pattern buffer such as when the composer is actively writing a pattern. Having that in a non-sparse format makes lots of things much easier. In the sparse format, if I add a note in between two previously defined rows and channels I'll have to do a lot of shifting around of data. Or at least I haven't figured out a great way to solve this. That means, practically, the number of effects might have to be limited.
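As an illustration of the layout in the table above, here is a minimal Python sketch that packs and unpacks one PSG event. The field widths come from the table; everything about bit positions is an assumption (channel in the upper 5 bits, flags ordered N/V/E in the lower 3, note in the high nibble, pan in the top 2 bits, "next effect" flag in bit 7), and the function names are made up for this example.

```python
# Assumed flag bit order within the low 3 bits of the first byte: N, V, E.
FLAG_NOTE, FLAG_VOL, FLAG_EFFECT = 0b100, 0b010, 0b001


def pack_event(channel, note=None, volume=None, effects=()):
    """note = (note, octave, instrument); volume = (pan, vol);
    effects = sequence of (effect_number, effect_data)."""
    flags = 0
    body = bytearray()
    if note is not None:
        n, octave, instrument = note
        flags |= FLAG_NOTE
        body += bytes([(n & 0x0F) << 4 | (octave & 0x0F), instrument & 0xFF])
    if volume is not None:
        pan, vol = volume
        flags |= FLAG_VOL
        body.append((pan & 0x03) << 6 | (vol & 0x3F))
    if effects:
        flags |= FLAG_EFFECT
        for i, (number, data) in enumerate(effects):
            # Bit 7 set means another effect follows this one.
            more = 0x80 if i < len(effects) - 1 else 0x00
            body += bytes([more | (number & 0x7F), data & 0xFF])
    return bytes([(channel & 0x1F) << 3 | flags]) + bytes(body)


def unpack_event(data, pos=0):
    """Decode one event starting at pos; return (event, position after it)."""
    head = data[pos]
    pos += 1
    event = {"channel": head >> 3, "note": None, "volume": None, "effects": []}
    if head & FLAG_NOTE:
        event["note"] = (data[pos] >> 4, data[pos] & 0x0F, data[pos + 1])
        pos += 2
    if head & FLAG_VOL:
        event["volume"] = (data[pos] >> 6, data[pos] & 0x3F)
        pos += 1
    if head & FLAG_EFFECT:
        while True:
            more = data[pos] & 0x80
            event["effects"].append((data[pos] & 0x7F, data[pos + 1]))
            pos += 2
            if not more:
                break
    return event, pos
```

With these assumptions a note-only event takes 3 bytes and a volume-only event 2 bytes, while each extra effect adds 2 more.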

In terms of the channel allocation, the example I mentioned previously (the one you commented on) is the most powerful option, but the trade-off is having to navigate through patterns more when composing. I kinda like the idea of moving allocations to the end (last channels) instead. So in the above example, rather than channels 2 and 3 getting flagged as used, 15 and 16 would get flagged. That way the pattern data stays easier to follow, since there won't be a lot of empty channels in the middle of a pattern to contend with - more of the data gets pushed to the lower channels. The downside, I think, is that the composer loses control of which notes get overtaken in this scenario - or at least it becomes a lot less obvious. Instead of pattern-UI cues, though, it may be simpler and almost as effective to have an active voice UI area (e.g. like Synth1).

I'd imagine most songs, especially with multi-voice instruments, won't commonly use all 16 channels within patterns. I could see it in some cases, but it probably won't be super common. In this case, multi-voice instruments are a nice solution because they keep the pattern data more sparse (at least in the sparse file format).

kliepatsch · Post by **kliepatsch** » Thu Jan 21, 2021 11:17 pm

5 hours ago, m00dawg said:

Quote

Each note-on event needs to specify

Channel

Timbre (which of the 32 synth patches to use)

Pitch

Volume

But you mentioned each channel is monophonic? But the timbre (synth patch) could itself use multiple voices?

Correct. Channels are monophonic and timbres can be used multiple times. For each event it needs to be specified which channel it is on. I don't see why those should contradict each other?

Edit: I think I get the point. In a tracker, the events are on a channel, so you don't have to specify which channel for each event. You simply put them on which channel you want them. But nevertheless, you have to communicate that to the synth engine.

5 hours ago, m00dawg said:

In reality, the main issue here is how to represent a 'non-sparse' pattern buffer such as when the composer is actively writing a pattern. Having that in a non-sparse format makes lots of things much easier. In the sparse format, if I add a note in between two previously defined rows and channels I'll have to do a lot of shifting around of data. Or at least I haven't figured out a great way to solve this. That means, practically, the number of effects might have to be limited.

I was worried about this as well when I started out. But then I read somewhere in this forum about how Stefan had solved this problem for X16 edit. I don't recall where exactly he talked about it, but he manages memory in blocks of 256 bytes I think (memory "pages"). Each block of data contains a couple of bytes of metadata, pointing to the next and the previous block and recording how many bytes of the block are actually used. That way, if the composer inserts/deletes effects/notes/whatever, you never have to move more than 256 bytes of data around (which should be fast enough for human editing). If more space is needed, you simply allocate a new page of data, move into it what couldn't fit into the other pages, and update the pointers of the previous and the following page. He puts those pages into the banked RAM. Of course, you need to keep track of which pages have been used and which ones are free. (And memory could get fragmented over time, which Stefan also solved IIRC). I think a lot of these concepts could be applied to a tracker. I found it helpful to at least know that those problems can be solved.
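The page idea can be sketched in a few lines of Python. This is only an illustration of the concept as described above, not X16 Edit's actual implementation: the 3-byte header size, the `Page` class and the function names are assumptions, and free-page tracking and defragmentation are left out.

```python
PAGE_SIZE = 256
HEADER = 3                    # assumed: prev page, next page, bytes used
PAYLOAD = PAGE_SIZE - HEADER  # bytes of pattern data that fit in one page


class Page:
    def __init__(self):
        self.prev = None
        self.next = None
        self.data = bytearray()  # len(self.data) stands in for "bytes used"


def insert_bytes(page, offset, new_bytes):
    """Insert new_bytes at offset; overflow spills into freshly linked pages.

    Python lets the bytearray grow past PAYLOAD for a moment; on the real
    machine the tail would be moved out before the insert.
    """
    page.data[offset:offset] = new_bytes
    while len(page.data) > PAYLOAD:
        spill = Page()
        spill.data = page.data[PAYLOAD:]  # whatever did not fit
        del page.data[PAYLOAD:]
        # Link the new page between this page and its old successor.
        spill.prev, spill.next = page, page.next
        if page.next is not None:
            page.next.prev = spill
        page.next = spill
        page = spill


def read_all(first):
    """Walk the chain and return the data as one contiguous byte string."""
    out = bytearray()
    page = first
    while page is not None:
        out += page.data
        page = page.next
    return bytes(out)
```

An insert in the middle of a page touches only that page and, at most, the newly allocated ones; every other page in the chain stays where it is.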

m00dawg · Post by **m00dawg** » Sat Jan 23, 2021 6:04 pm

@JimmyDansbo Here's the thread! If you want to see my entire pattern explanation it's here:

https://gitlab.com/m00dawg/commander-x16-programs/-/blob/master/command_tracker/file_formats/sparse_pattern_format.md

I should preface this by saying this is a big topic, so your advice thus far has been super helpful! And if this is a bit too much to get into, no worries!

This is tracker-centric, so if you're not familiar with music trackers, here's a quick intro to patterns: a pattern is made up of rows. Usually it's 64 (though it can be other lengths in many trackers). It's a bit like a piano roll. Each row then has multiple channels - for the X16 there are currently 25 (16 PSG, 8 FM, 1 DPCM). In each channel, we can have a note, volume, or 1 (or more) effects. So borrowing an example from up above:


ROW| VERA01 | VERA02 | VERA03 | ...
## | -------+--------+--------+
00 | C-4 01 | --- -- | --- -- |
01 | ... .. | ... .. | ... .. |
02 | C-4 02 | C-4 03 | ... .. |
	...

So in this simplified example, at row 0 we are playing C at octave 4 using instrument 1 on channel 1. Row 1 has no data, and row 2 has note data for channels 1 and 2.

In a non-sparse format, one way to look at this is as a big matrix where every row has data for all 25 channels, even if empty. In terms of arrays, I could perhaps see 25 arrays of 64 entries, where each array is a channel?

But for sparse storage, it gets more complicated. In the above example, the sparse format might be something like:

00 01 C-4 01 30

02 01 C-4 02 02 C-4 03 31

I simplified this a bit compared to the real file format definition, but hopefully it shows off the problem. Row 00 would be, say, 4 bytes. Row 01 isn't stored at all since it has no data, and row 02 is 8 bytes. 30 and 31 are special channels that indicate we are at the end of defined data for a row (30) and pattern (31). Again, simplifying a bit, since the actual channel byte includes some flags, so the byte value would be different.
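The simplified encoding can be sketched in Python as follows. It works on a list of tokens rather than real bytes, mirroring the notation of the example rather than the actual file format. The `cells` dictionary and the function name are made up for illustration, and an entirely empty pattern is not handled.

```python
END_OF_ROW, END_OF_PATTERN = 30, 31  # the special "channels" from the example


def encode_sparse(cells):
    """cells maps (row, channel) -> (note, instrument).

    Returns a flat token list: row number, then channel/note/instrument
    for every used channel, then an end marker. Empty rows are skipped.
    """
    rows = {}
    for (row, channel), cell in cells.items():
        rows.setdefault(row, {})[channel] = cell

    stream = []
    ordered = sorted(rows)
    for i, row in enumerate(ordered):
        stream.append(row)
        for channel in sorted(rows[row]):
            note, instrument = rows[row][channel]
            stream += [channel, note, instrument]
        # As in the example, the last stored row ends with 31 instead of 30.
        last = i == len(ordered) - 1
        stream.append(END_OF_PATTERN if last else END_OF_ROW)
    return stream


pattern = {
    (0, 1): ("C-4", 1),
    (2, 1): ("C-4", 2),
    (2, 2): ("C-4", 3),
}
print(encode_sparse(pattern))
```

Row 1 never appears in the output, which is where the space saving comes from, and also why inserting a note later means shifting everything that follows.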

A full song is comprised of a series of patterns using an order list (internally an array of pointers to patterns).

Since the patterns are compact and variable data, in this scenario, I don't think I can avoid storing the data in a single sequential array of bytes?

Hopefully all that makes sense. I recognize that it's a non-trivial problem and also point out I'm still pretty green when it comes to assembly programming here. My sparse file format actually looks similar to GoatTracker's so that makes me think I'm on the right track. It seems like some of the modern trackers on PC (e.g. Deflemask) use non-sparse patterns since there's plenty of RAM to spare, but on the X16 I don't think I can get away with that easily.

If I pull back on a few features (notably multiple effects) then I can likely cram a single full pattern into 1 bank of 8k RAM. That makes things considerably simpler - each pattern gets mapped to a bank and we're done. And for the tracker, this may be ideal. It would still be too much RAM (in my opinion anyway) to use for just playing a song back, say in a game, where the sparse format saves A TON of space.
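A quick back-of-the-envelope check of that claim, as a Python sketch. The cell size is an assumption: it reuses the field widths of the PSG note layout quoted earlier in the thread (note/octave byte, instrument byte, pan/volume byte, 2 bytes per effect) for all 25 channels, even though FM and DPCM cells may end up different.

```python
ROWS = 64
CHANNELS = 25          # 16 PSG + 8 FM + 1 DPCM
BANK_SIZE = 8 * 1024   # one bank of banked RAM


def cell_size(effects):
    # Assumed: note/octave + instrument + pan/volume, plus 2 bytes per effect.
    return 3 + 2 * effects


def pattern_size(effects):
    """Size in bytes of a full (non-sparse) pattern."""
    return ROWS * CHANNELS * cell_size(effects)


def cell_offset(row, channel, effects=1):
    """Byte offset of a cell inside the bank (row-major, channel 0-based)."""
    return (row * CHANNELS + channel) * cell_size(effects)


for n in (1, 2):
    size = pattern_size(n)
    verdict = "fits" if size <= BANK_SIZE else "does not fit"
    print(f"{n} effect(s): {size} bytes, {verdict} in {BANK_SIZE}")
```

Under these assumptions one effect per cell gives 8000 bytes, just under the 8192-byte bank, while two effects per cell would need 11200 bytes. A nice side effect of the fixed layout is that finding a cell is a single multiplication instead of a search.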

JimmyDansbo · Post by **JimmyDansbo** » Sat Jan 23, 2021 8:18 pm

Unfortunately I am not familiar with trackers other than I have used some to play music in DOS many years ago.

2 hours ago, m00dawg said:

00 01 C-4 01 30

02 01 C-4 02 02 C-4 03 31

~~Where does the '30' and '31' come from in the end of those 2 lines?~~

I am thinking about the sparse format... there might be a way

JimmyDansbo · Post by **JimmyDansbo** » Sat Jan 23, 2021 9:04 pm

I think I might have a suggestion on how you could read and use the sparse file format.

Beware, this is a mix of pseudo-code and assembly code and it is getting quite late here, but I think you might get the idea.

Quote

; Initialization
; Load sparse file into RAM bank, address = $A000
; (stz, bra, phy and ply need the 65C02 instruction set)
SONG_START   = $A000
Tick_ptr     = $22   ; $22 & $23, 2 zero-page bytes for storing address value

   ; Set Tick pointer to start of song
   lda   #<SONG_START
   sta   Tick_ptr
   lda   #>SONG_START
   sta   Tick_ptr+1
   stz   Curr_row   ; Current row set to 0 = start of song

; This function is called at regular intervals
on_tick_function:
   ldy   #0       ; Use Y register as index
   lda   (Tick_ptr),Y   ; Load row number
   ; If it is equal to current row, we need to handle the data
   ; otherwise we increment current row and wait for next tick
   ; (the Tick pointer stays where it is in that case)
   cmp   Curr_row
   bne   @next_row
@do_channels:
   ; Read channel number (or an end marker)
   iny
   lda   (Tick_ptr),Y
   ; If we have read 30 or 31, we are done with this row
   ; (31 also ends the pattern; starting the next pattern is not handled here)
   cmp   #30
   beq   @advance
   cmp   #31
   beq   @advance
   sta   Channel
   ; Read note
   iny
   lda   (Tick_ptr),Y
   sta   Note
   ; Read octave
   iny
   lda   (Tick_ptr),Y
   sta   Octave
   ; Read instrument
   iny
   lda   (Tick_ptr),Y
   sta   Instrument
   ; Call a function that handles the actual playing of notes on channels
   ; This function should read the content of the global variables:
   ; * Channel
   ; * Note
   ; * Octave
   ; * Instrument
   phy           ; Y is our index, so keep it safe across the call
   jsr   Play_note_on_channel
   ply
   bra   @do_channels
@advance:
   ; Move Tick pointer to next row
   iny           ; Add the value in Y to the Tick pointer
   tya           ; to prepare for next time the function
   clc           ; is called.
   adc   Tick_ptr
   sta   Tick_ptr
   bcc   @next_row
   inc   Tick_ptr+1   ; Carry into the high byte of the pointer
@next_row:
   inc   Curr_row
   rts

Channel      !byte   0
Note         !byte   0
Octave       !byte   0
Instrument   !byte   0
Curr_row     !byte   0

kliepatsch · Post by **kliepatsch** » Sun Jan 24, 2021 6:50 pm

I think this is roughly how it should be done. There's a couple of things that need to be considered. There will be more event types than note-ons. There's also note-offs and effects. And it will be a fun puzzle to pack all that data efficiently.

There are a couple of questions that arise:

How do you interact with the data when editing? How is it displayed, and how do you find the correct address when you are not reading from beginning to end, but rather looking for the event that is in channel X and row Y, if there is one at all? And last, how do you go about inserting and deleting events?

For insert and delete, I would treat the data as a big string where we want to insert or delete a few bytes, and try to employ similar strategies as in X16edit.

The data structure for the text in X16edit is a bidirectionally chained list of ~250 byte long blocks. I think nothing speaks against having many such lists on a "heap" that correspond to different patterns.

Searching data by channel and row ... Hmm ... This is another interesting challenge. The first thing that comes to my mind is searching by bisecting. But this will be a complex routine because of the way the data will likely be structured. Another way would be a lookup table that stores the starting addresses for each row, which is generated and stored only for the currently active pattern. From the starting address of the row you only need to read along the data string until you find the right channel, which won't take long since there are only like ~25 channels. I honestly think that this could be the way to go.
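A minimal Python sketch of the lookup-table idea, using the simplified format from earlier in the thread with one byte per field (row number, then channel/note/instrument triplets, then 30 or 31). The fixed 3-byte entry size and the function names are assumptions for illustration; with the real variable-length events the inner loop would have to decode each event's flags to find its length.

```python
END_OF_ROW, END_OF_PATTERN = 30, 31
ENTRY_SIZE = 3  # assumed: channel, note, instrument, one byte each


def build_row_table(stream, num_rows=64):
    """Return a list where table[row] is the offset of that row's data,
    or None if the row is not stored in the sparse stream."""
    table = [None] * num_rows
    pos = 0
    while pos < len(stream):
        table[stream[pos]] = pos          # first byte of a row is its number
        pos += 1
        while stream[pos] not in (END_OF_ROW, END_OF_PATTERN):
            pos += ENTRY_SIZE             # skip one channel entry
        if stream[pos] == END_OF_PATTERN:
            break
        pos += 1                          # step over the end-of-row marker
    return table


def find_event(stream, table, row, channel):
    """Return the offset of the entry for (row, channel), or None."""
    pos = table[row]
    if pos is None:
        return None
    pos += 1
    while stream[pos] not in (END_OF_ROW, END_OF_PATTERN):
        if stream[pos] == channel:
            return pos
        pos += ENTRY_SIZE
    return None
```

The table costs two bytes per row on the real machine (128 bytes for 64 rows) and only needs rebuilding for the pattern currently being edited.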

Displaying imo is not hard. You basically start with the assumption that the entire pattern is empty. That's the way you initialize the whole pattern in the graphics memory. Then you parse the sparse pattern just like you would do for playback. It's essentially the same thing. Just instead of communicating the events to the sound engine, you put them into the corresponding spot in the pattern table.
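The display approach might look like this in Python, again using the simplified token notation from the earlier example (row, channel/note/instrument triplets, 30 or 31 as end markers). Text lines stand in for graphics memory, channels are assumed to be numbered from 1 as in the example, and the function name is made up.

```python
END_OF_ROW, END_OF_PATTERN = 30, 31


def render_pattern(stream, num_rows, num_channels):
    """Start from an all-empty grid, then drop each stored event into place."""
    grid = [["... .."] * num_channels for _ in range(num_rows)]
    pos = 0
    while pos < len(stream):
        row = stream[pos]
        pos += 1
        while stream[pos] not in (END_OF_ROW, END_OF_PATTERN):
            channel, note, instrument = stream[pos:pos + 3]
            # Same walk as playback, but the event goes to the grid
            # instead of the sound engine.
            grid[row][channel - 1] = f"{note} {instrument:02d}"
            pos += 3
        if stream[pos] == END_OF_PATTERN:
            break
        pos += 1
    return [f"{r:02d} | " + " | ".join(cells) + " |"
            for r, cells in enumerate(grid)]


stream = [0, 1, "C-4", 1, 30, 2, 1, "C-4", 2, 2, "C-4", 3, 31]
for line in render_pattern(stream, num_rows=3, num_channels=3):
    print(line)
```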

Ha this has become a long string of thoughts, but is essentially my current state of ideas on the topic.

JimmyDansbo · Post by **JimmyDansbo** » Sun Jan 24, 2021 7:19 pm

When you have loaded the file, and figured out how to display it, you might actually be able to use VERA's RAM as the "workspace".

What I mean is that you place the entire song in VRAM as the characters that should be displayed on screen. When you then "move" around the song, you scroll the layer. Editing is done directly in VRAM, and if you want to play the song, you read the characters from VRAM and convert them into data that can be used by the PSG.

That way, you don't have to keep track of lists, you just use the coordinates in VRAM.

Not sure if it is fast enough though?

m00dawg · Post by **m00dawg** » Sun Jan 24, 2021 7:25 pm

Yep, y'all make some great points, and thanks for the code, Jimmy. That's helpful, and it looks about like how I was thinking about it in my pseudo-code.

I think at least for the first version of the tracker, though it's super wasteful, using banked RAM to store the full patterns probably makes the most sense for now. It means I have to go down to 1 effect per channel/row but there's a few workarounds for that. One is to set effects before you get to the note, if you weren't previously outputting anything. Another is some concept of macros (AdlibTracker has these as I recall) where one could define effects there and simply call the macro.

Using banked RAM also makes the order list super easy. The pattern # is just the bank #. This leaves bytes on the table (not many, but some), but for the tracker itself it makes things a lot simpler.

For the file storage and for an embedded playback routine, the sparse format I think is definitely superior. I suppose for playback something akin to a VGM file might be even better...hmm...

Quick update on the tracker itself, it actually does play notes, and displays their values. It's small steps but it's something!

m00dawg · Post by **m00dawg** » Sun Jan 24, 2021 7:30 pm

6 minutes ago, JimmyDansbo said:

When you have loaded the file, and figured out how to display it, you might actually be able to use VERA's RAM as the "workspace".

What I mean is that you place the entire song in VRAM as the characters that should be displayed on screen. When you then "move" around the song, you scroll the layer. Editing is done directly in VRAM, and if you want to play the song, you read the characters from VRAM and convert them into data that can be used by the PSG.

That way, you don't have to keep track of lists, you just use the coordinates in VRAM.

Not sure if it is fast enough though?

Yeah I had thought about doing this for the current working pattern (instead of banked RAM like I mentioned above) and also using this to scroll patterns on playback. The full pattern is 25 channels by 64 rows so it ends up being pretty big. Even with the largest map size, I run out of horizontal space on the map for all the channels.

However, I think this is the right option for scrolling the pattern. Since you can't see all channels at once, I'll only have to render part of the full pattern width, and I think I should have enough space to store all 64 rows of that partial view for efficient scrolling.