Signature for ROM based programs

TomXP411 · Post by **TomXP411** » Mon Jan 25, 2021 7:26 pm

10 hours ago, kktos said:

It was me saying oh no, not again. But I'm smiling while writing it down. No worries.

So you asked a very good question: why ? Good, let's think about it.

Honestly, Commodore knew what they were doing with the KERNAL jump table, and while I strongly disagree with several of their design principles, this is one that makes sense.

Quote

1- (Weak argument) you'll have a 3 bytes * n addresses, bigger than a 2 bytes * n.

The 6502 does not support an indirect JSR, so using a vector table means writing additional code to set up the system call in the first place. The best use for vector tables is in mutable entry points. For example, we can modify the behavior of BASIC at runtime by changing the vector that reads the tokens:

Quote

-2- (Weak argument) you'll have to remember a lot of addresses

If there are n system functions to call, you need n function numbers. It doesn't matter if those take the form JMP n or LDX n; JSR SysCall. You still have the same number of addresses to memorize; you're just plugging those addresses into a different place in your code.

Quote

- vectors table

-1- you'll have a single point of entry

-2- your API could be very easily consistent and coherent. You can have a stub code dealing with the parms and then calling the internal function. you're making a toolbox rather than a bag of tricks.

-3- for the dev, once he gets how to deal with your entrypoint API, job's done. Pretty easy to use.

-4- (Weak argument) pretty easy to add, change or remove a function as all calls go thru the stub.

1. A single point of entry has no inherent value when a program has multiple routines that can be invoked.

2. This does nothing to promote consistency. With only 3 registers, your program still needs a way to pass parameters. The JSR address has nothing to do with the parameter addresses.

3. If developers need to learn to "deal with your entry point API" on top of your individual functions, you did something wrong.

4. Irrelevant. A dispatch routine still needs a lookup table for subroutine addresses, and the developer still needs to be informed of new and deprecated routines.

A single point of entry is a liability, as you're going to waste at least 20 cycles just loading the jump address, plus additional cycles marshaling parameters. And let's not forget that using up a parameter for your function name means losing that space for passing parameters to your functions. This increases complexity, which either decreases performance or decreases reliability.

Finally, there's nothing inherently coherent or consistent about a single entry point; the consistency of your API depends entirely on... designing a consistent API. In fact, the worst APIs I have seen are the ones where someone tried to shoehorn a bunch of functions into a single entry point, rather than creating a separate function for each procedure.

Consider the following c style function calls. Which is more readable?

DispatchCommand(LoadFileFunction,"filename");

or just

Load("filename");

The second is always going to be easier to read and easier to invoke. Likewise, when I'm looking at Commodore assembly code, and I see JSR $FFD2, I know that's CHROUT. I don't have to backtrack the code to look for an LDX #2, because JSR $FFD2 is an explicit call to the CHROUT function. It will always mean exactly that, no matter what data is in parameter memory or in the registers.
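To make the comparison concrete, here is a minimal C sketch of both styles. The function IDs, the `handlers` table, and the stand-in bodies of `Load` and `Save` are hypothetical and only illustrate the shape of the two call conventions; they are not taken from any real ROM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function IDs for a single-entry-point API. */
enum { FN_LOAD = 0, FN_SAVE = 1, FN_COUNT };

/* Stand-ins for real system routines: each returns a value derived
   from the length of its argument so the calls can be compared. */
static int Load(const char *name) { return (int)strlen(name); }
static int Save(const char *name) { return -(int)strlen(name); }

/* Single entry point: the function ID selects a handler from a table,
   so the dispatcher still needs one table entry per routine. */
typedef int (*handler_fn)(const char *);
static const handler_fn handlers[FN_COUNT] = { Load, Save };

static int DispatchCommand(int function_id, const char *arg)
{
    assert(function_id >= 0 && function_id < FN_COUNT);
    return handlers[function_id](arg);
}
```

Both `DispatchCommand(FN_LOAD, "filename")` and `Load("filename")` reach the same routine. The dispatcher adds a table lookup and an extra argument, while the direct call names the routine at the call site.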

kktos · Post by **kktos** » Mon Jan 25, 2021 7:42 pm

11 minutes ago, TomXP411 said:

Thanks for sharing your thoughts.

pzembrod · Post by **pzembrod** » Mon Jan 25, 2021 9:45 pm

13 hours ago, kktos said:

It was me saying oh no, not again. But I'm smiling while writing it down. No worries.

Trying to reply inline. Spoiler: Most of your reasons I think I either don't buy or don't understand.

13 hours ago, kktos said:

So you asked a very good question: why ? Good, let's think about it.

- list of JMPs

-1- (Weak argument) you'll have a 3 bytes * n addresses, bigger than a 2 bytes * n.

If we assume that each function gets called once, then the saved jmp opcode byte gets more than offset by the 2 bytes of the additionally needed lda #function_id in the function call.
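The byte count behind this argument can be sketched as follows. It assumes a jump-table call site is a 3-byte JSR, a vector-table call site is a 2-byte LDA #function_id plus a 3-byte JSR, and it ignores the size of the dispatcher code itself; the function names are made up for illustration.

```c
#include <assert.h>

/* Jump table: n JMP instructions (3 bytes each) plus one
   3-byte JSR per call site. */
static unsigned jump_table_bytes(unsigned n, unsigned calls)
{
    assert(n > 0);
    return 3u * n + 3u * calls;
}

/* Vector table: n 2-byte vectors plus, per call site,
   LDA #function_id (2 bytes) and JSR entry (3 bytes).
   The dispatcher stub itself is not counted. */
static unsigned vector_table_bytes(unsigned n, unsigned calls)
{
    assert(n > 0);
    return 2u * n + 5u * calls;
}
```

With 10 functions each called once, the jump table costs 60 bytes and the vector table 70, so under these assumptions the byte saved per table entry is already lost at the first call site.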

13 hours ago, kktos said:

-2- (Weak argument) you'll have to remember a lot of addresses

You either have to remember a lot of function numbers or a lot of function addresses. Since each address is equal to $c000 + 3 * function number, I find the difference not really significant. A number would be a little easier to remember, I guess, but in practice you'd want an include file with label defines in either case.
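The mapping between function numbers and jump-table addresses is simple enough to write down. This sketch takes the $c000 base from the sentence above; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define JUMP_TABLE_BASE 0xC000u /* base address used in the post */
#define JMP_ENTRY_SIZE  3u      /* one JMP opcode plus a 2-byte operand */

/* Address of the JMP instruction for a given function number. */
static uint16_t jump_entry_address(uint8_t function_number)
{
    uint32_t addr = JUMP_TABLE_BASE + JMP_ENTRY_SIZE * function_number;
    assert(addr <= 0xFFFFu);
    return (uint16_t)addr;
}
```

In practice an include file would define one label per entry, which is the same work as defining one constant per function number.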

13 hours ago, kktos said:

- vectors table

-1- you'll have a single point of entry

This is the only point where I see a certain merit, but only in case you need the single point of entry to be flexible, e.g. if you want to route the entire API through a jmp(), so you can e.g. easily switch between different implementations, flavours, feature sets etc. Or if the library and thus its entry point get loaded to a dynamically determined memory area (easy with pure-relative-addressing 6809 code, harder with non-relocatable 6502 code).

But neither of these cases I see applying to an X16 ROM.

13 hours ago, kktos said:

-2- your API could be very easily consistent and coherent. You can have a stub code dealing with the parms and then calling the internal function. you're making a toolbox rather than a bag of tricks.

I don't see why a function parameter should make an API more or less consistent or coherent. Both consistency and coherence come, imho, from the semantics of the functions and their parameters, from the mental model behind them, etc., not from which dispatching mechanism is chosen.

13 hours ago, kktos said:

-3- for the dev, once he gets how to deal with your entrypoint API, job's done. Pretty easy to use.

Same argument can be made for a jump list. No difference, imho.

13 hours ago, kktos said:

-4- (Weak argument) pretty easy to add, change or remove a function as all calls go thru the stub.

Again, no difference, if you assume that the function code dispatching is to be efficient, i.e. through a vector table. Inserting, adding, removing jmp statements from a jump table is no more nor less easy than doing the same with a vector table and a list of function code definitions. I really don't see a substantial difference here.

I can totally accept your reasons as your personal preferences; we all have our styles and tastes. But except for the potential need to route the entire API through a single indirect jmp, I don't see any substantial design advantages in what you argue for.

13 hours ago, kktos said:

About how to pass the parms, it's open.

From a dev point of view, I like to call "things" that are not messing with my code. In other words, I have only a few registers on my 6502. Please, don't scramble them.

Therefore, I prefer to go towards the parms after the JSR entrypoint so I'm not using the registers for the call.

Don't you ever want to pass some values from your registers into the API functions you call? Do all or most of your calls have constant parameters? Or do you then effectively create self-modifying code where you write into the params after the JSR statement?

13 hours ago, kktos said:

But there are other ways to do that.

The main point here is that I want the call to have the minimalistic impact on my code.

So implied saving and restoring of registers is essential to you, I understand?

TomXP411 · Post by **TomXP411** » Mon Jan 25, 2021 9:54 pm

52 minutes ago, kktos said:

Thanks for sharing your thoughts.

I've actually spent a lot of time thinking about this, a while back. A couple of years ago, I started working on a kernel for another 65x based system, and so I considered several ways to handle system calls. I considered interrupts (the 65816 can pass parameters to software interrupts with the COP opcode) and a system call dispatcher. I also considered dynamic linking and mechanisms for handling programs spanning multiple banks. At one point, we even talked about simply porting DOS/65 straight over to this new system. (That probably would have been ideal, but it appears DOS/65 is not open source, and so we would have ended up with the same problems that David and team had with the Commander operating software.)

So I looked at why DOS and CP/M actually use their system call conventions: DOS uses INT 21h for the bulk of its system calls, and CP/M uses CALL 5.

In both instances, the use case is similar: stuff a register with a command, stuff another register with the argument, and issue the system call.

To print a character in DOS, a program stuffs DL with the character, AH with 02h, and performs INT 21h. This jumps to an address stored in the interrupt vectors at the start of RAM (there are actually 4 bytes per vector, since 8088 addresses use a segment register to extend addresses out to 20 bits). Since the interrupt vectors in the 8088 are part of the 8088 hardware design, we get this feature for free. Once the OS loads, it just needs to set vector 21h to the right address, and then applications can make use of it.

On the 8080, we have to work a little harder, since there's no hardware support for relocatable code. That's where MOVCPM comes in. This program actually relinks the CP/M code in RAM by adjusting the operand of JMP calls. Apparently, this was originally done by assembling the operating system twice: once at 0h and once at 100h. The bytes that changed in the second copy were all internal jump instructions, and MOVCPM contains a map of those instructions and their addresses relative to the start of the program. So on an 8080 system, a user actually needs to run MOVCPM and then SYSGEN a new boot disk if he changes the amount of RAM in his system or when he installs CP/M for the first time. (I assume the CP/M distribution came set up out of the box with the smallest possible memory configuration; it was like that in the recent version of CP/M I used to upgrade my Altair.)
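The double-assembly trick described above can be sketched in C. This is only an illustration of the technique as the paragraph describes it, not MOVCPM's actual data format: the function names, the one-byte-per-position map, and the page-aligned target are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the image assembled at 0000h with the one assembled at 0100h.
   Shifting the origin by one page changes only the high bytes of
   internal absolute addresses, so every differing byte is marked as
   needing relocation. Returns the number of marked bytes. */
static size_t build_reloc_map(const uint8_t *at_0000, const uint8_t *at_0100,
                              size_t len, uint8_t *map)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        map[i] = (uint8_t)(at_0000[i] != at_0100[i]);
        count += map[i];
    }
    return count;
}

/* Relocate the 0000h image to a page-aligned base by adding the
   target page number to every marked high byte. */
static void relocate(uint8_t *image, const uint8_t *map, size_t len,
                     uint16_t new_base)
{
    assert((new_base & 0x00FFu) == 0); /* sketch handles whole pages only */
    uint8_t page = (uint8_t)(new_base >> 8);
    for (size_t i = 0; i < len; i++) {
        if (map[i]) {
            image[i] = (uint8_t)(image[i] + page);
        }
    }
}
```

For example, an 8080 `JMP 0005h` (bytes C3 05 00) becomes C3 05 01 when assembled at 0100h, so only the third byte is marked, and relocating to E400h turns it into C3 05 E4.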

So barring those reasons, all of which are based around dynamic relocation of code, I came to the conclusion that the simple jump table was the best choice for the public API. (That's not to say all of their design choices were the best. I do not like their string handling, for example. Strings should be null terminated, rather than forcing the programmer to pass the length in whenever a string is referenced. This solves several problems brought on by Commodore's design.)

pzembrod · Post by **pzembrod** » Mon Jan 25, 2021 9:56 pm

2 hours ago, TomXP411 said:

Honestly, Commodore knew what they were doing with the KERNAL jump table ...

Hi Tom, I hadn't read your reply yet when I composed my most recent post. You already said it all. My post is just a repetition with different words.

TomXP411 · Post by **TomXP411** » Tue Jan 26, 2021 12:39 am

2 hours ago, pzembrod said:

Hi Tom, I hadn't read your reply yet when I composed my most recent post. You already said it all. My post is just a repetition with different words.

It never hurts to have more than one perspective on things.

kktos · Post by **kktos** » Tue Jan 26, 2021 8:10 am

Tough crowd.

Looks like I wasn't good on that one. Doh. Shame on me.

Anyway, that's a pretty good exercise! There are things I took for granted thanks to some XP. And I talked about those as if they were obvious. They are not. Obviously.

On a personal note, I think I will try to use this exercise in my team while doing code review. Ask the one presenting to explain a concept to others.

I learn every day. And that makes life so interesting.

OK, so, now, back to the board to rethink the whole explanation.

BruceMcF · Post by **BruceMcF** » Tue Jan 26, 2021 8:13 am

12 hours ago, TomXP411 said:

A single point of entry is a liability, as you're going to waste at least 20 cycles just loading the jump address, plus additional cycles marshaling parameters. And let's not forget that using up a parameter for your function name means losing that space for passing parameters to your functions. This increases complexity, which either decreases performance or decreases reliability.

A direct index into a vector table is a net 5 cycles ... two for LDX #n, six for JMP (addr,X), versus three for JMP n ... with a further six or seven if x must be preserved.

But for toolbox ROM blocks such as table based multiply and divide routines, the point is to save clock cycles ... while nothing like the extreme slow down of passing parameters embedded inline after the subroutine call, even 5 extra clocks is preferable to avoid.
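The cycle arithmetic in this post can be written out as a small C sketch. The counts are the ones quoted above for the 65C02; the names are hypothetical, and the cost of preserving X is left as a parameter because the post gives it only as "six or seven".

```c
#include <assert.h>

/* Cycle counts quoted in the post. */
enum {
    CYC_LDX_IMM     = 2, /* LDX #n */
    CYC_JMP_IND_ABX = 6, /* JMP (addr,X) */
    CYC_JMP_ABS     = 3  /* JMP n, the jump-table entry it replaces */
};

/* Net extra cycles of indexed dispatch over a plain jump-table entry.
   preserve_x_cycles is 0 when X may be clobbered, otherwise the cost
   of saving and restoring it. */
static int net_dispatch_overhead(int preserve_x_cycles)
{
    assert(preserve_x_cycles >= 0);
    return CYC_LDX_IMM + CYC_JMP_IND_ABX - CYC_JMP_ABS + preserve_x_cycles;
}
```

With X free to clobber this gives the 5 cycles stated above; adding 7 cycles to preserve X brings it to 12.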

TomXP411 · Post by **TomXP411** » Tue Jan 26, 2021 10:33 am

2 hours ago, BruceMcF said:

A direct index into a vector table is a net 5 cycles ... two for LDX #n, six for JMP (addr,X), versus three for JMP n ... with a further six or seven if x must be preserved.

But for toolbox ROM blocks such as table based multiply and divide routines, the point is to save clock cycles ... while nothing like the extreme slow down of passing parameters embedded inline after the subroutine call, even 5 extra clocks is preferable to avoid.

There is no JMP (addr,X) on the 6502. There are only absolute and indirect jumps, no indexed modes.

It looks like the 65C02 does have an indirect, indexed jump, so that is faster - so you're right, it's not terrible on the 65C02 (and presumably the 65816.)

And yeah - I did the math on embedding parameters inline after the subroutine call. It's ridiculous. A huge chunk of the overhead in the DOS code is simply bank switching; I'd really like to see a more efficient bank switching mechanism, because the one they're currently using is at least 50 instructions... and when you're doing that for every character of a file read, that's a ridiculous amount of overhead. (I counted the number of steps to read a single byte from SD, and it's something like 180 instructions.)

I'm thinking that maybe the most commonly called functions need to be set up with dedicated function calls... certainly the "read from disk" stuff, since that's horribly expensive to do on every single byte.

BruceMcF · Post by **BruceMcF** » Tue Jan 26, 2021 11:01 am

33 minutes ago, TomXP411 said:

There is no JMP (addr,X) on the 6502. There are only absolute and indirect jumps, no indexed modes.

It looks like the 65C02 does have an indirect, indexed jump, so that is faster - so you're right, it's not terrible on the 65C02 (and presumably the 65816.)

And yeah - I did the math on embedding parameters inline after the subroutine call. It's ridiculous. A huge chunk of the overhead in the DOS code is simply bank switching; I'd really like to see a more efficient bank switching mechanism, because the one they're currently using is at least 50 instructions... and when you're doing that for every character of a file read, that's a ridiculous amount of overhead. (I counted the number of steps to read a single byte from SD, and it's something like 180 instructions.)

I'm thinking that maybe the most commonly called functions need to be set up with dedicated function calls... certainly the "read from disk" stuff, since that's horribly expensive to do on every single byte.

One approach is to have a short trampoline entry point and to specify that the toolbox ROM is called WITH the Kernel BANK in place ... which avoids the need to save the current bank. The lowest overhead for that is for Toolbox routines to end with a call to "ToolEnd".

ToolStart: LDA #ToolROM : STA ROMBank : JMP ($C000,X)

ToolEnd: LDA #KernelROM : STA ROMBank : RTS

I think that is about net 20 clocks overhead (counting JMP ToolEnd, not counting the common RTS somewhere for all approaches), and Kernel routines are called without trampolining.

That also implies that the vectors on the Kernel ROM don't need a JMP ($C0xx,X) in the ROM, so it can start with just a JMP INITIALIZE to set up the trampoline.

An advantage of that approach is that all Toolkit ROMs would have a common exit routine, so it might be placed in a fixed location, like $07FB, consuming #ToolKits*7 + 5 bytes of Golden RAM.

------------------------

NB. Yes, the 65816 has the same opcode ... it's mostly the zero page single bit manipulation opcodes in the 65C02 that are omitted to make room for the "dual 8bit" chaining operations abilities.