Megalextoria
Retro computing and gaming, sci-fi books, tv and movies and other geeky stuff.

Thoughts about JIT 65SC02 emulation? [message #394747] Wed, 20 May 2020 17:55
Steve Nickolas
So I've been debating getting another XT or AT, and wondering about ways
to get them to do reasonable emulation of an Apple ][. (I know apl2em
will run at a reasonable speed on a 286/12.) My conclusion was that if it
is possible at all, it would require a dynamic recompiler or just-in-time
recompiler. Of course, this is out of my league.

What I'm thinking of doing is using some sort of precompiled translated
ROM, implementing ProDOS-8 in the emulator itself, and then implementing a
128K Apple //e in the most minimal environment possible.

-uso.
Re: Thoughts about JIT 65SC02 emulation? [message #394770 is a reply to message #394747] Thu, 21 May 2020 11:34
Originally posted by: fadden

On Wednesday, May 20, 2020 at 2:56:00 PM UTC-7, Steve Nickolas wrote:
> So I've been debating getting another XT or AT, and wondering about ways
> to get them to do reasonable emulation of an Apple ][. (I know apl2em
> will run at a reasonable speed on a 286/12.) My conclusion was that if it
> is possible at all, it would require a dynamic recompiler or just-in-time
> recompiler. Of course, this is out of my league.

Do you have stats on how much of emulation performance is instruction fetch/decode/execute vs. other emulation tasks?

You still need to track elapsed cycles, check for accesses to memory-mapped I/O locations, update the hi-res screen, etc. I'm wondering how much benefit can be derived from compilation, particularly for hand-written 6502 assembler. Much of it may not decompose into nice basic blocks. Some performance-critical routines use self-modifying code, potentially negating the benefit.

In any event, see:
https://web.archive.org/web/20110618083246/http://altdevblogaday.org/2011/06/12/jit-cpu-emulation-a-6502-to-x86-dynamic-recompiler-part-1/
Re: Thoughts about JIT 65SC02 emulation? [message #394851 is a reply to message #394770] Sat, 23 May 2020 10:11
TomCh
On Thursday, 21 May 2020 16:34:10 UTC+1, fadden wrote:
> On Wednesday, May 20, 2020 at 2:56:00 PM UTC-7, Steve Nickolas wrote:
>> So I've been debating getting another XT or AT, and wondering about ways
>> to get them to do reasonable emulation of an Apple ][. (I know apl2em
>> will run at a reasonable speed on a 286/12.) My conclusion was that if it
>> is possible at all, it would require a dynamic recompiler or just-in-time
>> recompiler. Of course, this is out of my league.
>
> Do you have stats on how much of emulation performance is instruction fetch/decode/execute vs. other emulation tasks?
>
> You still need to track elapsed cycles, check for accesses to memory-mapped I/O locations, update the hi-res screen, etc. I'm wondering how much benefit can be derived from compilation, particularly for hand-written 6502 assembler. Much of it may not decompose into nice basic blocks. Some performance-critical routines use self-modifying code, potentially negating the benefit.
>
> In any event, see:
> https://web.archive.org/web/20110618083246/http://altdevblogaday.org/2011/06/12/jit-cpu-emulation-a-6502-to-x86-dynamic-recompiler-part-1/

I also wonder how much benefit you'd get from compilation/JIT emulation.

And I agree that surely most(?) games/demos/productivity(?) s/w use SMC (self-modifying code) for better performance, which as you say may negate the effect of any cached recompiled code.

Anyway, I was curious about the breakdown of time for AppleWin. Here are some quick numbers for the latest AppleWin 1.29.12 (release build, w/out Visual Studio attached, RGB video) running French Touch's MadEffect2 demo:

Perf breakdown:
.. CPU % = 18.217025
.. Video % = 68.534746
.... NTSC % = 61.095756
.... refresh % = 7.438990
.. Audio % = 11.952201
.... Speaker % = 2.079797
.... MB % = 9.872404
.. Other % = 1.296029
.. TOTAL % = 100.000000

I just put markers in the main emu loop, where it does useful work; but didn't include the time spent in the big "sleep" which is used to peg emulation speed to 1MHz (or whatever MHz speed you selected).

NB. CPU% can vary a lot from title-to-title depending on things like heavy bank-switching (which is just memcpy) but can quickly impact performance if the 6502 is written efficiently for emulators! :-)

Heavy bank-switching (memcpy) will surely dominate even recompiled code, meaning no benefit for compiled over interpreted 6502 in these cases.

AppleWin is single-threaded and in release-build it only consumes ~2% of an AMD64 CPU core (3GHz). So I could offload some tasks to other compute units (eg. offload video to the GPU, and probably audio to another CPU or sound card), but there's no need for this added complexity when it runs fast enough already.

btw. in AppleWin's "full-speed" mode, video isn't updated per opcode, just once per video frame. Also audio is muted. So CPU% tends to ~95%, and video is just refresh (~5%).

Tom
Re: Thoughts about JIT 65SC02 emulation? [message #394852 is a reply to message #394851] Sat, 23 May 2020 10:14
TomCh
On Saturday, 23 May 2020 15:11:09 UTC+1, TomCh wrote:
> NB. CPU% can vary a lot from title-to-title depending on things like heavy bank-switching (which is just memcpy) but can quickly impact performance if the 6502 is written efficiently for emulators! :-)
>

typo:

but can quickly impact performance if the 6502 [code] is *NOT* written efficiently for emulators! :-)
Re: Thoughts about JIT 65SC02 emulation? [message #394853 is a reply to message #394851] Sat, 23 May 2020 10:50
Originally posted by: fadden

On Saturday, May 23, 2020 at 7:11:09 AM UTC-7, TomCh wrote:
> AppleWin is single-threaded and in release-build it only consumes ~2% of an AMD64 CPU core (3GHz). So I could offload some tasks to other compute units (eg. offload video to the GPU, and probably audio to another CPU or sound card), but there's no need for this added complexity when it runs fast enough already.

It also wouldn't help the OP's scenario of running the emulator on an old 80286 machine, which is single-core without a GPU.

I get twitchy when people talk about adding threads to speed up closely-coupled tasks. The performance gains are often minor because the threads spend a lot of time waiting on each other, and there's a dramatic spike in development complexity, notably because it creates a whole new class of bug to hunt down. On the bright side, the x86 memory model is relatively strong.
Re: Thoughts about JIT 65SC02 emulation? [message #394859 is a reply to message #394852] Sat, 23 May 2020 11:35
TomCh
On Saturday, 23 May 2020 15:14:15 UTC+1, TomCh wrote:
> On Saturday, 23 May 2020 15:11:09 UTC+1, TomCh wrote:
>> NB. CPU% can vary a lot from title-to-title depending on things like heavy bank-switching (which is just memcpy) but can quickly impact performance if the 6502 is written efficiently for emulators! :-)
>>
>
> typo:
>
> but can quickly impact performance if the 6502 [code] is *NOT* written efficiently for emulators! :-)

Additionally: since video dominates CPU (here by a factor of ~4), the focus should be on speeding up the video part, not the CPU part... assuming you want opcode-accurate video, eg. for beam-racing demos.
Re: Thoughts about JIT 65SC02 emulation? [message #394912 is a reply to message #394859] Sun, 24 May 2020 16:42
Steve Nickolas
On Sat, 23 May 2020, TomCh wrote:

> On Saturday, 23 May 2020 15:14:15 UTC+1, TomCh wrote:
>> On Saturday, 23 May 2020 15:11:09 UTC+1, TomCh wrote:
>>> NB. CPU% can vary a lot from title-to-title depending on things like heavy bank-switching (which is just memcpy) but can quickly impact performance if the 6502 is written efficiently for emulators! :-)
>>>
>>
>> typo:
>>
>> but can quickly impact performance if the 6502 [code] is *NOT* written efficiently for emulators! :-)
>
> Additionally: since video dominates CPU (here by a factor of ~4) then
> the focus should be in speeding up the video part, not the CPU part...
> assuming you want opcode-accurate video, eg. for beam-racing demos.
>

That would be about hopeless. I don't really expect better video
emulation on a 286 than apl2em.

-uso.
Re: Thoughts about JIT 65SC02 emulation? [message #394924 is a reply to message #394851] Mon, 25 May 2020 02:04
Originally posted by: kegs

In article <85244a0e-197a-4f6d-ad37-1a471521f878@googlegroups.com>,
TomCh <tomcharlesworth26@gmail.com> wrote:
[snip]
> I also wonder how much benefit you'd get from compilation/JIT emulation.
>
> And I agree that surely most(?) games/demos/productivity(?) s/w use SMC
> (self-modifying code) for better performance, which as you say may
> negate the effect of any cached recompiled code.
>
> Anyway, I was curious about the breakdown of time for AppleWin. Here are
> some quick numbers for the latest AppleWin 1.29.12 (release build, w/out
> Visual Studio attached, RGB video) running French Touch's MadEffect2
> demo:
>
> Perf breakdown:
> . CPU % = 18.217025
> . Video % = 68.534746
> ... NTSC % = 61.095756
> ... refresh % = 7.438990
> . Audio % = 11.952201
> ... Speaker % = 2.079797
> ... MB % = 9.872404
> . Other % = 1.296029
> . TOTAL % = 100.000000
>
> I just put markers in the main emu loop, where it does useful work; but
> didn't include the time spent in the big "sleep" which is used to peg
> emulation speed to 1MHz (or whatever MHz speed you selected).

Applewin takes a straightforward approach to emulation--doing a lot of
work for each instruction so the machine state always makes sense. Here's the
main loop in Applewin cpu6502.h:

do {
    if(GetActiveCpu() == CPU_Z80) {
        // do z80 stuff
    } else {
        Fetch(iOpcode, uExecutedCycles);
        switch(iOpcode) {
        ....
        case 0x09: IMM ORA CYC(2) break;
        ...
        }
    }
    CheckInterruptSources(uExecutedCycles, bVideoUpdate);
    NMI(...);
    IRQ(...);
    if(bVideoUpdate) {
        NTSC_VideoUpdateCycles();
    }
} while (uExecutedCycles < uTotalCycles);

This is a lot of work. It can be made as simple as:

if(z80) {
    // z80 loop
} else while(1) {
    if(uExecutedCycles >= uTotalCycles) {
        // Some event is pending...handle it. Maybe return.
    }
    Fetch(iOpcode);
    switch(iOpcode) {
    ....
    }
}

where now any NMI/IRQ/video handling operation would need to change
the variable uTotalCycles (likely to '0') to kick in the slower processing.
And don't do video updates each instruction--push them off and do them
in batches. You can act as if you operated after each instruction
if you want to put in the complexity.

The problem is this can get complex quickly--IRQs/video changes/sounds/etc.
must be logged elsewhere (deferring the work until later) and pushed
onto the event queue if an interrupt is needed.
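A minimal sketch of that event-queue pattern (editor's sketch, not KEGS or AppleWin code; all names invented). The key idea is that scheduling an event just pulls in the stop point, so the inner CPU loop stays a single cheap comparison:

```c
#include <assert.h>

/* One pending event: fires when the emulated cycle count reaches 'when'. */
typedef struct { long when; void (*fire)(void); } Event;

static long g_cycles;        /* cycles executed so far */
static long g_stop_cycles;   /* CPU loop runs while g_cycles < g_stop_cycles */
static Event g_next_event;

static int g_irq_count;      /* demo payload: count fired "IRQs" */
static void raise_irq(void) { g_irq_count++; }

/* Scheduling an event that's due before the current stop point pulls the
 * stop point in; the CPU loop itself never checks for events directly. */
static void schedule_event(long when, void (*fire)(void))
{
    g_next_event.when = when;
    g_next_event.fire = fire;
    if (when < g_stop_cycles)
        g_stop_cycles = when;
}

/* Stand-in "CPU core": every instruction costs 2 cycles.  A real core would
 * fetch/dispatch opcodes here, with no per-instruction event checks. */
static void run_cpu(void)
{
    while (g_cycles < g_stop_cycles)
        g_cycles += 2;
    if (g_next_event.fire && g_cycles >= g_next_event.when) {
        g_next_event.fire();
        g_next_event.fire = 0;
    }
}

/* Demo: run a 100-cycle slice with an IRQ scheduled at cycle 10. */
static long demo_run(void)
{
    g_cycles = 0;
    g_stop_cycles = 100;
    schedule_event(10, raise_irq);
    run_cpu();              /* stops at cycle 10, fires the event... */
    g_stop_cycles = 100;
    run_cpu();              /* ...then finishes the slice */
    return g_cycles;
}
```

A real emulator would keep a sorted queue of events rather than a single slot, but the stop-point trick is the same.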

KEGS (an Apple IIgs emulator, so it's emulating a 65816) uses this as its
main loop:

while(fcycles < g_fcycles_stop) {
    opcode = FETCH();
    switch(opcode) {
    ....
    }
}

where the event queue logic changes the global variable g_fcycles_stop to
indicate the CPU instruction sequence needs to stop and do other processing.

I looked at the source for apl2em. It's pretty good, but it can be made
faster by: removing the delay after each instruction in Fetch; removing the
65c02 check in each instruction.

Here's Fetch from apl2em:

Fetch Proc Near
lodsb ; Load the next opcode byte
xor ah,ah ; Convert opcode to full word
shl ax,OP_BASE ; Convert opcode to address
mov di,ax ; Move routine address to DI
test dh,CPU_R ; Check for emulator interrupt
jnz _int ; Jump if emulator interrupt request
mov ax,cs:[Delay] ; Load delay counter
Wait: dec ax
jns Wait ; Loop until delay finished
jmp di ; Jump to correct opcode routine
_int:
Interrupt ; Go process emulator interrupt (Int 3)
Fetch Endp

Toss the "Wait" stuff, and it's not bad.

Fetch Proc Near
lodsb ; 5 cycles
xor ah,ah ; 2 cycles
shl ax,OP_BASE ; 5+6 cycles
mov di,ax ; 2 cycles
test dh,CPU_R ; 6 cycles
jnz _int ; 3 cycles if not taken
jmp di ; 7 cycles
_int:
Interrupt ; Go process emulator interrupt (Int 3)
Fetch Endp

So that's a total of 36 clocks in Fetch. If you want to emulate a 6502 at
1MHz, then since an average 6502 instruction takes 3 cycles (just a starting
estimate), you need to emulate an instruction within 3 microseconds. A
286 running at 12MHz would use 3 microseconds just in Fetch. But: it will
be hard to even emulate instructions in 36 cycles. The 286 is slow.
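Kent's budget arithmetic, made explicit (editor's sketch; integer math, so it is only exact when the division comes out even, as it does for these numbers):

```c
#include <assert.h>

/* How many host clocks can we spend per emulated 6502 instruction?
 * A 6502 at 1 MHz averaging ~3 cycles per instruction gives 3 us per
 * instruction; a 12 MHz 286 executes 12 clocks per microsecond, so the
 * whole budget is 36 host clocks -- which Fetch alone already uses. */
static long host_clocks_per_6502_insn(long host_mhz, long mhz_6502,
                                      long avg_6502_cycles)
{
    long us_per_insn = avg_6502_cycles / mhz_6502;   /* 3 us at 1 MHz   */
    return host_mhz * us_per_insn;                   /* 36 at 12 MHz    */
}
```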

Fetch could be changed to take about 9 cycles less pretty easily:

lodsb ; 5
mov ah,al ; 2
xor al,al ; 2 optional...
test dh,CPU_R ; 2
jnz _int ; 3
jmp ax ; 7

This gets it down to 19 cycles. The xor isn't needed if we're OK jumping to
base+0xffff. The original code jumped to (opcode * 0x40). This code jumps
to (opcode * 0x101). This is a slight problem for opcode 0xff. This code
also requires someone who cares about 8086 segment registers to figure out
where "jmp ax" will go. We may need the "mov di,ax" back to do "jmp di"
to use the right segment register.

JIT can remove the lodsb,mov,xor,jmp, saving 14 cycles. But we still likely
need the "test, jnz" which takes 5 cycles.

Let's look at apl2em's instructions to do LDA $2000. Let's assume we
got the opcode already.

DoAbs
which is:
lodsw ; 5 cycles
mov di,ax ; 2 cycles
OpLDA
which is:
call Read_Mem ; 7 cycles
mov dl,al ; 2 cycles
or dl,dl ; 2 cycles
lahf ; 2 cycles
and ah,CPU_N + CPU_Z ; 3 cycles
and dh,Not(CPU_N + CPU_Z) ; 3 cycles
or dh,ah ; 2 cycles

This is 28 cycles. And we haven't actually read the memory, we need
the Read_Memory routine.

Here's Read_memory:
mov bp,di ; 2 cycles
shr bp,8 ; 5 + 8 cycles
shl bp,1 ; 3 cycles
jmp cs:[bp + Read_table] ; 11 cycles?

Where the routine from Read_Table is:
mov al,ds:[di] ; 5 cycles (needs to do: call Read_mem)
ret ; 11 cycles

This can be optimized, but it's a start. Read_memory takes: 45 cycles.
The instruction itself takes 73 clocks. JIT can replace the DoAbs with
"mov di,$2000" which takes 2 cycles. This saves 5 clocks.

So, no JIT: 19+73 = 92 cycles. With JIT: 5+68 = 73 cycles.

So, the simple JIT is likely only about 26% faster (92 vs. 73 cycles). And that's assuming a perfect JIT.
You'd need to add self-modifying code detection, which may make JIT
a wash. A more complex JIT, one smart enough to try to optimize out
Read_Memory calls and flag setting, etc. could do better. But the
time to generate this code becomes very non-trivial. And it's conceptually
complex.

Basically, the 286 is very slow. With or without JIT, I don't think you can
emulate a 1MHz 6502 with the special handling an Apple II emulator needs.

Kent
Re: Thoughts about JIT 65SC02 emulation? [message #394933 is a reply to message #394924] Mon, 25 May 2020 04:22
TomCh
On Monday, 25 May 2020 07:04:33 UTC+1, Kent Dickey wrote:
> In article <85244a0e-197a-4f6d-ad37-1a471521f878@googlegroups.com>,
> TomCh wrote:
> [snip]
>> I also wonder how much benefit you'd get from compilation/JIT emulation.
>>
>> And I agree that surely most(?) games/demos/productivity(?) s/w use SMC
>> (self-modifying code) for better performance, which as you say may
>> negate the effect of any cached recompiled code.
>>
>> Anyway, I was curious about the breakdown of time for AppleWin. Here are
>> some quick numbers for the latest AppleWin 1.29.12 (release build, w/out
>> Visual Studio attached, RGB video) running French Touch's MadEffect2
>> demo:
>>
>> Perf breakdown:
>> . CPU % = 18.217025
>> . Video % = 68.534746
>> ... NTSC % = 61.095756
>> ... refresh % = 7.438990
>> . Audio % = 11.952201
>> ... Speaker % = 2.079797
>> ... MB % = 9.872404
>> . Other % = 1.296029
>> . TOTAL % = 100.000000
>>
>> I just put markers in the main emu loop, where it does useful work; but
>> didn't include the time spent in the big "sleep" which is used to peg
>> emulation speed to 1MHz (or whatever MHz speed you selected).
>
> Applewin has a straightforward approach to how to do emulation--to do a
> lot of work each instruction so the machine state makes sense. Here's the
> main loop in Applewin cpu6502.h:
>
> do {
> if(GetActiveCpu() == CPU_Z80) {
> // do z80 stuff
> } else {
> Fetch(iOpcode, uExecutedCycles);
> switch(iOpcode) {
> ....
> case 0x09: IMM ORA CYC(2) break;
> ...
> }
> }
> CheckInterruptSources(uExecutedCycles, bVideoUpdate);
> NMI(...);
> IRQ(...)
> if(bVideoUpdate) {
> NTSC_VideoUpdateCycles();
> }
> } while (uExecutedCycles < uTotalCycles);
>
> This is a lot of work. It can be made as simple as:
>
> if(z80) {
> // z80 loop
> } else while(1) {
> if(uExecutedCycles >= uTotalCycles) {
> // Some event is pending...handle it. Maybe return.
> }
> Fetch(iOpcode);
> switch(iOpcode) {
> ....
> }
> }
>
> where now any NMI/IRQ/Video handling operation would need to change
> the variable uTotalCycles (likely to '0') to kick in the slower processing.
> And don't do video updates each instruction--push them off to do them
> in batches. You can get act as if you operated after each instruction
> if you want to put in the complexity.
>
> The problem is this can get complex quickly--IRQs/video changes/sounds/etc.
> must be logged elsewhere (defering doing the work until later) and pushed
> onto the event queue if it needs an interrupt.
>
> [snip]

Hi Kent - great to get your perspective on this.

re. AppleWin: Both the event-queue for IRQs & deferred video rendering ideas have been discussed amongst us: Nick W has bent my ear about the former on a few occasions. But modern machines are fast enough to currently warrant prioritising the effort on this.

The AppleWin 4:1 ratio for video:CPU was a bit of a surprise, so trying the deferred video would be somewhat interesting, just to see how it re-balances this ratio.

Tom
Re: Thoughts about JIT 65SC02 emulation? [message #394934 is a reply to message #394924] Mon, 25 May 2020 04:32
TomCh
On Monday, 25 May 2020 07:04:33 UTC+1, Kent Dickey wrote:
> In article <85244a0e-197a-4f6d-ad37-1a471521f878@googlegroups.com>,
> TomCh wrote:
> [snip]
>> I also wonder how much benefit you'd get from compilation/JIT emulation.
>>
>> And I agree that surely most(?) games/demos/productivity(?) s/w use SMC
>> (self-modifying code) for better performance, which as you say may
>> negate the effect of any cached recompiled code.
>>
>> Anyway, I was curious about the breakdown of time for AppleWin. Here are
>> some quick numbers for the latest AppleWin 1.29.12 (release build, w/out
>> Visual Studio attached, RGB video) running French Touch's MadEffect2
>> demo:
>>
>> Perf breakdown:
>> . CPU % = 18.217025
>> . Video % = 68.534746
>> ... NTSC % = 61.095756
>> ... refresh % = 7.438990
>> . Audio % = 11.952201
>> ... Speaker % = 2.079797
>> ... MB % = 9.872404
>> . Other % = 1.296029
>> . TOTAL % = 100.000000
>>
>> I just put markers in the main emu loop, where it does useful work; but
>> didn't include the time spent in the big "sleep" which is used to peg
>> emulation speed to 1MHz (or whatever MHz speed you selected).
>
> Applewin has a straightforward approach to how to do emulation--to do a
> lot of work each instruction so the machine state makes sense. Here's the
> main loop in Applewin cpu6502.h:
>
> do {
> if(GetActiveCpu() == CPU_Z80) {
> // do z80 stuff
> } else {
> Fetch(iOpcode, uExecutedCycles);
> switch(iOpcode) {
> ....
> case 0x09: IMM ORA CYC(2) break;
> ...
> }
> }
> CheckInterruptSources(uExecutedCycles, bVideoUpdate);
> NMI(...);
> IRQ(...)
> if(bVideoUpdate) {
> NTSC_VideoUpdateCycles();
> }
> } while (uExecutedCycles < uTotalCycles);
>
> This is a lot of work. It can be made as simple as:
>
> if(z80) {
> // z80 loop
> } else while(1) {
> if(uExecutedCycles >= uTotalCycles) {
> // Some event is pending...handle it. Maybe return.
> }
> Fetch(iOpcode);
> switch(iOpcode) {
> ....
> }
> }
>
> where now any NMI/IRQ/Video handling operation would need to change
> the variable uTotalCycles (likely to '0') to kick in the slower processing.
> And don't do video updates each instruction--push them off to do them
> in batches. You can get act as if you operated after each instruction
> if you want to put in the complexity.
>
> The problem is this can get complex quickly--IRQs/video changes/sounds/etc.
> must be logged elsewhere (defering doing the work until later) and pushed
> onto the event queue if it needs an interrupt.
>

Hi Kent - great to get your perspective on this.

re. AppleWin: Both the event-queue for IRQs & deferred video rendering ideas have been discussed amongst us: Nick W has bent my ear about the former on a few occasions. But modern machines are currently fast enough not to warrant prioritising the effort on this.

The AppleWin 4:1 ratio for video:CPU was a bit of a surprise, so trying the deferred video would be somewhat interesting, just to see how it re-balances this ratio.

Tom
Re: Thoughts about JIT 65SC02 emulation? [message #394942 is a reply to message #394924] Mon, 25 May 2020 11:01
Originally posted by: fadden

On Sunday, May 24, 2020 at 11:04:33 PM UTC-7, Kent Dickey wrote:
> while(fcycles < g_fcycles_stop) {
> opcode = FETCH();
> switch(opcode) {
> ....
> }
> }

You can potentially gain a little more by switching to a threaded interpreter. The above is:

fetch instruction byte
look up opcode handler address in switch table
branch to it
execute handler
branch to top of loop

If you put the opcode fetch at the end of every opcode handler, and make each handler a fixed number of bytes, you can change each step to:

fetch instruction byte
branch to computed address
execute handler

So one fewer branch and no handler address lookup. You have to pick the size of the handler code carefully; too big and you're wasting i-cache, too small and too many of the handlers will need to jump away and back. Ideally it's a power of 2 so you can just shift the opcode and add it to the base address.

You may be able to skip video updates and timing checks from some instruction handlers if you're less worried about cycle-accurate behavior. While using a long series of STA instructions to update the screen is common, I've never seen a long series of ADCs. So the ADC handler just does the add and continues to the next instruction, and hopes that some other instruction will be executed shortly and nobody will notice if the ride gets bumpy.

You can JIT concatenate the handlers together without the fetch/branch part at the end to avoid the overhead, which is a huge win on architectures where a failed branch predict is bad. (IIRC the papers that were playing with this technique were using a Pentium 4.) Because it's not a full compile, the approach is tolerant of self-modifying code that alters operands, but not instructions. Well-behaved code will tend to modify the same locations repeatedly, so a mechanism that marks regions as "do not compile" can be effective.

Or you just throw a 4GHz CPU at the problem and call it a day. :-)
Re: Thoughts about JIT 65SC02 emulation? [message #394943 is a reply to message #394934] Mon, 25 May 2020 12:27
Originally posted by: kegs

In article <706ab810-724d-4fd3-b256-682ebbabdeba@googlegroups.com>,
TomCh <tomcharlesworth26@gmail.com> wrote:
> On Monday, 25 May 2020 07:04:33 UTC+1, Kent Dickey wrote:
[snip]
>> This is a lot of work. It can be made as simple as:
>>
>> if(z80) {
>> // z80 loop
>> } else while(1) {
>> if(uExecutedCycles >= uTotalCycles) {
>> // Some event is pending...handle it. Maybe return.
>> }
>> Fetch(iOpcode);
>> switch(iOpcode) {
>> ....
>> }
>> }
>>
>> where now any NMI/IRQ/Video handling operation would need to change
>> the variable uTotalCycles (likely to '0') to kick in the slower processing.
>> And don't do video updates each instruction--push them off to do them
>> in batches. You can get act as if you operated after each instruction
>> if you want to put in the complexity.
>>
>> The problem is this can get complex quickly--IRQs/video changes/sounds/etc.
>> must be logged elsewhere (defering doing the work until later) and pushed
>> onto the event queue if it needs an interrupt.
>>
>
> Hi Kent - great to get your perspective on this.
>
> re. AppleWin: Both the event-queue for IRQs & deferred video rendering
> ideas have been discussed amongst us: Nick W has bent my ear about the
> former on a few occasions. But modern machines are fast enough to
> currently not warrant prioritising the effort on this.
>
> The AppleWin 4:1 ratio for video:CPU was a bit of a surprise, so trying
> the deferred video would be somewhat interesting, just to see how it
> re-balances this ratio.
>
> Tom

Just to be clear, I don't think Applewin should change. It would make the
code much more complex, and it doesn't really get you anything. It would
allow you to run "full speed" at an effective 500MHz instead of 300MHz (or
something like that). Who cares? I was just pointing out that there are
other ways to do this, and that one shouldn't assume all emulators have to
work this way.

Kent
Re: Thoughts about JIT 65SC02 emulation? [message #394952 is a reply to message #394942] Mon, 25 May 2020 12:45
Originally posted by: kegs

In article <a6aae07e-d654-4d97-b6b6-f0f94fccd096@googlegroups.com>,
fadden <thefadden@gmail.com> wrote:
> On Sunday, May 24, 2020 at 11:04:33 PM UTC-7, Kent Dickey wrote:
>> while(fcycles < g_fcycles_stop) {
>> opcode = FETCH();
>> switch(opcode) {
>> ....
>> }
>> }
>
> You can potentially gain a little more by switching to a threaded
> interpreter. The above is:
>
> fetch instruction byte
> look up opcode handler address in switch table
> branch to it
> execute handler
> branch to top of loop
>
> If you put the opcode fetch at the end of every opcode handler, and make
> each handler a fixed number of bytes, you can change each step to:
>
> fetch instruction byte
> branch to computed address
> execute handler
>
> So one fewer branch and no handler address lookup. You have to pick the
> size of the handler code carefully; too big and you're wasting i-cache,
> too small and too many of the handlers will need to jump away and back.
> Ideally it's a power of 2 so you can just shift the opcode and add it to
> the base address.
>
> You may be able to skip video updates and timing checks from some
> instruction handlers if you're less worried about cycle-accurate
> behavior. While using a long series of STA instructions to update the
> screen is common, I've never seen a long series of ADCs. So the ADC
> handler just does the add and continues to the next instruction, and
> hopes that some other instruction will be executed shortly and nobody
> will notice if the ride gets bumpy.
>
> You can JIT concatenate the handlers together without the fetch/branch
> part at the end to avoid the overhead, which is a huge win on
> architectures where a failed branch predict is bad. (IIRC the papers
> that were playing with this technique were using a Pentium 4.) Because
> it's not a full compile, the approach is tolerant of self-modifying code
> that alters operands, but not instructions. Well-behaved code will tend
> to modify the same locations repeatedly, so a mechanism that marks
> regions as "do not compile" can be effective.
>
> Or you just throw a 4GHz CPU at the problem and call it a day. :-)

I'm not sure modern CPUs really handle computed branch addresses any better
than loading an address from a table and jumping indirectly through it.

However, your proposal has another benefit which I think would help.
Emulators do have a problem with the dispatch branch being
mispredicted constantly. Having each instruction fetch the next opcode and
do a computed goto (a GCC extension, also supported by LLVM; it's not
standard C) would create more branches for the host CPU to predict--so
that ADC followed by STA could be correctly predicted much of the time if
the 6502 instruction stream often sees ADC followed by STA. The branch at
the end of the ADC instruction to the next opcode would be predicted
independently from the branch at the end of any other opcode, so the host
CPU prediction logic could learn patterns.

Like this:

void *label[256];
label[0x00] = &&label00;
label[0x01] = &&label01;
....

opcode = next_opcode();
goto *label[opcode];

....
label65:
    // do ADC Dloc
    if(events) {
        return;
    }
    opcode = next_opcode();
    goto *label[opcode];
....
label85:
    // do STA Dloc
    if(events) {
        return;
    }
    opcode = next_opcode();
    goto *label[opcode];

This would likely be an improvement. But to use it, you'd be locking
yourself to GCC/LLVM (or worse, writing it in assembly).
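For the curious, here is a compilable miniature of the same technique (editor's sketch using GCC/Clang computed goto; a made-up three-opcode machine, not a real 6502 core):

```c
#include <assert.h>
#include <stdint.h>

/* Tiny threaded interpreter: each handler fetches and dispatches the next
 * opcode itself, so every emulated instruction ends in its own indirect
 * branch that the host branch predictor can learn independently. */
static uint8_t run(const uint8_t *prog)
{
    /* Made-up opcodes: 0 = HALT, 1 = LDA #imm, 2 = ADC #imm. */
    void *label[3] = { &&op_halt, &&op_lda, &&op_adc };
    uint8_t a = 0;                 /* accumulator */
    const uint8_t *pc = prog;

    goto *label[*pc++];            /* dispatch the first opcode */

op_lda:
    a = *pc++;
    goto *label[*pc++];            /* fetch+dispatch at handler's end */
op_adc:
    a += *pc++;                    /* carry ignored in this toy */
    goto *label[*pc++];
op_halt:
    return a;
}
```

For example, `run((const uint8_t[]){1, 5, 2, 3, 0})` loads 5, adds 3, and halts with the accumulator at 8.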

Kent
Re: Thoughts about JIT 65SC02 emulation? [message #394961 is a reply to message #394952] Mon, 25 May 2020 15:11
Originally posted by: fadden

On Monday, May 25, 2020 at 9:45:15 AM UTC-7, Kent Dickey wrote:
> I'm not sure modern CPUs really handle computed branch addresses any better
> than loading an address from a table and jumping indirectly through it.

Agreed; the advantage is primarily in avoiding an additional load from memory. With the basic approach you have to load the opcode, then immediately load the handler address from the switch table, so you can't branch until two serialized loads finish. The switch table should be in the fastest cache the CPU has, so you won't wait for long, but it's still a bubble unless the compiler is able to schedule other stuff in.
Re: Thoughts about JIT 65SC02 emulation? [message #394977 is a reply to message #394961] Mon, 25 May 2020 21:29
Originally posted by: kegs

In article <a7582e6d-6894-4a80-9550-802ce9bba0c5@googlegroups.com>,
fadden <thefadden@gmail.com> wrote:
> On Monday, May 25, 2020 at 9:45:15 AM UTC-7, Kent Dickey wrote:
>> I'm not sure modern CPUs really handle computed branch addresses any better
>> than loading an address from a table and jumping indirectly through it.
>
> Agreed; the advantage is primarily in avoiding an additional load from
> memory. With the basic approach you have to load the opcode, then
> immediately load the handler address from the switch table, so you can't
> branch until two serialized loads finish. The switch table should be in
> the fastest cache the CPU has, so you won't wait for long, but it's
> still a bubble unless the compiler is able to schedule other stuff in.

I know you know this, but let me explain some basics for anyone else who
might be interested.

That's sort of true, sort of not true. It was definitely true on
in-order CPUs, but there aren't a lot of those these days.

Out-of-order CPUs (basically, any desktop, most phones) run ahead, and
predict through branches when they can. So if the CPU can properly predict
the branch, then the operations before the branch, even the ones directly
needed to resolve the branch, don't really matter. It basically
guesses, and then figures out later if it was right. It doesn't wait to
guess, which is why the data dependency isn't necessarily a problem.

One pretty useful way to think of a CPU is that there are 30-50 instructions
in flight, and the CPU is fetching (and partially executing) 30-50
instructions ahead of where it's certain the instructions are done.
So, if it's waiting for a data cache miss to resolve some old instruction,
then the future instructions, which don't depend on that data cache miss,
are all executed "for free", and dependencies don't really matter. But
a mispredicted branch is a problem--it has to toss lots of work, and
go down a new path, and then dependencies matter again on these new
instructions.

That's what I was pointing out--the threaded code creates more branches,
and some of those can be predicted quite well, meaning it could be quite a
bit faster.
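Kent's point about threaded code creating more (and better-predicted) branch sites can be sketched with the GCC/Clang computed-goto extension (not standard C; the opcodes here are hypothetical). Each handler ends with its own copy of the dispatch jump, so each indirect branch gets its own branch-predictor entry instead of all opcodes funneling through one:

```c
#include <assert.h>
#include <stdint.h>

/* Direct-threaded dispatch sketch: opcodes 0=INC, 1=DEC, 2=HALT. */
int run_threaded(const uint8_t *prog) {
    void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
    const uint8_t *pc = prog;

    goto *labels[*pc++];       /* initial dispatch */

do_inc:
    acc++;
    goto *labels[*pc++];       /* distinct branch site #1 */
do_dec:
    acc--;
    goto *labels[*pc++];       /* distinct branch site #2 */
do_halt:
    return acc;
}
```

A branch site at the end of an LDA handler, for example, tends to see recurring successor patterns in real 6502 code, which the predictor can learn; a single shared dispatch branch sees a much noisier stream of targets.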

One thing that is still a limit is just the number of instructions--they
still have to be executed and retired, so fewer instructions are generally
better. Dependencies limit the execution resources that can be used,
and so they impose a speed limit over time. But often that cost can be amortized away,
and emulated instructions provide a lot of parallelism. Setting the
flags, counting clock cycles, etc. can all be done in parallel. KEGS
uses FP variables for cycle counting SOLELY because my main target platform,
a PA-RISC CPU that was in-order with limited ability to "dual-issue" (execute
two non-dependent instructions in parallel), had a nearly 100% ability to
issue an FP and an integer instruction in parallel. So the cycle
counting using FP registers was basically "free", as long as they were
sprinkled between integer instructions. At one time, KEGS did more FP
operations per second than some of the SpecFP suite (which often has a lot
of cache misses and so a lot of dead time, and some benchmarks have lots of
non-FP operations at times). For each 65816 instruction,
KEGS does the FP comparison, fcycles += 2, (and fcycles++ for each additional
clock cycle). This worked out to 4 FP ops per 3 emulated cycles, so a
60MHz CPU running a 65816 at 6MHz was doing 8 million Flops/second.
BURST running a 65816 at 200MHz now (the max, it's throttled, but any modern
CPU can easily hit this limit) is doing 266MFlops.
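Functionally, the FP cycle counting is just an accumulator kept in a double; the win on the PA-RISC target was that the FP adds paired with integer work for free. A minimal sketch (the per-opcode cycle costs here are illustrative, not a real 65816 timing table):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative cycle costs for three hypothetical opcodes. */
static const double cost[3] = { 2.0, 2.0, 3.0 };

/* Accumulate emulated cycles in an FP variable, KEGS-style. */
double run_cycles(const uint8_t *prog, int n) {
    double fcycles = 0.0;
    for (int i = 0; i < n; i++)
        fcycles += cost[prog[i]];  /* FP add per emulated instruction */
    return fcycles;
}
```

The arithmetic in the post checks out: 4 FP ops per 3 emulated cycles means a 6 MHz 65816 costs 8 MFlops, and 200 MHz costs about 266 MFlops.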

Kent
Re: Thoughts about JIT 65SC02 emulation? [message #394991 is a reply to message #394924] Tue, 26 May 2020 06:58 Go to previous message
sicklittlemonkey is currently offline  sicklittlemonkey
Messages: 570
Registered: October 2012
Karma: 0
Senior Member
On Monday, 25 May 2020 16:04:33 UTC+10, Kent Dickey wrote:
> KEGS (an Apple IIgs emulator, so it's emulating a 65816) uses this as its
> main loop:
>
> while(fcycles < g_fcycles_stop) {
> opcode = FETCH();
> switch(opcode) {
> ....
> }
> }

Nice analysis, going deeper into modern CPU pipelines than I've delved. The rule of thumb that tallied with my experience in the distant past was that you needed 10x the processing power of the machine you were emulating.

Some of the suggestions here reminded me of an Apple II emulator a friend and I worked on in 1995 for the Acorn A5000 with an ARM 3. Here's the source for the fetch & execute.

CPU_ExecEventQ ROUT
;(Event queue stuff branched to from below)
....

CPU_ExecEpilog
MOV lr, pc ;(save NZCV)
MOV r0, mcc, LSR #24
LDR r2, =Snd_Buffer ;inc sound cycle count
LDR r1, [r2, #0]
ADD r1, r1, r0
STR r1, [r2, #0]
;
BIC mcc, mcc, #&FF000000
SUBS mcc, mcc, r0
BMI CPU_ExecEventQ ;process event queue?
;
CPU_ExecProlog
TST rp, #CtlStatusFs ;process control?
BNE CPU_Control
TEQP lr, #0 ;(restore NZCV)
;
CPU_ExecInstrn
LDRB r0, [mpc], #1 ;fetch opcode
BIC r1, rsp, #&FF000000 ;execute opcode
LDR pc, [r1, r0, ASL #2]

So: the cycle count from the last instruction, in r0, is subtracted from register mcc to trigger the next event (events sit in a sorted queue, with VSync always there as a backstop); then any interrupt flags are checked (possibly better handled by pushing an event, looking at it now), the 6502 P flags are restored, and finally the opcode is fetched and executed via a lookup table:

CPU_6502__OpTbl DCD CPU_6502__BRK_00, CPU_6502__ORA_01, CPU_65N02_NOP_02
....

Example opcode:
CPU_6502__LDA_AD ;lda abs
OpcLda Abs, 4

All done via macros:

MACRO
OpcLda $opd, $dcc ;lda
MemAdr mea, $opd
MemRdB r0, mea, $opd
MOV ra, r0, LSL #24
TEQ ra, #0
ADD mcc, mcc, #($dcc << 24)
B CPU_ExecEpilog
MEND

Every instruction jumps back to CPU_ExecEpilog above.
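The countdown-to-next-event pattern in that epilog can be sketched in C (names and the frame-length constant are hypothetical, not from the ARM source above): each instruction's cycle cost is charged against a counter, and when it goes negative the event queue is processed and the counter reloaded from the next deadline.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int cycles_left; int events_fired; } Sched;

/* Charge 'cost' cycles; returns 1 if the event queue was processed. */
int charge_cycles(Sched *s, int cost) {
    s->cycles_left -= cost;
    if (s->cycles_left < 0) {
        s->events_fired++;        /* process due events here          */
        s->cycles_left += 17030;  /* e.g. 1 MHz cycles per NTSC frame */
        return 1;
    }
    return 0;
}
```

Keeping the common path to a subtract-and-branch is what makes this cheap enough to run after every emulated instruction.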

> KEGS uses FP variables for cycle counting SOLELY because my main target platform, a PA-RISC CPU that was in-order with limited ability to "dual-issue" (execute two non-dependent instructions in parallel), had a nearly 100% ability to issue an FP and an integer instruction in parallel.

Hah, I'd always wondered about that. Did I miss a comment? ;-)

Cheers,
Nick.