Paku Paku -- 1.6 released 9 November 2011
-
deathshadow60
- Posts: 62
- Joined: Mon Jan 10, 2011 5:17 am
- Location: Keene, NH
- Contact:
*** NOTE *** version 1.6 released
After eight months in development and testing, version 1.6 of Paku Paku has been released. I've also revamped my programming website to host it officially -- so from now on if you're looking for information about Paku Paku, the new site is:
http://www.deathshadow.com/
With the page about Paku Paku being here:
http://www.deathshadow.com/pakuPaku
You can also play the game live on that site if you have Java installed, using the java port of DOSBox.
http://www.deathshadow.com/pakuPakuLive
The new version adds a slew of bugfixes, new hardware support, and a much more compact and speedy executable. Check out the revision history:
http://www.deathshadow.com/pakuPaku_Revisions
for the full list of changes...
On modern systems this runs great in DOSBox -- I highly suggest setting oplmode=cms in the DOSBox config and using the /cms command line option for the best-sounding version.
Last edited by deathshadow60 on Wed Nov 09, 2011 12:17 am, edited 9 times in total.
The only thing about Adobe web development products that can be considered professional grade tools are the people promoting their use.
Re: Paku Paku -- new DOS Game released
Well, we'll just have to try it out tonight on the real deal and see how it runs. Although if you are shooting for an 8Mhz AT with 16 bit video this might be painful.
Where is the bottleneck? Are you painting portions of the screen, xor'ing data, or what? (I have a little experience with fast screen updating ...)
And why on earth do you not have a PCjr yet?
Mike
-
deathshadow60
Re: Paku Paku -- new DOS Game released
Brutman wrote: Well, we'll just have to try it out tonight on the real deal and see how it runs. Although if you are shooting for an 8Mhz AT with 16 bit video this might be painful.
My original target was a 7mhz Tandy 1K, but the hardware just isn't up to the job of realtime sprites without flicker.
Brutman wrote: Where is the bottleneck? Are you painting portions of the screen, xor'ing data, or what? (I have a little experience with fast screen updating ...)
The only time you can draw to the screen without the risk of flicker (and snow on real CGA cards) is during a retrace period. The horizontal retrace is so short that by the time you detect that you're in it, it's already ended, leaving the vertical retrace as the only real option. Most of the bottleneck is that PC video cards have no built-in sprite handling, so it's all software.
To draw a sprite without corrupting the background behind it you are stuck, every time you draw it, with restoring the original background, grabbing the background at the sprite's new location, then doing an AND mask to erase where you'll be drawing the sprite and then ORing the sprite's color data. You can't even take the XOR shortcut since this particular color mode is 4bpp packed in alternating stripes... (every other byte being a text mode character 0xDD). The pixel packing also means you need two copies of the sprite (since it takes too long to SHR 4 and carry on the fly across 6 bytes) in addition to the mask, and you have the overhead of figuring out which copy (even or odd aligned) you want to blit.
Net result? 30 byte write to restore background, 30 byte read for original background, 30 byte read/write AND, 30 byte read/write OR, working out to 90 bytes read and 90 bytes written PER 5x5 sprite. With five (six if you count the bonus item) sprites that's 450 bytes in and 450 bytes out... and an 8MHz 8088 just cannot manage to do that inside the retrace period... in fact it generally only has enough time free to do about 18-28 bytes depending on the code overhead.
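For anyone following the numbers, here is a minimal C sketch of that byte count. The function names are made up for illustration; the only inputs are the figures quoted in the post (a 5x5 sprite handled as 5 rows of 6 bytes, three read passes and three write passes per sprite).

```c
#include <assert.h>

/* Figures from the post: a 5x5 sprite is handled as 5 rows of 6 bytes. */
enum { SPRITE_ROWS = 5, BYTES_PER_ROW = 6 };

/* Bytes in one sprite-sized block of screen memory (30). */
static unsigned block_bytes(void) {
    return SPRITE_ROWS * BYTES_PER_ROW;
}

/* Reads per frame: grab new background + AND pass + OR pass. */
static unsigned bytes_read(unsigned sprites) {
    return sprites * 3u * block_bytes();
}

/* Writes per frame: restore old background + AND pass + OR pass. */
static unsigned bytes_written(unsigned sprites) {
    return sprites * 3u * block_bytes();
}

/* Self-check against the totals quoted in the post. */
static void check_post_figures(void) {
    assert(bytes_read(1) == 90 && bytes_written(1) == 90);
    assert(bytes_read(5) == 450 && bytes_written(5) == 450);
}
```

Against the 18-28 bytes the post estimates can be moved inside one retrace, 450 bytes each way is more than an order of magnitude over budget.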
AND you have to erase all of them then draw all of them together in groups in the opposite order so that if they overlap their restored backgrounds don't screw it up.
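The ordering rule can be shown with a small C model. This is only a sketch: the screen is a flat 16-byte array, each "sprite" is a 4-byte solid fill, and the struct and function names are hypothetical rather than anything from the Paku Paku source.

```c
#include <assert.h>
#include <string.h>

enum { SCREEN_BYTES = 16, SPRITE_BYTES = 4 };

struct sprite {
    unsigned pos;                       /* byte offset into the screen */
    unsigned char fill;                 /* stand-in for the sprite image */
    unsigned char saved[SPRITE_BYTES];  /* background captured before drawing */
};

/* Save the background under each sprite, then draw it, in list order. */
static void draw_all(unsigned char *screen, struct sprite *s, int n) {
    for (int i = 0; i < n; i++) {
        assert(s[i].pos + SPRITE_BYTES <= SCREEN_BYTES);
        memcpy(s[i].saved, screen + s[i].pos, SPRITE_BYTES);
        memset(screen + s[i].pos, s[i].fill, SPRITE_BYTES);
    }
}

/* Restore the saved backgrounds in the opposite order they were drawn. */
static void erase_all(unsigned char *screen, const struct sprite *s, int n) {
    for (int i = n - 1; i >= 0; i--)
        memcpy(screen + s[i].pos, s[i].saved, SPRITE_BYTES);
}
```

When two sprites overlap, the later one's saved background contains pixels of the earlier one. Restoring last-drawn-first puts those stale pixels back only briefly, and the earlier sprite's own restore then overwrites them with the true background.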
A possible solution I'm investigating is breaking the screen into 'sections' and only drawing elements within those 'sections' a certain period of time after the retrace. For example when the retrace occurs you can draw the bottom half of the screen, then by the time that's done the beam will be drawing the bottom half, letting you draw the top. The problem with this is what to do when sprites overlap across those boundaries -- I'm still not sure how I'm going to handle that. The overhead of dividing them up and checking their position for redraw could take so much time it impacts game performance.
It's a problem all text-mode applications on the CGA face thanks to the unbuffered memory.
A better fix would be to do paging, but because this is a glorified text mode with every two pixels taking up two bytes (high byte $DD, low byte broken into pixel colors) it ends up 16000 bytes -- and the CGA only has 16k of RAM. (though I may make a tandy/jr port that uses their native 160x200 mode that has a lot less of these problems)
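A quick C sketch of why a second page doesn't fit. The 80x100 cell grid is an assumption (it is the layout that gives the 16000 bytes quoted above at two bytes per cell); the names are illustrative only.

```c
#include <assert.h>

enum {
    CELLS_X = 80,               /* assumed: 80 text cells per row */
    CELLS_Y = 100,              /* assumed: 100 rows, i.e. 160x100 "pixels" */
    BYTES_PER_CELL = 2,         /* one byte of pixel colors, one byte 0xDD */
    CGA_RAM_BYTES = 16 * 1024   /* 16k of video RAM on the CGA */
};

/* Size of one full screen in this mode. */
static unsigned page_bytes(void) {
    return CELLS_X * CELLS_Y * BYTES_PER_CELL;
}

/* How many complete pages fit in the given amount of video RAM. */
static unsigned pages_that_fit(unsigned vram_bytes) {
    assert(page_bytes() > 0);
    return vram_bytes / page_bytes();
}
```

With 16384 bytes of video RAM and 16000 bytes per page, only one page fits and just 384 bytes are left over, so there is nowhere to put a second page to flip to.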
Brutman wrote: And why on earth do you not have a PCjr yet?
Because I have a limited budget, limited space, and a working 1000 SX? Only way I'd have room for a Jr would be to move the Coco down to the garage again, and that's not happening in the dead of winter.
Re: Paku Paku -- new DOS Game released
Nicely done! I played it on the PCjr and it works just fine, although there is definitely some flickering, but it does not really distract from the gameplay. Tandy sound works perfectly BTW. Really a fine piece of programming 
Re: Paku Paku -- new DOS Game released
The PCjr doesn't suffer from CGA snow, so a PCjr version should not be burdened with the need to check for the vertical retrace. And it might make the game faster too because you won't have that dead time spinning waiting for it to start.
I should look at the code and stop asking questions. But there are some fun tricks you can use to improve performance. For example, even though the 8088 only has an eight bit bus it is far quicker to use a 16 bit read or write than to use two 8 bit reads or writes. The string ops with a REP prefix can really speed things up. (Especially when filling memory. REP on a LODSW is kind of pointless.)
Michael Abrash did a great book on high performance graphics programming - it is full of tips and tricks.
Mike
-
deathshadow60
Re: Paku Paku -- new DOS Game released
Brutman wrote: The PCjr doesn't suffer from CGA snow, so a PCjr version should not be burdened with the need to check for the vertical retrace.
The check isn't just for snow though -- because there are no hardware sprites you have to erase the element and redraw it (or at least that's the only reliable way to do sprites fast enough). If the retrace goes by while the sprite is erased, that's where the 'flicker' comes in. This is why modern graphics usually either use double buffering (two video buffers: you draw to one and show the other) or rely upon vSync for the redraw.
Brutman wrote: And it might make the game faster too because you won't have that dead time spinning waiting for it to start.
That's actually why I have the speed test in the code which shows a red "machine too slow" and disables the vSync... which ends up cute since it's similar to how the real Pac-Man hardware starts up (though in their case it's a memory test).
Brutman wrote: For example, even though the 8088 only has an eight bit bus it is far quicker to use a 16 bit read or write than to use two 8 bit read or writes.
Which I'm using LODSW and MOVSW a lot. The problem is how the data is stored in this mode... Odd numbered bytes have to remain 0xDD while even numbered bytes hold the two "pixels" as 4 bit packed... So word-sized operations can only do two pixels at a time even in a 4 bit per pixel mode. It's actually why the core of my 5x5 blit routine looks like this:
Code: Select all
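lodsw { AX = next sprite word: AL = color data, AH = AND mask }
mov bx,ax { keep it in BX: BL = data, BH = mask }
mov ax,es:[di] { AX = screen word: AL = two pixels, AH = the 0xDD byte }
and al,bh { clear the pixels the sprite will cover }
or al,bl { merge in the sprite's pixels }
stosw { write the word back (AH untouched) and advance DI }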
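For readers who don't speak 8088, the same word-at-a-time step can be modelled in C. This is a sketch of the six instructions above, not code from the game; `blit_word` is a made-up name, and the byte layout (pixels in the low byte, 0xDD in the high byte, sprite word holding data low and mask high) follows the description in the post.

```c
#include <assert.h>
#include <stdint.h>

/* One step of the masked blit: keep the high byte, AND then OR the low byte. */
static uint16_t blit_word(uint16_t screen, uint16_t sprite) {
    uint8_t data = (uint8_t)(sprite & 0xFFu);   /* BL: sprite pixels */
    uint8_t mask = (uint8_t)(sprite >> 8);      /* BH: 1 bits keep background */
    uint8_t low  = (uint8_t)(screen & 0xFFu);   /* AL: two packed pixels */

    low = (uint8_t)((low & mask) | data);       /* and al,bh / or al,bl */

    uint16_t result = (uint16_t)((screen & 0xFF00u) | low);
    assert((result >> 8) == (screen >> 8));     /* the 0xDD byte is never touched */
    return result;
}
```

For example, a screen word of 0xDD12 blitted with mask 0xF0 and data 0x03 becomes 0xDD13: the left pixel (1) shows through, the right pixel is replaced by 3, and only two pixels have changed for a full 16-bit read and write.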
Brutman wrote: The string ops with a REP prefix can really speed things up.
Agreed, but there's just too much data to process in realtime with sprites since they are supposed to not erase the background beneath them allowing things to show through and still be there once it's moved somewhere else -- and there's NOT enough ram or time to redraw the original background.
Brutman wrote: Michael Abrash did a great book on high performance graphics programming - it is full of tips and tricks.
99.99% of which is only useful on the 256 or higher color modes like mode 0x13 and the various Mode X flavors... Though his planar 4 bit info was very useful when I made my old 720x480 16 color VGA hack... It just doesn't apply to this graphics mode with the column interlace of characters in the middle of the pixel-stream.
Though yeah, his book is a great tool -- ranks right up there with Ferraro's "Programmers Guide to the EGA/VGA Cards" and is right next to it on my shelf.
-
deathshadow60
Re: Paku Paku -- new DOS Game released
This is cute - I started playing with the idea of a tandy/jr optimized version, cutting out all the auto-detection and other hardware support... and was going to rewrite it for 160x200...
But that mode is scanline interlaced! That makes it slow as hell to actually implement code for! Not only does it add an extra layer of complication to the address calculation, I'd have to write just as many bytes at two offsets.
At first it looked promising since I could write 4 pixels with just one "iteration" and three memory accesses -- letting each loop handle up to 8px at once. (and for anything more than 4px there's no reason to code less than 8px at a time)
Code: Select all
mov cx,5 { five sprite rows }
@loopWrite:
mov bx,es:[di] { BX = four screen pixels }
lodsw { AX = AND mask }
and bx,ax
lodsw { AX = color data }
or ax,bx
stosw
mov bx,es:[di] { same again for the next four pixels }
lodsw
and bx,ax
lodsw
or ax,bx
stosw
add di,76 { next row: 80 bytes per scanline minus the 4 just written }
loop @loopWrite
But I'd have to scanline double either performing that same operation twice at the $2000 offset (wasteful) or by copying the upper buffer to the lower buffer thus:
Code: Select all
mov di,dx { I stored DI's starting offset in DX during blitting }
mov dx,ds { save DS so it can be restored afterwards }
push es { MOV can't copy one segment register to another... }
pop ds { ...so go through the stack: DS = ES }
mov si,di
add di,$2000
mov cx,5
mov bx,76 { 80 bytes per scanline minus the 4 copied per row }
@loopDupe:
movsw
movsw
add di,bx
add si,bx
loop @loopDupe
mov ds,dx { restore DS }
Guess that's why you don't see a lot of "true" sprite engine games on 8088 class machines.
So I'm sticking with the funky semigraphics mode I guess... I do think I can eliminate the flicker though by implementing a software buffer, I'm just not sure how that's going to affect performance as I'm going to end up needing to build two 8k buffers (source and render -- kiss that 128k friendly memory footprint goodbye), write code for erasing from the source buffer as you eat pellets, and of course the final rendering. It would allow me to pre-build the images before blitting them out to the display area though -- and if memory serves system RAM is usually faster than video RAM
CRUDSTUNK!!! Except on the Jr, isn't system RAM shared with video...
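The "extra layer of complication to the address calculation" mentioned above can be sketched in C. This assumes the usual two-bank layout for the Tandy/PCjr 160x200 16-color mode (even scanlines from offset 0, odd scanlines from $2000, 80 bytes per scanline, two pixels per byte); `pixel_offset` is a hypothetical helper, not something from the game.

```c
#include <assert.h>

enum {
    MODE_W = 160,
    MODE_H = 200,
    BYTES_PER_LINE = 80,    /* 160 pixels at 4 bits each */
    BANK_SIZE = 0x2000      /* odd scanlines live one bank up */
};

/* Byte offset in video memory of the byte holding pixel (x, y). */
static unsigned pixel_offset(unsigned x, unsigned y) {
    assert(x < MODE_W && y < MODE_H);
    return (y & 1u) * BANK_SIZE       /* pick the even or odd bank */
         + (y >> 1) * BYTES_PER_LINE  /* row within that bank */
         + (x >> 1);                  /* two pixels per byte */
}
```

Consecutive scanlines land 0x2000 bytes apart instead of 80, which is why a sprite blit either has to hop between banks on every row or write one bank and then copy it to the other.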
Re: Paku Paku -- new DOS Game released
Yup, on the Jr, the video RAM is reserved out of the system RAM. The size of the video RAM can be set with Jrconfig.dsk/Jrconfig.nrd if more or less than the default is needed, though.
-
deathshadow60
Re: Paku Paku -- new DOS Game released -- VER 1.2!!!
Ok, if you folks could run a check on the new version 1.2 for me that would be greatly appreciated. I've completely revamped how the sprite engine works by implementing those back-buffers I mentioned. It now needs 70k of free DOS memory, but the performance difference is night and day...
http://www.cutcodedown.com/retroGames/paku_1_2.rar
WAY better.
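For the curious, the back-buffer idea can be sketched roughly like this in C. It is a guess at the general shape based on the earlier post (a clean "source" buffer and a "render" buffer where sprites are composed off-screen), not the actual 1.2 code. The 8000-byte size and all names are assumptions.

```c
#include <assert.h>
#include <string.h>

enum { BUF_BYTES = 8000 };  /* assumed size of each "8k" buffer */

struct buffers {
    unsigned char source[BUF_BYTES];  /* clean playfield, no sprites */
    unsigned char render[BUF_BYTES];  /* playfield with sprites composed in */
};

/* "Erase" a sprite by re-copying the clean playfield into the render buffer. */
static void restore_span(struct buffers *b, unsigned off, unsigned len) {
    assert(off + len <= BUF_BYTES);
    memcpy(b->render + off, b->source + off, len);
}

/* Masked draw into the render buffer: AND with the mask, OR in the data. */
static void draw_span(struct buffers *b, unsigned off,
                      const unsigned char *mask, const unsigned char *data,
                      unsigned len) {
    assert(off + len <= BUF_BYTES);
    for (unsigned i = 0; i < len; i++)
        b->render[off + i] =
            (unsigned char)((b->render[off + i] & mask[i]) | data[i]);
}

/* Copy the finished span to the display in a single pass. */
static void present_span(const struct buffers *b, unsigned char *display,
                         unsigned off, unsigned len) {
    assert(off + len <= BUF_BYTES);
    memcpy(display + off, b->render + off, len);
}
```

The point of the design is that the display only ever receives finished pixels. The erase and the masked draw both happen in ordinary RAM, so there is no moment where a sprite is missing on screen when the retrace comes around.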
Re: Paku Paku -- new DOS Game released -- VER 1.2!!!
Sorry about the delayed response - I've been unexpectedly busy. It looks like you have the basics of high performance code covered .. ;-0
On the Jr the first 128K of RAM is going to be penalized. The video hardware gets priority access to it compared to the CPU, so a PCjr running anything in the lower 128K is dog slow compared to a real PC. Memory access above 128K is far better - when I configure my machines I always ensure that the lower 128K is fully consumed by video buffer and a RAM disk.
I have found out many times while coding my TCP/IP stack that buffering, while expensive in terms of memory, often beats computation. The 8088 has a hard time keeping its prefetch buffer full, so any extra instructions hurt. Branching is devastating too. So while the extra memory is painful to use, it looks like it made a big difference.
I can't begin to tell you how much code I've written and thrown away over the years. Performance tuning is an ongoing experiment ..