Paku Paku -- 1.6 released 9 November 2011

Discussions on programming older machines

Re: Paku Paku -- new DOS Game released

Postby Trixter » Sat Mar 12, 2011 3:10 pm

Brutman wrote: Michael Abrash did a great book on high performance graphics programming - it is full of tips and tricks.


The better books that apply to our hobby are actually Abrash's "Zen of Assembly Language", which covers optimization of the 8088, and Richard Wilton's "Programmer's Guide to PC & PS/2 Video Systems", which goes into (amongst many things) drawing pixels and lines on CGA with some extremely optimized code. It does not cover the 160x100 tweaked text mode, nor Tandy/PCjr, but the CGA information can be extended to cover Tandy/PCjr.

These books, including Abrash's Graphics Programming Black Book, are available here: ftp://ftp.oldskool.org/pub/misc/8088%20Programming.rar
The Graphics Programming Black Book is, as Jordan said, 90% applicable only to 256-color and 386 machines (the other 10% covers theory that you can apply to anything as long as you have an optimized horizontal scanline drawer). However, Abrash's earlier book, Zen of Graphics, has a lot of the same material but is definitely oriented towards EGA/286 and might be more useful to us. Unfortunately, I don't have Zen of Graphics in electronic form, only the real book, so I can't share it.
You're all insane and trying to steal my magic bag!
Trixter
 
Posts: 537
Joined: Mon Sep 01, 2008 12:00 am
Location: Illinois, USA

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby deathshadow60 » Sun Mar 13, 2011 7:54 pm

I like the online copy:
http://www.phatcode.net/res/224/files/html/index.html

But phatcode's got a whole directory full of useful information, in downloadable html/PDF and online HTML format.
The only thing about Adobe web development products that can be considered professional grade tools are the people promoting their use.
deathshadow60
 
Posts: 62
Joined: Mon Jan 10, 2011 6:17 am
Location: Keene, NH

Re: Paku Paku -- new DOS Game released

Postby Trixter » Tue Mar 15, 2011 10:33 am

deathshadow60 wrote:
Brutman wrote: For example, even though the 8088 only has an eight bit bus it is far quicker to use a 16 bit read or write than to use two 8 bit read or writes.

Which is why I'm using LODSW and MOVSW a lot. The problem is how the data is stored in this mode... Odd numbered bytes have to remain 0xDD while even numbered bytes hold the two "pixels" as 4 bit packed... So word-sized operations can only do two pixels at a time even in a 4 bit per pixel mode. It's actually why the core of my 5x5 blit routine looks like this:

Code:
   lodsw
   mov  bx,ax
   mov  ax,es:[di]
   and  al,bh
   or   al,bl
   stosw


I even unrolled the loops to squeeze a few extra cycles out of it. The sprite data format is stored as byteMask:byteData words which I point to with DS:SI for LODSW... which I then move to BX (which sucks, but is still faster than MOV reg16,mem; ADD SI,2) so I can use BH as the mask and BL as the data. Read in ES:[DI], AND, OR, and then write it out... Which is the process for every two pixels in the sprite. You'll notice I only operate on AL, since AH stores the $DD character value that has to be preserved.


Storing the sprite as lowbyte=color highbyte=mask is very clever! I hadn't thought of that, and will definitely use that technique when I flesh out the rest of my CGA library.
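
To make the byteMask:byteData format concrete, here is a minimal Python model of what the six quoted instructions do to one screen word. It is only an illustration: the function name and the sample values are made up, and it follows the description in the quote (high byte of the screen word stays $DD, low byte holds the two packed pixels).

```python
# Illustrative model only: names and sample values are invented for this sketch.
CHAR = 0xDD  # character value that must survive in the high byte of the word


def blit_word(screen_word, sprite_word):
    """Apply one byteMask:byteData sprite word to one 16-bit screen word."""
    mask = (sprite_word >> 8) & 0xFF  # BH after MOV BX,AX
    data = sprite_word & 0xFF         # BL
    low = screen_word & 0xFF          # AL: the two packed 4-bit pixels
    high = screen_word & 0xFF00       # AH: never modified
    low = (low & mask) | data         # AND AL,BH then OR AL,BL
    return high | low


# Background byte is 0x12. Mask 0xF0 keeps the high nibble, data 0x0E
# fills the low nibble.
result = blit_word((CHAR << 8) | 0x12, 0xF00E)
print(hex(result))  # 0xdd1e
```

A mask of 0xFF with data 0x00 leaves the word untouched (a fully transparent pair of pixels), and a mask of 0x00 replaces both pixels outright.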

One of the things I love about coding for the 8088/8086 is that all timings and behavior are known. Like other old platforms (or embedded platforms), it truly is possible to write the "best" code for a particular situation -- no unpredictable caches or unknown CPUs screwing up your optimization. Whenever I see a bit of 808x assembly, I try to see if it can be reworked to be "best". So I thought it would be fun to try to optimize your sprite routine.

First, let's look at your original code, with timings and size:

Code:
lodsw            16c 1b
mov  bx,ax       2c  2b
mov  ax,es:[di]  10c 3b
and  al,bh       3c  2b
or   al,bl       3c  2b
stosw            11c 1b
--------------------------
subtotal:        45c 11b
total cycles (4c per byte): 89 cycles


On 8088, reading a byte of memory takes 4 cycles, whether it's "MOV AX,mem" or the MOV AX opcodes themselves. That's why smaller slower code can sometimes win over larger faster code. So it's important to take the size of the code into account when optimizing for speed.
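
The estimate used in the listing above (execution cycles plus 4 cycles per opcode byte) can be sketched in a few lines of Python. This is only a calculator for the rule as stated: the per-instruction figures are copied from the listing as posted, the function name is made up, and prefetch overlap is deliberately ignored.

```python
# Rough estimate as described in the post: execution cycles plus 4 cycles
# for every opcode byte fetched. Prefetch overlap is ignored on purpose.
FETCH_CYCLES_PER_BYTE = 4


def estimate(listing):
    """listing: (mnemonic, exec_cycles, size_bytes) -> (cycles, bytes, total)."""
    cycles = sum(c for _, c, _ in listing)
    size = sum(b for _, _, b in listing)
    return cycles, size, cycles + FETCH_CYCLES_PER_BYTE * size


# Figures exactly as posted in the listing above.
original = [
    ("lodsw",          16, 1),
    ("mov bx,ax",       2, 2),
    ("mov ax,es:[di]", 10, 3),
    ("and al,bh",       3, 2),
    ("or al,bl",        3, 2),
    ("stosw",          11, 1),
]

print(estimate(original))  # (45, 11, 89)
```

Feeding the later listings through the same function makes it easy to see how a shorter encoding can offset a slower instruction under this rule.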

Like you, the mov bx,ax bugged me, so I thought about eliminating it. Because you do your drawing to an off-screen buffer in system RAM, and the buffer is smaller than the size of a segment, you have room left over in that segment. So if you store your sprites in that segment, we can get DS to point to both screen buffer and sprite data. Doing that lets us point BX to the offset where the sprite is (it was originally meant to be an index register after all), and use the unused DX register to hold the sprite/mask. We can then rewrite the unrolled inner loop to this:

Code:
mov  dx,[bx]     8+5=13c 2b ;load sprite data/mask
lodsw            16c 1b ;load existing screen pixels
and  al,dh       3c  2b ;mask out sprite
or   al,dl       3c  2b ;or sprite data
stosw            11c 1b ;store modified screen pixels
inc  bx          3c  2b ;move to next sprite data grouping
--------------------------
subtotal:        49c 10b
total cycles (4c per byte): 89 cycles


Although we saved a byte, it's a wash -- exactly the same number of cycles. But, since you're already unrolling the sprite loop for extra speed, we can change INC BX to just some fixed offset in the loop, like:

Code:
mov dx,[bx+1]
...
mov dx,[bx+2]
...
mov dx,[bx+3]


Which means the inner loop is now:

Code:
mov  dx,[bx+NUM] 8+9=17c 3b ; "NUM" being the offset in the loop
lodsw            16c 1b
and  al,dh       3c  2b
or   al,dl       3c  2b
stosw            11c 1b
--------------------------
subtotal:        50c 9b
total cycles (4c per byte): 86 cycles


Hey, we saved three cycles over the original (and, as a nice side effect, two bytes), successfully squeezing blood from a stone. Awesome.

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby deathshadow60 » Wed Mar 16, 2011 4:07 am

This:
Code:
mov  dx,[bx+NUM] 8+9=17c 3b ; "NUM" being the offset in the loop
lodsw            16c 1b
and  al,dh       3c  2b
or   al,dl       3c  2b
stosw            11c 1b

doesn't make any sense... or I'm not following how that would work at all... Cloning DI to BX seems kinda pointless, and a disp+base EA is 9 clocks whereas just the index is 5... doesn't it make more sense, since we already have ES:[DI] pointing at it, to USE ES:[DI]?

Also, for 8088 on a word operation isn't that mov dx,[base+offset] 12 clocks plus 9 for the EA, for 21 clocks total, not your 17? That puts 4 more clocks onto yours making it LESS efficient.

12+7ea (5 index with 2 segment override) vs. 12+9ea (disp+base)
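
The arithmetic behind those two figures can be written out as a small Python sketch. The EA and MOV clock counts are the standard published 8086/8088 values; the dictionary keys and the function name are made up for the illustration.

```python
# Published 8086/8088 effective-address (EA) clock counts for the modes used here.
EA_CLOCKS = {
    "base_or_index": 5,   # [bx], [si], [di]
    "disp_only": 6,       # [1234h]
    "base_plus_disp": 9,  # [bx+NUM]
}
SEGMENT_OVERRIDE = 2   # extra clocks for a segment prefix such as ES:
MOV_REG_MEM = 8        # MOV reg,mem before the EA is added
WORD_PENALTY_8088 = 4  # a 16-bit operand needs a second bus cycle on the 8-bit bus


def mov_word_from_mem(mode, override=False):
    """Clocks for MOV reg16,mem on an 8088 with the given addressing mode."""
    clocks = MOV_REG_MEM + WORD_PENALTY_8088 + EA_CLOCKS[mode]
    return clocks + (SEGMENT_OVERRIDE if override else 0)


print(mov_word_from_mem("base_or_index", override=True))  # mov bx,es:[di]  -> 19
print(mov_word_from_mem("base_plus_disp"))                 # mov dx,[bx+NUM] -> 21
```

With 21 clocks in place of 17, the 86-cycle estimate quoted above would become 90 under the same 4-cycles-per-byte rule, one more than the 89 of the original routine.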

While calling lodsw after the mov might SEEM to make sense
Code:
mov  bx,es:[di]  12+7ea
lodsw            16
and  al,bh       3
or   al,bl       3
stosw            11


In practice your version and this new one I just listed are actually SLOWER because the memory moves are optimized for AX, and the prefetch buffer while processing the second byte of the MOV prefetches both bytes of the next opcode. LODSW may be a memory operation, but it does the inc AFTER the memory access, leaving the prefetch buffer free to populate with the next opcode. It's actually faster to do the LODSW first!

Which is why your calculations are off for total cycles on 8088 because you didn't figure the prefetch buffer into it. Long operations like lodsw have internal clock waits that allow the word-wide prefetch buffer to grab the next bytes making things like the short moves after take less time. You are basically saying that a 2 byte operation that takes 3 clocks internally actually takes 11 clocks -- which is total nonsense!

I think you missed that part of the 8088 -- the 32 bit wide instruction prefetch buffer that turns execution cycles in opcodes into memory reads for the next instruction. (48 bit wide on the 8086/V20)
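
One way to frame the disagreement is as two crude bounds on the original six-instruction block. The Python sketch below is not a prefetch simulator; it assumes no wait states, no DRAM refresh, and a steady-state unrolled loop, and it reuses the execution figures as they were posted. The data-byte column is an addition for this sketch.

```python
# Two crude bounds for the original block. Assumptions: no wait states,
# no refresh, steady-state unrolled loop, execution figures as posted.
BUS_CYCLE = 4  # clocks per byte moved over the 8088's 8-bit bus

# (mnemonic, exec_cycles_as_posted, code_bytes, data_bytes_read_or_written)
block = [
    ("lodsw",          16, 1, 2),
    ("mov bx,ax",       2, 2, 0),
    ("mov ax,es:[di]", 10, 3, 2),
    ("and al,bh",       3, 2, 0),
    ("or al,bl",        3, 2, 0),
    ("stosw",          11, 1, 2),
]

exec_cycles = sum(row[1] for row in block)
code_bytes = sum(row[2] for row in block)
data_bytes = sum(row[3] for row in block)

# No overlap at all: every opcode byte costs a bus cycle on top of execution.
upper = exec_cycles + BUS_CYCLE * code_bytes
# Perfect overlap: the block still cannot finish before the bus has moved
# every code and data byte, nor before the execution unit is done.
lower = max(exec_cycles, BUS_CYCLE * (code_bytes + data_bytes))

print(lower, upper)  # 68 89
```

The real figure on hardware should land somewhere between the two, which is why instruction order (and measuring) matters.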

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby Brutman » Wed Mar 16, 2011 6:11 am

I'm sure that Jim knows the problem with the prefetch buffer. Which is part of the reason why the V20 is such a nice upgrade. Besides a few extra instructions it has the wider prefetch buffer, so the machine is idle less often. It's really shockingly bad that, if the 8086 was relatively 'balanced' in terms of prefetch and execute, Intel cut the bus width in half and did not adjust the prefetch buffer to compensate.

I don't try to get into the cycle counting as much as you guys do because it is just so hard to predict when you take the entire system into account. That is especially true on a PCjr: the first 128K is significantly slower because of the video refresh, while expansion memory probably has dedicated DRAM controllers on it to control refresh instead of using the DMA channel, so in theory it might be faster than a stock 5150.

My general rules of thumb are to avoid branching, use compact opcodes that do a lot (string ops), and verify everything by actually timing the execution of the code. I also do most of my work in C because of the sheer volume of code, so I spend a lot of time looking at what the compiler did to see if I can 'help' it, either by changing the code slightly or just dropping into ASM.
Brutman
Site Admin
 
Posts: 970
Joined: Sat Jun 21, 2008 5:03 pm

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby deathshadow60 » Wed Mar 16, 2011 6:30 am

Brutman wrote: and verify everything by actually timing the execution of the code.

TRUTH there. Switching around the order in which things are done, and switching general register commands to use the accumulator where possible, even with the MOV and extra bytes, often turns up faster on a real 8088 even when the 'math' of opcode execution and memory read times says otherwise... I wouldn't even have noticed it testing in DOSBox, but running a fixed loop of blitting for 2 seconds and seeing how many I could do flat-out showed it.
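
The "run it flat-out for a fixed time and count" measurement can be sketched as below. This is a Python stand-in rather than anything that would run on an 8088: the two workloads are hypothetical placeholders for two versions of a blit routine, and the 0.05 second window is just short enough for a quick demonstration.

```python
import time


def count_runs(work, seconds):
    """Call work() repeatedly for roughly `seconds`; return completed calls."""
    deadline = time.perf_counter() + seconds
    runs = 0
    while time.perf_counter() < deadline:
        work()
        runs += 1
    return runs


# Hypothetical stand-ins for two versions of the routine under test.
def variant_a():
    sum(range(200))


def variant_b():
    sum(range(400))


a = count_runs(variant_a, 0.05)
b = count_runs(variant_b, 0.05)
print("variant_a runs:", a, "variant_b runs:", b)
```

The higher count wins. Counts vary from run to run, so small differences are only worth trusting after several repeats.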

Also, something simple I had drilled into me programming the 8088 -- long-short-long-short. Alternate opcodes with long exec times with short ones, so as to best leverage the prefetch. Even long -> medium -> short works -- and where possible byte sized followed by long execs... which is why the LODSW followed by the mov reg,reg followed by the mov reg,mem followed by the AND works better.

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby Trixter » Wed Mar 16, 2011 8:56 am

I usually leave reordering of code and prefetch queue considerations for the final phase of optimization (writing the smallest/"best" code and then actually timing it) because I find that I overestimate how much it will help me. The queue is only 4 bytes long on the 8088, and can usually hold only 1 or 2 instructions on average.

A friend pointed out to me that two of my timings are wrong, which means some of the information and assumptions I posted are incorrect.

I am going to revise my post after I perform some real timings with the 8253 and get some real cycle counts from the real hardware. I'll post my findings.
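
For reference, turning 8253 counter readings into CPU cycles is simple arithmetic on the IBM PC, because the timer and the CPU are divided down from the same 14.31818 MHz crystal. The Python sketch below shows the conversion only. The function names and sample readings are hypothetical, and it assumes the channel is in a mode that counts down by one per tick (mode 2, for example) and wraps at most once between readings.

```python
from fractions import Fraction

CRYSTAL_HZ = Fraction(14_318_180)  # IBM PC master crystal, 14.31818 MHz
CPU_HZ = CRYSTAL_HZ / 3            # 8088 clock, about 4.77 MHz
PIT_HZ = CRYSTAL_HZ / 12           # 8253 input clock, about 1.19 MHz


def pit_ticks_to_cpu_cycles(start_count, end_count):
    """The 8253 counts DOWN, so elapsed ticks = start - end (mod 65536)."""
    ticks = (start_count - end_count) % 65536
    return ticks * CPU_HZ / PIT_HZ  # one timer tick is exactly 4 CPU clocks


def cycles_per_iteration(start_count, end_count, iterations):
    return pit_ticks_to_cpu_cycles(start_count, end_count) / iterations


# Hypothetical readings: counter went from 60000 to 37750 over 1000 iterations.
print(float(cycles_per_iteration(60000, 37750, 1000)))  # 89.0
```

Since each tick is 4 CPU clocks, timing many iterations and dividing is what gives single-cycle resolution.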

Re: Paku Paku -- 1.4 released 5 Mar 2011

Postby Trixter » Wed Mar 16, 2011 8:22 pm

Revised information, with corrections to my cycle timings as well as a solution that is even faster than mine, is now here: http://trixter.oldskool.org/2011/03/15/ ... -climbing/

As promised, I'll publish actual timings for all three blocks of code, but that might have to wait until tomorrow as I'm almost out of time tonight.

I am aware that DMA refresh cycles, wait states reading/writing display memory, and background interrupts can all alter timing of code. But the same general principles still apply. Usually you start worrying about those things only after you have two or three blocks of code that have the same cycles and sizes and don't know which one is faster in practice.

Re: Paku Paku -- 1.5 released 21 Mar 2011

Postby deathshadow60 » Mon Mar 21, 2011 12:08 pm

As mentioned in my edit of the original post, here's release 1.5 -- which should clear up the PCjr issues, and if not, this version comes with pakut1Jr.exe, which tries to use the 160x200 graphics mode. The normal paku.exe is the preferred method of running it, but for Jr. or Tandy owners who do see problems with the video I've provided that as a way to at least try it.

As discussed over on Trixter's blog, we've got some code improvements, the biggest of which is that the backbuffer blit routine is now:

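Code:
   mov  cx,5          { 5 sprite rows }
   mov  bx,$004D      { 77: with the 3 bytes written per row, DI advances 80 }

{ I'm NOT unrolling this one! }

@loop:

   lodsw              { AH = mask, AL = data }
   and  ah,es:[di]    { keep the background bits the mask allows }
   or   al,ah         { merge in the sprite data }
   stosb              { write one byte, DI advances by 1 }

   lodsw
   and  ah,es:[di]
   or   al,ah
   stosb

   lodsw
   and  ah,es:[di]
   or   al,ah
   stosb

   add  di,bx         { step DI to the start of the next row }

   loop @loop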


Yer not gonna get much cleaner than that. Likewise the screen copy from the backbuffer reads:

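Code:
   mov  cx,8          { 8 rows }
   mov  bx,$0099      { 153: with the 7 bytes stepped per row, DI advances 160 }
   mov  ax,$004C      { 76: with the 4 bytes read per row, SI advances 80 }

@loop:
   movsb              { copy one byte from DS:SI to ES:DI }
   inc  di            { skip the next screen byte }
   movsb
   inc  di
   movsb
   inc  di
   movsb
   add  si,ax         { next backbuffer row }
   add  di,bx         { next screen row }
   loop @loop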


Which is about as good as that's gonna get too.
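
The address arithmetic in the two loops above is easier to check in a high-level model. The Python sketch below mirrors them instruction by instruction on plain byte arrays. It is only an illustration: the function names, buffer sizes, start offsets, and sample sprite are all made up, and the 80 and 160 byte strides are simply what the constants in the listings add up to.

```python
BACK_STRIDE = 80     # 3 + $4D in the blit, 4 + $4C in the copy
SCREEN_STRIDE = 160  # 7 + $99 in the copy


def blit_sprite(back, di, sprite_words):
    """5 rows x 3 bytes: dst = (dst AND mask) OR data, mask in the high byte."""
    words = iter(sprite_words)
    for _ in range(5):                           # mov cx,5 ... loop
        for _ in range(3):                       # three lodsw..stosb groups
            w = next(words)                      # lodsw
            mask, data = w >> 8, w & 0xFF        # AH, AL
            back[di] = (back[di] & mask) | data  # and ah,es:[di] / or al,ah / stosb
            di += 1
        di += 0x4D                               # add di,bx
    return di


def copy_to_screen(back, si, screen, di):
    """8 rows x 4 bytes, writing every other screen byte."""
    for _ in range(8):                           # mov cx,8 ... loop
        for col in range(4):
            screen[di] = back[si]                # movsb
            si += 1
            di += 1
            if col < 3:
                di += 1                          # inc di (none after the 4th movsb)
        si += 0x4C                               # add si,ax
        di += 0x99                               # add di,bx
    return si, di


# Made-up test data: background 0x11 everywhere, screen pre-filled with 0xDD.
back = bytearray([0x11] * (BACK_STRIDE * 8))
screen = bytearray([0xDD] * (SCREEN_STRIDE * 8))
end_di = blit_sprite(back, 0, [0x0F20] * 15)  # keep low nibble, set high nibble to 2
end_si, end_screen_di = copy_to_screen(back, 0, screen, 0)
print(end_di, end_si, end_screen_di)  # 400 640 1280
```

After the run, the bytes the copy skipped still hold 0xDD, and the fourth byte of each copied row keeps the untouched background value because the sprite is only three bytes wide.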

Hopefully this version will clear up the final Jr. issues and wring as much speed as is practical to try for out of the 8088. I could change the sprite format to make backbuffer blits even faster, but I'm going to save that for the next game as this one is now "fast enough".

Re: Paku Paku -- 1.5 released 21 Mar 2011

Postby Vorticon » Sat Apr 09, 2011 5:13 pm

Has anybody else tried version 1.5? The sprites for Pacman and some of the fruits seem corrupted...
Vorticon
 
Posts: 276
Joined: Fri Nov 27, 2009 7:25 am
