deathshadow60 wrote:
Brutman wrote: For example, even though the 8088 only has an eight-bit bus, it is far quicker to use one 16-bit read or write than two 8-bit reads or writes.
I'm already using LODSW and MOVSW a lot. The problem is how the data is stored in this mode... Odd-numbered bytes have to remain 0xDD while even-numbered bytes hold the two "pixels" as packed 4-bit values... So word-sized operations can only do two pixels at a time, even in a 4-bit-per-pixel mode. It's actually why the core of my 5x5 blit routine looks like this:
Code:
lodsw
mov bx,ax
mov ax,es:[di]
and al,bh
or al,bl
stosw
I even unrolled the loops to squeeze a few extra cycles out of it. The sprite data format is stored as byteMask:byteData words which I point to with DS:SI for LODSW... which I then move to BX (which sucks, but is still faster than MOV reg16,mem; add SI,2) so I can use bh as the mask and bl as the data. Read in ES:DI, and, or and then write it out... Which is the process for every two pixels in the sprite. You'll notice I only operate on AL, since AH stores the $DD character value that has to be preserved.
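To make the mask-then-merge step concrete, here is a small Python model of what one pass of that six-instruction core does to a single screen word. It is only a sketch: `blit_word` and the sample values are made up for illustration, and it assumes the layout described above (low byte = two packed pixels, high byte = the 0xDD character; sprite word = mask in the high byte, data in the low byte).

```python
def blit_word(screen_word, sprite_word):
    """Model one LODSW / AND / OR / STOSW step on a 16-bit screen word."""
    mask = (sprite_word >> 8) & 0xFF   # BH: screen bits to keep
    data = sprite_word & 0xFF          # BL: sprite pixels to merge in
    al = screen_word & 0xFF            # AL: two packed 4-bit pixels
    ah = (screen_word >> 8) & 0xFF     # AH: character byte, never modified
    al = (al & mask) | data            # and al,bh / or al,bl
    return (ah << 8) | al              # stosw writes AH back unchanged

# Hypothetical values: keep the high nibble (mask 0xF0), write 0xA to the low nibble.
print(hex(blit_word(0xDD12, 0xF00A)))  # 0xdd1a
```

A mask of 0xFF with data 0x00 leaves the word untouched, which is how a fully transparent pixel pair would be encoded under these assumptions.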
Storing the sprite as lowbyte=color highbyte=mask is very clever! I hadn't thought of that, and will definitely use that technique when I flesh out the rest of my CGA library.
One of the things I love about coding for the 8088/8086 is that all timings and behavior are known. Like other old platforms (or embedded platforms), it truly is possible to write the "best" code for a particular situation -- no unpredictable caches or unknown CPUs screwing up your optimization. Whenever I see a bit of 808x assembly, I try to see if it can be reworked to be "best". So I thought it would be fun to try to optimize your sprite routine.
First, let's look at your original code, with timings and size:
Code:
lodsw              16c  1b
mov bx,ax           2c  2b
mov ax,es:[di]     10c  3b
and al,bh           3c  2b
or al,bl            3c  2b
stosw              11c  1b
--------------------------
subtotal:          45c 11b
total cycles (subtotal + 4c per code byte): 45 + 44 = 89 cycles
On the 8088, reading a byte of memory takes 4 cycles, whether that byte is data fetched by "MOV AX,mem" or one of the opcode bytes of the MOV AX instruction itself. That's why smaller, slower code can sometimes win over larger, faster code. So it's important to take the size of the code into account when optimizing for speed.
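As a quick sanity check of that accounting, the same arithmetic can be written as a few lines of Python. This is only a sketch of the cost model used in this post (execution cycles plus 4 cycles per code byte); `total_cycles` and `FETCH_CYCLES_PER_BYTE` are names invented here, and the per-instruction figures are copied from the listing above.

```python
FETCH_CYCLES_PER_BYTE = 4  # the post's figure for fetching one byte on the 8088

def total_cycles(listing):
    """listing: (mnemonic, execution cycles, encoded bytes) tuples."""
    exec_cycles = sum(cycles for _, cycles, _ in listing)
    code_bytes = sum(size for _, _, size in listing)
    return exec_cycles + FETCH_CYCLES_PER_BYTE * code_bytes

# Timings and sizes as given in the original routine's listing.
original = [
    ("lodsw", 16, 1),
    ("mov bx,ax", 2, 2),
    ("mov ax,es:[di]", 10, 3),
    ("and al,bh", 3, 2),
    ("or al,bl", 3, 2),
    ("stosw", 11, 1),
]

print(total_cycles(original))  # 45 + 4*11 = 89
```

Under this model, removing a byte of code is worth as much as removing 4 cycles of execution time.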
Like you, the mov bx,ax bugged me, so I thought about eliminating it. Because you do your drawing to an off-screen buffer in system RAM, and the buffer is smaller than the size of a segment, you have room left over in that segment. So if you store your sprites in that segment, we can get DS to point to both screen buffer and sprite data. Doing that lets us point BX to the offset where the sprite is (it was originally meant to be an index register after all), and use the unused DX register to hold the sprite/mask. We can then rewrite the unrolled inner loop to this:
Code:
mov dx,[bx]    8+5=13c  2b ;load sprite data/mask
lodsw              16c  1b ;load existing screen pixels
and al,dh           3c  2b ;mask out sprite
or al,dl            3c  2b ;or sprite data
stosw              11c  1b ;store modified screen pixels
inc bx              3c  2b ;move to next sprite data grouping
--------------------------
subtotal:          49c 10b
total cycles (4c per byte): 89 cycles
Although we saved a byte, it's a wash -- exactly the same number of cycles. But, since you're already unrolling the sprite loop for extra speed, we can change INC BX to just some fixed offset in the loop, like:
Code:
mov dx,[bx+2]
...
mov dx,[bx+4]
...
mov dx,[bx+6]
Which means the inner loop is now:
Code:
mov dx,[bx+NUM] 8+9=17c 3b ; "NUM" being the offset in the loop
lodsw              16c  1b
and al,dh           3c  2b
or al,dl            3c  2b
stosw              11c  1b
--------------------------
subtotal:          50c  9b
total cycles (4c per byte): 86 cycles
Hey, we saved three cycles over the original (and, as a nice side effect, two bytes), successfully squeezing blood from a stone. Awesome.
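Since the unrolled version is just the same five instructions repeated with a different fixed offset, it could be generated rather than typed by hand. The Python below is a rough sketch of that idea, not part of the actual library: `unrolled_blit` is a hypothetical helper, and it assumes each sprite entry is one 2-byte mask:data word, so the displacement advances by 2 per entry.

```python
def unrolled_blit(words, stride=2):
    """Return assembly text for `words` unrolled mask/merge steps.

    stride is the assumed size in bytes of one sprite entry (mask:data word).
    """
    lines = []
    for i in range(words):
        offset = i * stride
        # The first entry needs no displacement at all.
        operand = "[bx]" if offset == 0 else f"[bx+{offset}]"
        lines += [
            f"mov dx,{operand}",  # load sprite data/mask
            "lodsw",              # load existing screen pixels
            "and al,dh",          # mask out sprite
            "or al,dl",           # or sprite data
            "stosw",              # store modified screen pixels
        ]
    return "\n".join(lines)

print(unrolled_blit(3))
```

The output can be pasted into the assembler source or written to an include file, which makes it easy to retry the experiment for different sprite widths.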