Note: This is the June 17, 2006 capture of http://bentong.topcities.com/comp/progs/c/emit.htm from the Wayback Machine. This was uploaded to the reloaded ISLESV.NET on June 22, 2023.

Turbo/Borland C/C++ comes with a curious “miscellaneous” function called __emit__(). According to Herbert Schildt’s Turbo C/C++: The Complete Reference, “The __emit__() function is used to insert one or more values directly into the executable code of your program at the point at which __emit__() is called. These values will be 8086 (family) machine instructions. … You must be an expert 8086 assembly language programmer to use __emit__(). If you insert incorrect values, your program will crash.” (Osborne McGraw-Hill, 1992)

You pass to the function either byte (8-bit) or word (16-bit) values. You can pass more than one argument in each call, but a 16-bit value which can fit in a byte will be reduced to a byte. For example, you pass 0x0000, intending a word value (two bytes) which is zero. This is reduced to a single zero byte instead. If you continually use __emit__() in your C/C++ programs (not a good idea by the way), this can be a source of very subtle bugs. The value must also be known at compile-time; this means that you can’t pass variables as arguments to __emit__() — they must all be constants. To use the function, add an include directive for dos.h.
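
The reduction rule is easier to see with a small model. The sketch below is portable C, not Borland's implementation: model_emit() is a hypothetical helper that appends a value to a buffer the way the paragraph above describes. Writing a word low byte first is an assumption based on the usual 8086 byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the rule described above, not Borland's code.
   Appends one value to buf and returns the new length. A value that
   fits in 8 bits becomes one byte; anything larger becomes two bytes,
   low byte first (assumed 8086 word order). */
size_t model_emit(unsigned char *buf, size_t len, size_t cap, unsigned value)
{
    assert(buf != NULL);
    assert(value <= 0xFFFFu);          /* only byte or word values */
    if (value <= 0xFFu) {
        assert(len + 1 <= cap);
        buf[len++] = (unsigned char)value;
    } else {
        assert(len + 2 <= cap);
        buf[len++] = (unsigned char)(value & 0xFFu);
        buf[len++] = (unsigned char)(value >> 8);
    }
    return len;
}
```

In this model, passing 0x0000 adds a single byte, while passing 0x00 twice adds the two zero bytes that were intended. Spelling out each byte separately is the safer habit.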

I, for one, cannot find a good use for __emit__(). Why would anyone want to insert explicit values in the binary code? Perhaps the nearest thing to a “good use” is to put breakpoint interrupts in the file (the opcode is 0xCC), so that you will be returned to the debugger at the point reached. But with the integrated debugger, you can set breakpoints more easily.

So why is __emit__() available? Because C was first used in conjunction with assembly language. C is the closest to assembly among high-level languages. If you are an expert assembly language programmer, you may want to experiment with __emit__().

A C Program Written Entirely with __emit__() Statements

To see the power of __emit__(), we present a simple program that uses solely __emit__() statements to output to the screen something meaningful (the string “This program is written entirely with __emit__() statements.”).

WARNING: You can compile the example program using any C/C++ compiler that supports __emit__(), but you must be an expert assembly language programmer, or at least have some experience writing assembly programs, if you want to experiment with the code. It is easy to get something wrong in an assembly program and end up with a nasty surprise. With __emit__(), you must not only work out how to express the program flow in assembly; sometimes you must also translate those statements by hand into explicit hexadecimal codes.

Setting DS Equal to CS

The program is compiled using the SMALL memory model. With memory models other than TINY, the code and data segments are separate from one another. This means that when execution conceptually enters main() (an approximation, since startup code runs first), the CS and DS registers will have different values. Since we place the string in the code segment, we must copy the CS value into the DS register so that our code and data segments are the same. This is in preparation for INT 21H, Service 09H, the Print String Service, which we will use to output the string. This interrupt requires that the following registers be set to the corresponding values:

AH = 09H
DS:DX = pointer to the character string

So in assembly language we will say:

PUSH	CS	; put the value of the CS register to the stack
POP	DS	; and pop the value to the DS register

After these two instructions, DS=CS, and we can take care of DX next.

Setting DX to Point to the Start of the Character String

The string comes after the last instruction which exits our program. Since we have set DS=CS, we can find the offset of the string in the segment, if only we know where we are currently in the segment. That is, if we know where we are with respect to the start of the segment, then we will know where the character string is with respect to the start of the segment, because we know how far we are from the string.

You might ask, “But are we not at the start of the segment? We are just starting the program. This is our first instruction inside main().” Unfortunately, main() is not the actual entry point of a C program in machine terms. There is a set of instructions which sets up the CS and DS registers, the stack, and other necessary things before it even calls main(). (That’s right: main() is called. Ever wonder why you can end main() with a return statement? If you want to see the source for the STARTUP code that calls main(), it is available as C0.ASM in your Borland C++ 4.5 installation (I don’t know if it comes with the other versions), in the directory LIB\STARTUP.)

Even if you look at the source code for the startup routine, you would have a problem figuring out its size. Therefore it is easier to just find a way to know where we are in the code. Remember that the IP register always points to the next instruction. If we could only push the value of the IP register onto the stack, and pop it to DX… Unfortunately, there is no PUSH IP instruction in the 8086 family. You have to do it another way.

When you do a CALL, however, the IP (which by then holds the address of the instruction after the CALL) is pushed onto the stack. This is functionally equivalent to PUSH IP. Therefore, utilizing a technique common to DOS viruses, we call the next instruction:

	CALL	NEXT_INSTRUCTION
NEXT_INSTRUCTION:
	POP	BP

Now BP has the value of IP, that is, the offset of the label NEXT_INSTRUCTION. We can then move the value in BP to DX. The last thing to do is to add to this value the distance from the label NEXT_INSTRUCTION to the start of the character string. If you know the length of each instruction, you can simply sum them up. In our case, we counted 15 bytes, so we add 15 to DX.
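
The count of 15 can be checked mechanically. The sketch below sums the encoded length of each instruction between NEXT_INSTRUCTION and the string, using the encodings that appear in the source listing later in this article. The names insn_lengths and sum_lengths are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Encoded length in bytes of each instruction between the label
   NEXT_INSTRUCTION and the start of the string. */
const unsigned char insn_lengths[] = {
    1, /* POP BP         5D          */
    2, /* MOV DX,BP      89 EA       */
    4, /* ADD DX,imm16   81 C2 lo hi */
    2, /* MOV AH,09H     B4 09       */
    2, /* INT 21H        CD 21       */
    2, /* MOV AH,4CH     B4 4C       */
    2  /* INT 21H        CD 21       */
};

/* Sum the lengths to get the distance from the label to the string. */
unsigned sum_lengths(const unsigned char *lengths, size_t count)
{
    unsigned total = 0;
    size_t i;

    assert(lengths != NULL);
    for (i = 0; i < count; i++)
        total += lengths[i];
    return total;
}
```

Calling sum_lengths() on the table gives 15, or 0FH, which is the value added to DX. If you insert or remove an instruction, the table and the ADD operand must change together.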

Calling Interrupts

The next statements are pretty self-explanatory:

	MOV	AH,09H	; Print String Service
	INT	21H	; DOS call
	MOV	AH,4CH	; Terminate Program
	INT	21H	; DOS call
	DB	'This program is written entirely with __emit__()'
	DB	' statements.$'

Note that the dollar sign character is needed by DOS’ Print String Service to know where the string ends.
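
To make the terminator rule concrete, here is a small portable C sketch of how such a scan behaves. It only models the stopping rule and is not DOS code; dollar_length is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters that would be printed: everything before the
   first '$'. Unlike a C string function, the scan does not stop at
   '\0', so the caller must guarantee that a '$' is present. */
size_t dollar_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '$')
        n++;
    return n;
}
```

One consequence of this rule is that a string printed with this service cannot itself contain a dollar sign, and forgetting the terminator makes the service keep printing whatever bytes follow.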

The Complete Program in Assembly Language

Our entire program, which first sets up the DS register to equal the CS register, then gets the current execution point and adjusts it to the start of the character string, would look like the following in assembly language:

	PUSH 	CS
	POP	DS	; DS is now = CS
	CALL NEXT_INSTRUCTION
NEXT_INSTRUCTION:
	POP	BP
	MOV	DX,BP
	ADD	DX,0FH
	MOV	AH,09H	; Print String Service
	INT	21H	; DOS call
	MOV	AH,4CH	; Terminate Program
	INT	21H	; DOS call
	DB	'This program is written entirely with __emit__()'
	DB	' statements.$'

Note that I did not actually pass this through an assembler; if you want to, you need to add the necessary headers and footers needed by your assembler. Then you can link it and produce a .COM file.

Three Ways to Do It

If you assembled the preceding program (barring any errors), the resulting COM file could easily have been passed through a C program that simply maps the individual bytes to a format suitable as input to __emit__(). The central line of the said program will simply be something like this:

    while ((c = getc(infile)) != EOF)
        fprintf(outfile, ", 0x%02x", c);

where infile and outfile are FILE * to the corresponding files, c is a declared int variable, the necessary headers have already been output, and the necessary footers are to be output next. The output file is assumed to be a legal C program composed only of __emit__() statements.
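
A fuller version of that loop might look like the sketch below. It is only one possible shape: bytes_to_emit is a hypothetical name, the header and footer text are assumptions, and it writes one __emit__() statement per byte rather than a comma-separated list, so that every argument is unambiguously a byte.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read every byte from in and write a C program
   to out that contains one __emit__() statement per byte.
   Returns the number of bytes converted. */
long bytes_to_emit(FILE *in, FILE *out)
{
    long count = 0;
    int c;  /* int, not char, so EOF can be told apart from 0xFF */

    assert(in != NULL && out != NULL);
    fputs("#include <dos.h>\n\nmain()\n{\n", out);   /* header */
    while ((c = getc(in)) != EOF) {
        fprintf(out, "\t__emit__(0x%02x);\n", c);
        count++;
    }
    fputs("}\n", out);                               /* footer */
    return count;
}
```

Open the COM file in binary mode ("rb") before passing it in; in text mode, DOS would translate some bytes and stop at the first 1AH.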

The first approach is obviously the easiest way out, and it does not require you to know the opcodes that those mnemonics really represent. While this approach is fine, if you use __emit__() statements in your C code at all, you will probably soon tire of running an assembler and a converter for every change.

There are two remaining ways to produce your own hexadecimal codes. The first involves using DEBUG, a primitive debugger program that comes with DOS and Windows. You use its assemble command, and input the one line you want to translate into hex.

The last way is the most difficult, or should I say challenging. You learn the opcodes yourself and translate the assembly instruction you want to binary using your own head. The best resource you could have if you want to produce the binary code yourself and not pass the attempted assembly program through a legitimate assembler is a thick book on 8086 assembly language programming, which should contain a listing of the mnemonics and their corresponding opcodes.

If you prefer this approach, I should say you have every reason to like __emit__(). This is the approach we use.

The Source

Here is the source to a sample program written entirely in C. It simply prints out the string “This program is written entirely with __emit__() statements.” and then quits.

/*
	EMIT.C	Copyright (C) 2004 by Vincent "Bentong" Isles
		http://bentong.topcities.com
		bentong_isles@yahoo.com
*/

#include <dos.h>	/* __emit__() declared here */

main()
{
	__emit__(0x0e);			/* PUSH CS */
	__emit__(0x1f);			/* POP DS */
	__emit__(0xe8, 0x00, 0x00);	/* call the next instruction */
	__emit__(0x5d);			/* pop bp -- old trick */
	__emit__(0x89, 0xea);		/* mov dx, bp :-) */
	__emit__(0x81, 0xc2);		/* add dx, */
	__emit__(0x0f, 0x00);		/* 0x0f bytes away */
	__emit__(0xb4, 0x09);		/* MOV AH,9 */
	__emit__(0xcd, 0x21);		/* INT 21H */
	__emit__(0xb4, 0x4c);		/* MOV AH,4CH */
	__emit__(0xcd, 0x21);		/* INT 21H */
	__emit__('T', 'h', 'i', 's', ' ', 'p', 'r', 'o', 'g', 'r', 'a', 'm');
	__emit__(' ', 'i', 's', ' ', 'w', 'r', 'i', 't', 't', 'e', 'n', ' ');
	__emit__('e', 'n', 't', 'i', 'r', 'e', 'l', 'y', ' ', 'w', 'i', 't');
	__emit__('h', ' ', '_', '_', 'e', 'm', 'i', 't', '_', '_', '(', ')');
	__emit__(' ', 's', 't', 'a', 't', 'e', 'm', 'e', 'n', 't', 's', '.', '$');
}

Notes

As I was writing the very first emit program, I immediately used the viral technique of POPping the IP from the stack into the BP register. Actually, you can shorten the program by POPping the IP directly into DX, which makes the MOV DX,BP unnecessary. Replace the POP BP, MOV DX,BP, and ADD DX statements with:

	__emit__(0x5a);		/* pop dx */
	__emit__(0x81, 0xc2);	/* add dx, */
	__emit__(0x0d, 0x00);	/* two bytes smaller */

instead.
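
Whenever the code between the label and the string changes, the ADD operand has to change with it. The portable C sketch below checks that for the shortened variant by comparing the immediate stored in the ADD DX instruction with the real distance from the label to the string. The array short_variant and the function offset_is_consistent are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Machine code of the shortened variant, up to the first string byte. */
const unsigned char short_variant[] = {
    0x0e,                   /* index 0:  PUSH CS              */
    0x1f,                   /* index 1:  POP DS               */
    0xe8, 0x00, 0x00,       /* index 2:  CALL next instruction */
    0x5a,                   /* index 5:  POP DX (the label)   */
    0x81, 0xc2, 0x0d, 0x00, /* index 6:  ADD DX,000DH         */
    0xb4, 0x09,             /* index 10: MOV AH,09H           */
    0xcd, 0x21,             /* index 12: INT 21H              */
    0xb4, 0x4c,             /* index 14: MOV AH,4CH           */
    0xcd, 0x21,             /* index 16: INT 21H              */
    'T'                     /* index 18: first string byte    */
};

/* Return 1 if the little-endian 16-bit immediate at imm_index equals
   the distance from label_index to string_index, otherwise 0. */
int offset_is_consistent(const unsigned char *code, size_t label_index,
                         size_t imm_index, size_t string_index)
{
    unsigned imm;

    assert(code != NULL && label_index <= string_index);
    imm = (unsigned)code[imm_index] | ((unsigned)code[imm_index + 1] << 8);
    return imm == (unsigned)(string_index - label_index);
}
```

For this array the label is at index 5, the immediate at index 8, and the string at index 18, so the distance is 13 bytes, matching the 0x0d in the code.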

You must treat each character of the string as a single entity; remember that __emit__() will only accept byte or word values.
