Viktor T. Toth - W - A brief tutorial

W is a programming language developed specifically for writing small, lightweight MS-DOS executable programs (.COM programs).

The Basics

The W programming language is a C-like programming language. A W program consists of a series of declarative statements that may define program functions or global data.

W is a keyword-less language. Every symbol used in a W program must be defined. A symbol is a combination of letters, digits, and the underscore character (_), with only two restrictions: a symbol cannot begin with a digit, and cannot be more than 254 characters long.

Regardless of how a symbol is defined, it essentially represents a 2-byte word. If the symbol was defined as the name of a function, it represents the machine address of the first instruction of that function. If the symbol was defined as the name of a variable or array, it represents the machine address of the variable, or the first element of the array.

Each declarative statement takes the following form:

symbol := something

Two variants of this syntax allow the declaration of a program function or an array:

symbol(argument-list) := something
symbol[array-size] := something

Each W program, in order to run, must have a function defined with a single underscore as its name:

_() := something

The right-hand side of a declarative statement is an expression; usually, a compound expression in the case of a function declaration, a constant expression in the case of a global variable, or a list of constant expressions in the case of arrays. For instance:

a[2] := 1, 2

x := 5

multiply(p, q) :=
{
	p * q
}

A compound expression is, of course, a series of expressions enclosed in curly braces. A distinguishing feature of W is that every expression has a value. In the case of a compound expression, this value is the value of the last expression within the compound expression.

A function in W always has a (2-byte) return value. This return value is simply the value of the expression on the right-hand side of the function declaration. In the example above, the value of the multiply function is, therefore, p*q.

An expression that is not a compound expression is usually a mathematical expression that consists of symbols, unary, and binary operators. For instance:

p * q
a / -b
(x + y) * (x - y)

A compound expression may also contain declarations. These declarations define local variables and arrays; declaring a function within a compound expression is not permitted. For instance, the following function computes the cube of the sum of its two parameters using a local variable:

cubesum(a, b) :=
{
	d := a + b
	d * d * d
}

It is also possible to assign a value to a symbol that has been defined previously. For instance:

x := a + b
y := a - b
a = x * y
b = x / y

The scope of a symbol is local to the compound expression in which it is defined. For instance:

{
	a := 2
	{
		b := 2
	}
	a = a * b  ; error! b not defined here
}

Lastly, if you wish to declare a variable or array without initializing it to a known value, you can use a question mark (yes, BCPL veterans, that's where it is coming from):

x := ?
a[10] := ?

Addresses and Pointers

Like C, W makes use of the powerful concept of a pointer. As mentioned earlier, every W symbol is in fact a pointer, an address in memory; however, in most expressions, symbols are automatically dereferenced. For instance, the expression a*b means the product of the values of a and b, not the product of these two variables' addresses.

It is, however, possible to explicitly refer to a variable's address using the pound sign: #a. Like in C, this is most useful when passing parameters to a function. Consider, for instance, the following example:

squarethis(a) :=
{
	a = a * a
}

What do you think will happen if you call this function from somewhere else like this:

squarethis(q)

If your answer is that the call will alter the value of q, you are mistaken. That is because squarethis receives a copy of q, a copy that will in fact be discarded when squarethis terminates. The original value of q will remain unchanged. It is, however, possible to rewrite this example as follows:

squarethis(p) :=
{
	@p = @p * @p
}

...

squarethis(#q)

In this example, squarethis is called not with the value of q, but its address. Inside squarethis, we use the dereference operator, @, to refer to the original value.

To better clarify the meaning of the # and @ operators and the role of addresses and pointers in W, let's take a numeric example; let's say that the symbol a is in fact a variable at address 1000 in memory, and the variable's value is 1234. The meaning of the various constructs is then as follows:

a: the value at address 1000, i.e., 1234
#a: the address of a itself, i.e., 1000
@a: the value at the address stored at 1000, i.e., the value at address 1234
a(): a call to program code at address 1000 (unless the value 1234 is meaningful as a machine language instruction, this call will likely produce garbage)
a[0]: assuming that a refers to an array, the value of the first array element at address 1000, i.e., 1234
a[1]: the value of the second array element, i.e., whatever value is found at address 1002
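
As a further illustration of the # and @ operators, the following sketch exchanges the values of two variables. It uses only constructs already shown in this tutorial; the name swap and the variables x and y are made up for this example, and the code has not been run through the W compiler.

```
swap(p, q) :=
{
	t := @p    ; t holds the value found at the address passed in p
	@p = @q    ; copy the value at address q to address p
	@q = t     ; store the saved value at address q
}

...

swap(#x, #y)
```

Calling swap(x, y) instead would pass the values of x and y, which swap would then treat as addresses, most likely overwriting unrelated memory.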

Forward References

The W compiler is a single-pass compiler that requires that a symbol be defined before it can be used. Therefore, the following example is illegal:

a :=
{
	...
	b()
	...
}

b :=
{
	...
	a()
	...
}

It is, however, possible to use constructs similar to these using helper variables:

b := ?

a :=
{
	...
	@b()
	...
}

_b :=
{
	...
	a()
	...
}
...
_() :=
{
	b = #_b
	...
}

Since b is defined before a, it can be used inside a. When the program starts, _() assigns to b the address of the function _b. When the function is called from a, the @ dereference operator is used because we do not want to call code at the location of b, where an address is stored; we want to call code at the address stored in b.

Conditional Expressions

The real power of a computer is due to its ability to choose alternate paths of execution based on the results of prior computations. Languages like C allow complex control structures through keywords like if, while, switch, and more. In W, conditional execution is accomplished using the conditional operator, ?. The basic syntax of a conditional expression is as follows:

expression-1 ? expression-2, expression-3

When a conditional expression is encountered, expression-1 is evaluated first. If its value is non-zero, expression-2 is evaluated next, and expression-3 is skipped altogether. If expression-1 evaluates to zero, expression-2 is skipped, and expression-3 is evaluated instead. The comma and expression-3 are optional and can be omitted.

For instance, the following function computes the maximum of two values:

max(a, b) := a > b ? a, b
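
Conditional expressions can also be nested, and the comma part can be left out. The following sketch limits a value to a range; it assumes that a conditional expression may be parenthesized and used as an operand, as the prime factor example at the end of this tutorial does. The names clamp, lo, hi, x, and limit are illustrative only.

```
clamp(v, lo, hi) := v < lo ? lo, (v > hi ? hi, v)

...

x > limit ? x = limit  ; no comma: nothing is done when the condition is zero
```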

This is as good a place as any to remark on the fact that W does not require instructions to be terminated by a special character (like the semicolon in C). This can lead to some unexpected results. For instance, consider the following program fragment:

b = a
(x > y) ? d = c

You might think that this program fragment will first assign the value of a to b, and then evaluate the expression x > y. That, however, is not the case. If you rewrite this expression on a single line, you might see why:

b = a(x > y) ? d = c

In other words, first x > y will be evaluated. The result will be used as a parameter to a call to a function at address a; the return value of this function will be assigned to b, and will also be used as a condition for evaluating the assignment d = c. Unless a actually is the name of a function, chances are that this expression will cause the program to crash.
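
One way to sidestep this particular trap is to avoid beginning a line with an opening parenthesis. Judging by the other examples in this tutorial, a line that begins with a symbol starts a new expression. This is a sketch under that assumption, not a documented rule of the language:

```
b = a
x > y ? d = c
```

Here the second line begins with the symbol x rather than with a parenthesis, so it should not be read as an argument list for a call to a.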

The Instruction Pointer

Conditional expressions alone are not sufficient to implement all forms of control structures; for instance, they do not allow for loop constructs. To eliminate this shortcoming without resorting to keywords like for or goto, W introduces the concept of the instruction pointer. The instruction pointer represents the address of the next instruction line in the program.

This is best demonstrated through some simple examples. For instance, the following lines can be used to compute the sum of integers between 1 and 10:

s := 0
i := 1
p := $
i <= 10 ?
{
	s = s + i
	i = i + 1
	$ = p
}
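
The same pattern can be wrapped in a function. The following sketch raises a value to a power by repeated multiplication; the name power and its parameter names are made up for this example, and it assumes that a lone symbol on the last line is a valid expression that supplies the function's value.

```
power(base, n) :=
{
	r := 1
	p := $         ; p holds the address of the next line, the loop test
	n > 0 ?
	{
		r = r * base
		n = n - 1  ; n is a local copy, so the caller's value is unaffected
		$ = p      ; jump back to the loop test
	}
	r              ; value of the compound expression, hence of the function
}
```

Under these assumptions, power(2, 10) should evaluate to 1024.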

Careless use of the instruction pointer can cause problems. If you leave a compound expression by way of a $= assignment, variables local to that block will not be deallocated, and all sorts of evil things are likely to occur. For instance, the following program may not perform as expected:

s := 0
i := 1
p := $
i <= 10 ?
{
	i2 := i * i  ; bad idea!
	s = s + i2
	i = i + 1
	$ = p
}

Every time the inner block is entered, space is allocated for the variable i2. However, when the $=p instruction is encountered, execution resumes outside of the block, but i2 will never be deallocated. The next time the block is entered, memory is allocated again for i2. Eventually, this may lead to exhaustion of available memory, or it may cause the program to simply get lost and crash while referencing an invalid memory location.

The solution is to make sure that there are no $= assignments within any block in which variables are allocated:

s := 0
i := 1
p := $
i <= 10 ?
{
	{
		i2 := i * i
		s = s + i2
	}
	i = i + 1
	$ = p
}

In this program, i2 is allocated and deallocated safely before execution is transferred to the beginning of the loop by the $= instruction.

Constant Expressions

Throughout the examples presented on this page, decimal constants have been used as initializers or in expressions. The W language also permits the use of other types of constants. Hexadecimal constants begin with 0x, followed by up to four hexadecimal digits; for instance, 0xA000. Character constants are letters enclosed in single quotes, and are resolved into the ASCII value of the character: 'A', for instance, is the same as writing 65. Lastly, string constants are enclosed in double quotes; the string is stored in memory along with the program, and the string constant references the string's address.

Character and string constants can both contain escape sequences that represent special, non-printable characters. The following escape sequences are recognized:

\0: the null character (ASCII 0)
\t: the tab character (ASCII 9)
\n: the newline character (ASCII 10)
\r: the carriage return character (ASCII 13)
\xNN: any character represented by the ASCII code NN, where NN are two hexadecimal digits.

Note that unlike C, W does not automatically append a terminating null character to string constants. This must be explicitly added to the string.
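
To make the constant forms concrete, here is a short sketch of declarations that use them. The symbol names are arbitrary, and the last line assumes that a string constant is acceptable as the initializer of a global variable.

```
a := 'A'              ; same as a := 65
b := 0x41             ; also 65, written in hexadecimal
c := '\x41'           ; also 65, written as an escape sequence
nl := '\n'            ; 10
msg := "Hello\r\n\0"  ; address of an 8-byte string: H e l l o CR LF NUL
```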

Machine Language Subroutines

It is possible to include machine language subroutines in a W program, if the subroutine is encoded numerically. (You would normally use a line assembler, like the MS-DOS debug command, to encode the lines.) Care should be taken to ensure that bytes are encoded in the proper order; since the Intel architecture is little-endian, the least significant byte of a two-byte word will precede the most significant byte in memory.

A machine language subroutine can reference parameters passed to the function. Such a machine language subroutine would usually begin with the following preamble:

PUSH BP
MOV  BP, SP

Subsequently, parameters can be referenced using [BP+4], [BP+6], etc. [BP+4] references the rightmost parameter passed to the subroutine, [BP+6] references the second parameter from the right, etc.

A machine language subroutine must either leave unchanged or restore the four segment registers, BP, and DI before executing a RET instruction to return to the calling program.

As an example, consider a machine language subroutine that uses the REPNZ SCASB instruction to compute the length of a string. The pointer to the string is passed to the subroutine as its single parameter. The computed length is left in the AX register, which will be treated as the subroutine's return value. This subroutine can be implemented using Intel machine language as follows:

55            PUSH    BP
89E5          MOV     BP,SP
57            PUSH    DI
8B7E04        MOV     DI,[BP+04]
B9FFFF        MOV     CX,FFFF
FC            CLD
30C0          XOR     AL,AL
F2            REPNZ
AE            SCASB
B8FEFF        MOV     AX,FFFE
29C8          SUB     AX,CX
5F            POP     DI
5D            POP     BP
C3            RET
90            NOP

The NOP instruction at the end is simply there to ensure that the function consists of an even number of bytes, and can be encoded as a sequence of two-byte words. The entire function can then be encoded in W as follows:

strlen := 0x8955, 0x57E5, 0x7E8B, 0xB904, 0xFFFF, 0x30FC, 0xF2C0, 0xB8AE,
	  0xFFFE, 0xC829, 0x5D5F, 0x90C3

A call to strlen will work as you would expect. For instance, the expression strlen("ABCD\0") will yield a value of 4.
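
A second, smaller sketch shows how two parameters are accessed. It was assembled by hand for this illustration and has not been tested with the W compiler; the name add2 is made up. The subroutine adds its two parameters and leaves the sum in AX:

```
55            PUSH    BP
89E5          MOV     BP,SP
8B4604        MOV     AX,[BP+04]
034606        ADD     AX,[BP+06]
5D            POP     BP
C3            RET
90            NOP
```

Pairing the bytes into little-endian words in the same way as for strlen gives:

```
add2 := 0x8955, 0x8BE5, 0x0446, 0x4603, 0x5D06, 0x90C3
```

If the encoding is correct, the expression add2(3, 4) should yield 7. The subroutine does not touch DI or the segment registers, and it restores BP before returning.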

Program Arguments

The main function of a W program, _(), can have a single, optional argument. If it is present, it is initialized as a pointer to a null-terminated string that contains any arguments used when invoking the program from the command line.
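
As a minimal sketch of how the argument might be used, the following program echoes its command line. It assumes the strlen subroutine from the previous section, and the write function and stdout constant from the author's library, called the same way as in the prime factor example in the next section.

```
_(arg) :=
{
	write(stdout, arg, strlen(arg))  ; arg already holds the string's address
	write(stdout, "\r\n", 2)
}
```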

Some Examples

The first "real life" programming example is a small W program that computes the factorial of a numeric value that is passed to the program on the command line. Here is a usage example:

C:\>fact 5
The factorial of 5 is 120.

The complete program contains several library functions; these are, however, omitted here for brevity. What remains is a few lines of program code:

factorial(n) := n>1 ? n*factorial(n-1), 1

_(arg) :=
{
    n := atoi(arg)
    printf(factorial(n), n, "factorial\0", "The %s of %d is %d.\r\n\0", stdout)
}

The definition of the factorial is recursive. The printf and atoi functions, as well as the stdout constant, are part of a small library of functions that I wrote and that is common to most W programs.

The second example is a bit more complex; it is used to compute the prime factors of an integer:

C:\>prime 65430
2 * 3^2 * 5 * 727

This program demonstrates the use of multiple looping constructs, conditionals, and the use of an uninitialized array as a placeholder for character operations.

_(arg) :=
{
	nVal := atoi(arg)
	nFct := 2
	s[3] := ?
	l := ?
	nPow := ?

	p := $
	{
		nPow = 0
		q := $
		nVal % nFct == 0 ?
		{
			nPow = nPow + 1
			nVal = nVal / nFct
			$ = q
		},
		{
			nPow ?
			{
				l = itoa(nFct, #s)
				write(stdout, #s, l)
				nPow > 1 ?
				{
					write(stdout, "^", 1)
					l = itoa(nPow, #s)
					write(stdout, #s, l)
				}
				nVal > 1 ? write(stdout, " * ", 3)
			}
		}
		nFct = nFct + (nFct > 2 ? 2, 1)
	}
	nFct * nFct <= nVal ? $ = p  ; <= so that squares of primes (9, 25, ...) are factored

	nVal > 1 ?
	{
		l = itoa(nVal, #s)
		write(stdout, #s, l)
	}
	write(stdout, "\r\n", 2)
}
