Computer Languages and Entry Points

Every computer language has what I would call an “entry point”: a set of things you need to ‘grok’ in order to understand that language. They’re fundamental assumptions about the shape of the universe which are just “off” enough that you’ll forever be confused if you don’t get them. (Or at least they’re things that confused me until I got them.)

Here are a few languages I spent time learning and the ‘pivot points’ around those languages.

LISP
* It’s all about recursion. Learn to live by, and love, recursion.
* Because there is no real natural ‘syntax’ per se, pick a ‘pretty print’ style and stick with it.
* There is no such thing as a single “program”; applications just sort of ‘jell’. This is unlike C or Pascal programs, which definitely have a beginning, middle, and end to their development. (This fact made me wonder why no one besides Paul Graham used LISP for web development.)

Pascal and C
* The heap is your friend. Understand it, appreciate it, love it.
* The heap is your enemy; it is a mechanism designed to kill your program at every opportunity if given a chance.

C++
* Objects are your friends. Instead of telling the program the steps to accomplish something, you describe the objects which are doing things. To someone who is used to thinking procedurally, this is a really fantastic brain fuck.
* While a->foo() looks like a pointer to a method, it is really a reference to a vtable which happens to hold a pointer to a method with an invisible argument. If you’re used to thinking of C as a glorified portable assembly language (and can even remember which form of a++, for a pointer a, translates into a single assembly instruction on a 680x0 processor), vtables help bridge the gap from “talking to bare metal” to “abstract object oriented programming.”
* Resource Acquisition Is Initialization.

Java
* The package hierarchy is like a parallel “code-space” file system.
* When learning Java, also learn the run-time library. Love it or hate it, you should learn the java.lang package first, followed by java.util, java.io, and then java.net. All the rest is icing.

Objective C
* Learn to love the Smalltalk-like syntax. For someone who cut their eye-teeth on LISP, C++ and Java, the Smalltalk-like syntax leaves you wondering what to even call a method; after all, every other language gives a function a single name and, like any good notation based on math, puts that name before the parameters. Not Smalltalk, and not Objective-C.
* If you cut your eye-teeth on the AddRef()/Release() rules in Microsoft’s COM environment, the retain/release rules are fucking odd. With Microsoft, the rule was simple: if something hands you a pointer to an object, you own it; you release it. If you pass a pointer to an object along, you must increment the reference count on the object first, because passing the object means you’re passing ownership.

Not so in Objective-C, which, in adding ‘autorelease’, has made the rules a little more complicated. Now the rules seem to be:

(1) If you are handed a pointer to an object, you must explicitly retain it if you intend to hold on to it.
(2) If you pass an object to another object, it will retain it, so you need to do nothing.
(3) If you create an object out of thin air, you own it–and if you don’t need to hold it, you must autorelease it so it gets released.

Now (3) gets really tricky: there is a naming convention which tells you whether you ‘created an object out of thin air.’ Basically, if the name of the method you called starts with ‘alloc’ or ‘new’ or contains the word ‘copy’, you own it, and you have to let it go with autorelease. (Why not ‘release’ instead? Because ‘release’ causes it to go away now, while ‘autorelease’ causes it to go away “eventually.”)

I long for the simplicity of COM, even though it creates additional verbiage in your code. Though I’d much rather have the garbage collection of Java or LISP: just drop the damned thing, and it’ll go away on its own.

Multi-threaded programming

While not strictly a language, it does impose its own rules on the universe, and there are a few key things to ‘grok’ to get it:

* There really are only a few “safe” design models for building multi-threaded code. You have thread pools, single-threaded event pumps, and re-entrant data structures (which can be as simple as using a semaphore to grant exclusive access, or as complicated as using read/write locks on a complex data structure)–and that’s it. In other words, you either have units of work which can be isolated into a single object and worked in parallel (because they’re unrelated), which is good for a thread pool; or you have a process (such as a user interface) where you shove all your parallelism into a queue which is operated on by a single thread (as is done in most UI libraries); or you have the specialized case of a re-entrant data structure which was designed from the ground up to be multi-thread ready.

And that’s it.

Anything else is a bug waiting to be discovered.
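
To make the first two models concrete, here is a minimal Java sketch of my own (the class name SafeModelsSketch and the use of java.util.concurrent are my illustration, not anything prescribed above): a thread pool working isolated jobs in parallel, and a single-threaded event pump draining a queue.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SafeModelsSketch {
    public static void main(String[] args) throws InterruptedException {
        // Model 1: a thread pool. Each unit of work is isolated and unrelated
        // to the others, so the units can safely run in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; i++) {
            final int job = i;
            pool.submit(() -> System.out.println("worked job " + job));
        }
        pool.shutdown();

        // Model 2: a single-threaded event pump. All parallelism is shoved
        // into a queue which is drained by exactly one thread.
        BlockingQueue<Runnable> events = new ArrayBlockingQueue<>(100);
        Thread pump = new Thread(() -> {
            try {
                while (true) {
                    events.take().run();   // one event at a time, in order
                }
            } catch (InterruptedException e) {
                // interrupted: shut the pump down
            }
        });
        pump.start();

        events.put(() -> System.out.println("event handled on the pump thread"));
        Thread.sleep(100);                 // let the pump drain (demo only)
        pump.interrupt();
    }
}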

* If you need to create a re-entrant data structure that is more complicated than a semaphore-protected object, you really, really need to avoid deadlocks by making sure every code path acquires its locks in a consistent order (see the sketch after these bullets).

* Oh, and for testing purposes it is quite helpful to create a semaphore object or a locking object wrapper which times out (and fails noisily) after some set period, such as five minutes, preferably blowing up by killing every thread waiting on the object and dumping a complete stack trace.
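
Here is a minimal sketch of those last two bullets in Java, under my own assumptions (a hypothetical TimedLock wrapper; it dumps every thread’s stack and throws, rather than actually killing the waiting threads): acquisition fails noisily after a timeout, and a rank field enforces a consistent locking order.

import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical test-time lock wrapper: acquire() blows up noisily if the lock
// cannot be taken within the timeout, which usually means a deadlock.
public class TimedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private final String name;
    private final int rank;          // locks must always be taken in ascending rank

    public TimedLock(String name, int rank) {
        this.name = name;
        this.rank = rank;
    }

    public int rank() { return rank; }

    public void acquire(long timeout, TimeUnit unit) {
        try {
            if (!lock.tryLock(timeout, unit)) {
                // fail noisily: dump every thread's stack so the deadlock is visible
                for (Map.Entry<Thread, StackTraceElement[]> e
                        : Thread.getAllStackTraces().entrySet()) {
                    System.err.println(e.getKey());
                    for (StackTraceElement frame : e.getValue()) {
                        System.err.println("    at " + frame);
                    }
                }
                throw new IllegalStateException("timed out waiting for lock " + name);
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for lock " + name);
        }
    }

    public void release() { lock.unlock(); }

    // Consistent lock ordering: always take the lower-ranked lock first,
    // no matter which order the caller named them in.
    public static void acquireBoth(TimedLock a, TimedLock b, long timeout, TimeUnit unit) {
        TimedLock first = a.rank() <= b.rank() ? a : b;
        TimedLock second = (first == a) ? b : a;
        first.acquire(timeout, unit);
        second.acquire(timeout, unit);
    }
}

In a test build you give each lock a name and a rank, acquire multiple locks only through something like acquireBoth(), and set the timeout to five minutes or so, so a deadlock shows up as a loud failure with stack traces instead of a silent hang.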

Cheap Two-Factor Authentication

Security authentication relies upon three factors: what you know, what you are, and what you have.

What you know: the canonical example of this is a password. It’s something you know: something you’ve memorized and can repeat when asked. This is the PIN on an ATM card, or the answer to a question like “your mother’s maiden name”.

What you are: this is some physical attribute about yourself. Your fingerprint, your eye color or the pattern of veins at the back of your eye–or the relative length of the fingers on your hand: these are all attributes which are things that you are.

What you have: this is a physical object that is in your possession. The perfect example of this is the key to your house: it is a physical object that allows you to get into your home.

The idea of two-factor authentication is simple: by having two of the three items above, you can prove that you’re ‘you’ and get access to your account, your money, or your property. A perfect example of two-factor authentication is using your ATM card to withdraw money: you cannot withdraw money unless you can show you have something (the card) and you know something (the PIN). The strength of two-factor authentication relies upon the relative strength of each factor: in a sense, the overall strength is the strength of the first factor times the strength of the second.

Which is why ATMs are secure even though their passwords (a 4 to 6 digit PIN) are so bloody weak: the password itself is easily guessed, but you also need to have something in order to withdraw money. Having something times knowing something is stronger than just knowing something by itself–even if the thing you know can be easily guessed.

This also illustrates the danger of some types of two-factor authentication: they can easily collapse into one-factor authentication (making it extremely easy to steal your money) through a simple act. In the case of an ATM card, two-factor authentication becomes one-factor authentication if you write your PIN on your ATM card. Now anyone who has your ATM card can withdraw money–and they don’t have to know your PIN; they can just read it off your card.

Another example of two-factor authentication which collapses into one-factor authentication would be a pre-printed card with a random number on it: you can memorize the random number on the card–essentially turning your ‘two-factor’ authentication into memorized one-factor authentication. Not that this is bad: generally, longer passwords are more secure than shorter ones, and banks are missing a bet when they limit ATM PINs to 4 to 8 digits. Even so, this really is no longer two-factor authentication–which is why there are devices out there (such as key fobs) which generate a number on a synchronized clock: the number constantly changes in a seemingly random way, forcing you to have the device in your possession so you can enter the currently displayed number.
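
The key fob idea can be sketched in a few lines of Java. This is a toy of my own (real fobs use a keyed hash of a shared secret per published one-time-password standards, not a bare SHA-256), but it shows why you must hold the device: both the fob and the bank compute the same short-lived number from a shared secret and the current 30-second window.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy sketch of a clock-synchronized code: hash the shared secret together
// with the current 30-second window and keep six digits.
public class ClockCodeSketch {
    public static int codeFor(String sharedSecret, long unixSeconds) throws Exception {
        long window = unixSeconds / 30;                  // changes every 30 seconds
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha.digest((sharedSecret + ":" + window)
                .getBytes(StandardCharsets.UTF_8));
        // fold the digest into a non-negative int and keep six digits
        int value = ((digest[0] & 0x7f) << 24) | ((digest[1] & 0xff) << 16)
                | ((digest[2] & 0xff) << 8) | (digest[3] & 0xff);
        return value % 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        long now = System.currentTimeMillis() / 1000;
        System.out.println(codeFor("fob-secret-1234", now));
    }
}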

Banks have been required for the past year to come up with two-factor authentication for on-line accounts–and they have failed dismally at this, as illustrated here: Wish-It-Was Two-Factor, where banks essentially require you to pick three passwords rather than one: a real password, a ‘question’ and an ‘answer’. It’s not really two-factor authentication: it’s simply a much longer (and harder to remember) password which frustrates people. And again, while a longer password is generally more secure, it’s still not two-factor authentication.

It struck me that there is one cheap way banks could create two-factor authentication out of something you know and something you have. It’s easy, really: when you open an account with the bank, they send you a piece of paper with a table of fifty random numbers (or words, or phrases), each numbered on the page. So, for example, on that page you’d see:

1: 148191     2: 381912    3: 108910

and so forth.

When you’re asked to log in, the login dialog box then asks for three things: your username, your password, and the number that is in cell N on your access page.

By making this list long enough, you virtually have to have the paper in your possession to supply the right number. And it is relatively easy for the legitimate user to look that number up–far easier than remembering the answer to “what is your third-favorite existential author” or some other nonsense. Which makes it honest-to-goodness two-factor authentication–and the cost to administer this is no greater than the cost of printing one additional page to send to the user.
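
A sketch of what the bank’s side of this might look like, in Java (the CodeCard class, the 50-by-6-digit layout, and the method names are my own assumptions drawn from the description above): generate the card once, print and mail it, keep a copy with the account, and at login challenge the user for one randomly chosen cell.

import java.security.SecureRandom;

// Hypothetical sketch of the printed-card scheme described above: the bank
// generates 50 random six-digit codes, mails the printed table to the user,
// keeps a copy, and at login asks for the code in one randomly chosen cell.
public class CodeCard {
    private static final SecureRandom random = new SecureRandom();
    private final int[] cells = new int[50];

    public CodeCard() {
        for (int i = 0; i < cells.length; i++) {
            cells[i] = random.nextInt(1_000_000);   // 000000 .. 999999
        }
    }

    // What gets printed and mailed to the customer.
    public String printable() {
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            page.append(String.format("%2d: %06d%s", i + 1, cells[i],
                    (i % 3 == 2) ? "\n" : "    "));
        }
        return page.toString();
    }

    // At login, pick a cell to challenge the user with...
    public int challengeCell() {
        return 1 + random.nextInt(cells.length);
    }

    // ...and check the answer alongside the username and password.
    public boolean verify(int cell, int answer) {
        return cells[cell - 1] == answer;
    }

    public static void main(String[] args) {
        CodeCard card = new CodeCard();
        System.out.println(card.printable());
        System.out.println("Please enter the number in cell " + card.challengeCell());
    }
}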

If you use the idea, you don’t have to credit me.

Yesterday’s Design Patterns

I blog for a variety of reasons. To practice my writing and communications skills. To vent frustration in a safe place. For public intellectual exhibitionism.

Today I blog to remember something I had forgotten.

A design pattern, generally speaking, is a common pattern that we use to solve a problem. Unfortunately that observation has been reduced by most programmers I know to an exercise in name-calling–and, like the high school practice of teaching history as a dry list of dates and event names, it has a way of turning a wonderful observation about how problems can be solved using common themes into mindless recital. (The next time someone asks me “why didn’t you use a fubar pattern here?” I’m going to smack them.)

Design patterns, however, are not just a dry list of names and UML class diagrams, just as history is not a dry list of dates and event names. And some design patterns from yesterday are nearly forgotten–at least I had nearly forgotten them–which as far as I’m concerned is the same thing.

All this came about because I was reading the paper Regular Expression Matching Can Be Simple And Fast. Because I’ve spent the past couple of months locked in meetings and other brain-scrambling exercises, I figured it would be good to exercise that most important muscle, the brain, on an interesting problem: writing my own regular expression matcher engine in Java, and doing it without looking at the code in the paper. (Yes, I know: I could just use java.util.regex, or compile the code in the paper–but the point here was not to mindlessly go “oook! regular expression!” like a brainless monkey. The idea here was to engage in a mental exercise. Do you think Sudoku puzzles are pulled out of thin air, without first computing a solution and then working backwards to create the puzzle?)


The first design pattern comes about in the discussion of the regular expression paper: the State Machine. While I’ve linked the Wikipedia article, the key thing to remember is that a state machine consists of a set of states S and a set of actions A, such that each state s1 triggers an action a1 and transitions to a new state s2 depending on the action’s outcome.

Typically states are represented as integers, and your state machine code often looks like this:

int state = 1;
boolean done = false;

while (!done) {
    switch (state) {
        case 1:
            // each action does one small chunk of work, then returns
            // the next state to enter
            state = doAction1();
            break;
        case 2:
            state = doAction2();
            break;
        ...
        case 0:
            // some action eventually returns a terminal state (0 here),
            // so the loop can stop
            done = true;
            break;
    }
}

What’s great about state machines is twofold. First, a state machine helps break a computational problem into small chunks of code. This can be useful if you need to write code that fits into a small footprint (like an embedded processor you’d find in a remote control or a watch or a calculator), and is especially useful in this context if you are emulating a multi-threaded process in an environment where you don’t have a multi-tasking operating system to rely upon. (I once built a print spool server that ran on an embedded 6809 processor without any operating system by building the entire network stack as a sequence of states in a state machine. I could then monitor up to 8 separate incoming network connections at the same time simply by having my program walk through 8 separate chunks of memory, each holding the state for one incoming connection, and running the state machine in each: update the state in slot 1, then the state in slot 2, then…, then the state in slot 8, then repeat with slot 1, etc.)

The second thing that is great about state machines is that they allow you to represent a dynamic program which is built from some input string (such as a language specification in BNF form, or a regular expression); in this case, each state is not an integer in a switch statement, but an object in a linked list of objects. It’s great: rather than hard-coding how your program works, you load the string, build the state machine, and your execution code simply works the state machine until it’s done.
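
As an illustration of states-as-objects, here is a minimal sketch of my own in Java; it handles only literal concatenation (nothing like a full regular expression engine), but it shows the shape: each state is an object holding the character it matches and a pointer to the next state, the machine is built from the input string, and the execution code just walks it.

// A minimal sketch (my own, not the paper's algorithm): states as linked
// objects built from a literal pattern, walked one input character at a time.
public class LiteralMatcher {
    // each state matches one character and points at the next state
    private static final class State {
        final char expected;
        State next;
        State(char expected) { this.expected = expected; }
    }

    private final State start;

    public LiteralMatcher(String pattern) {
        State head = null, tail = null;
        for (char c : pattern.toCharArray()) {
            State s = new State(c);
            if (head == null) head = s; else tail.next = s;
            tail = s;
        }
        start = head;
    }

    // returns true if 'text' contains the pattern, by running the machine
    public boolean matches(String text) {
        for (int i = 0; i < text.length(); i++) {
            State s = start;
            int j = i;
            while (s != null && j < text.length() && text.charAt(j) == s.expected) {
                s = s.next;
                j++;
            }
            if (s == null) return true;   // walked off the end of the machine: match
        }
        return start == null;             // an empty pattern matches anything
    }

    public static void main(String[] args) {
        System.out.println(new LiteralMatcher("ABC").matches("DABCDEF"));   // true
    }
}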


The second design pattern that came about when creating the regular expression parser–and I’m kicking myself because it took me four hours to notice this last night (because of the brain-scrambling effects of managerial meetings)–is the shunting yard algorithm, which is used to convert a statement in infix notation into a parse tree, which can then be turned into a state machine, or solved, or turned into assembly language code.

Typically the shunting yard algorithm is used to solve algebraic equations in infix notation, so we can convert the string “34 + 4*5 - (-3 + 4) * 8” into a parse tree (shown here in LISP-style notation): (- (+ 34 (* 4 5)) (* (+ (- 3) 4) 8)), which can eventually be solved:

(- (+ 34 (* 4 5)) (* (+ (- 3) 4) 8))
=> (- (+ 34 20) (* (+ -3 4) 8))
=> (- 54 (* 1 8))
=> (- 54 8)
=> 46

The nice thing about the shunting yard algorithm is that it works for any string which uses a series of prefix, postfix and infix operators: we could use it to solve, for example, -(1+2) + 3!, where the first minus sign is a negation operator, and the “!” is a factorial operator.

The algorithm itself is rather simple: you need an operator stack (a push/pop stack which stores the operators you’re using, such as ‘+’, ‘*’, etc.), and an ‘eval’ stack, which can be as simple as a push/pop stack of numbers, or a stack of parse trees, or whatever intermediate representation you’re using to store the equation. Then it’s just a matter of reading the string as a series of tokens: each value is pushed onto the eval stack (popping prefix operators as needed after the push, updating the eval stack as needed), and each operator is pushed onto the operator stack after first popping and evaluating any operators already on that stack with higher precedence (or equal precedence, for left-associative operators). The details are in the Wikipedia article linked above.
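
Here is a minimal Java sketch of the algorithm under some simplifying assumptions of mine: only binary +, -, * and / on pre-split integer tokens, no prefix or postfix operators, and the eval stack holds plain numbers, so it evaluates directly instead of building a parse tree.

import java.util.ArrayDeque;
import java.util.Deque;

// A minimal shunting yard sketch: an operator stack plus an eval stack.
// Tokenization, prefix/postfix operators, and error handling are left out.
public class ShuntingYardSketch {
    private static int precedence(char op) {
        return (op == '+' || op == '-') ? 1 : 2;
    }

    private static void apply(Deque<Integer> eval, char op) {
        int b = eval.pop(), a = eval.pop();
        switch (op) {
            case '+': eval.push(a + b); break;
            case '-': eval.push(a - b); break;
            case '*': eval.push(a * b); break;
            case '/': eval.push(a / b); break;
        }
    }

    // tokens: integers, "+", "-", "*", "/", "(", ")"
    public static int evaluate(String[] tokens) {
        Deque<Integer> eval = new ArrayDeque<>();   // values (or parse trees)
        Deque<Character> ops = new ArrayDeque<>();  // pending operators

        for (String t : tokens) {
            char c = t.charAt(0);
            if (Character.isDigit(c)) {
                eval.push(Integer.parseInt(t));
            } else if (c == '(') {
                ops.push(c);
            } else if (c == ')') {
                while (ops.peek() != '(') apply(eval, ops.pop());
                ops.pop();                          // discard the '('
            } else {
                // pop operators of equal or higher precedence before pushing
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    apply(eval, ops.pop());
                }
                ops.push(c);
            }
        }
        while (!ops.isEmpty()) apply(eval, ops.pop());
        return eval.pop();
    }

    public static void main(String[] args) {
        // 34 + 4*5 - (3 + 4) * 8  =>  34 + 20 - 56  =>  -2
        String[] expr = {"34", "+", "4", "*", "5", "-", "(", "3", "+", "4", ")", "*", "8"};
        System.out.println(evaluate(expr));
    }
}

Swapping the Integer eval stack for a stack of tree nodes is all it takes to produce a parse tree instead of a number.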

What took me four hours to remember–and why I’m upset at myself today–is that this algorithm is not just good for solving math equations, but is useful for any sort of notation which is expressed as a series of prefix, infix and postfix operators.

Such as regular expressions.

When you get right down to it, a regular expression is a series of values (letters or characters you’re matching against), with a series of implicit and explicit operators.

‘Concatenation’ is an implicit operator here. Using # to represent the concatenation operator, it’s easy to see that /ABC/ (which matches the substring “ABC” in DABCDEF) can be represented as ‘A’ (i.e. “match letter ‘A’”) # ‘B’ # ‘C’.

The ‘|’ (or ‘or’ operator), which for the expression /ABC|DEF/ matches either “ABC” or “DEF”, is an explicit infix operator: the regular expression /ABC|DEF/ can be represented by the parse tree (in LISP notation): (| (# A B C) (# D E F)).

And the ‘*’ (match 0 or more), ‘+’ (match 1 or more) and ‘?’ (match 0 or 1) are all postfix operators with higher precedence than the concatenation operator: /ABC*/ matches the string “ABCCCCCCCC” or “AB” and everything in between.

The point is that all of this (including the ‘[…]’ character group, which is simply a single value object that matches multiple letters, and the ‘(…)’ grouping, which operates the same way as parentheses in an algebraic statement) can be parsed using the shunting yard algorithm.
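
One small piece of that, sketched in Java (a hypothetical helper of my own, not code from the paper): before running the shunting yard algorithm over a regular expression, the implicit concatenation operator has to be made explicit, here written as ‘#’.

// Rewrite a regex so the implicit concatenation operator is explicit ('#'),
// after which the same shunting yard machinery used for arithmetic can parse
// it. Only literals, '|', '*', '+', '?', and parentheses are considered here.
public class RegexConcat {
    static String insertConcatenation(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            out.append(c);
            if (i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                // concatenation happens between something that can end a value
                // and something that can start one
                boolean endsValue = c != '(' && c != '|';
                boolean startsValue = next != ')' && next != '|'
                        && next != '*' && next != '+' && next != '?';
                if (endsValue && startsValue) out.append('#');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertConcatenation("ABC|DEF"));     // A#B#C|D#E#F
        System.out.println(insertConcatenation("AB(C|D)*E"));   // A#B#(C|D)*#E
    }
}

The shunting yard machinery can then treat ‘#’ as an infix operator that binds tighter than ‘|’ but looser than ‘*’, ‘+’ and ‘?’.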


As an aside, what’s really spifilicious about the regular expression algorithm in the paper is the observation that a state doesn’t have to be a single integer value, but can actually be an array of values, with each value representing a specific action to carry out–and a single action can spawn two or more values representing two or more actions to carry out at the same time.

For example, if you have the regular expression ‘a*a’, essentially meaning “match 0 or more ‘a’ characters, followed by a single ‘a’ character,” and you’re matching against “aaa”, then is the second ‘a’ in that string part of the ‘a*’ (match 0 or more ‘a’ characters), or part of the terminal ‘a’ (match a single ‘a’) character? In the case of the algorithm outlined in the regular expression paper at the top of this article, the answer can be ‘both!’ In other words, you can maintain a state which is a vector of values, and if we represent our regular expression ‘a*a’ with ‘a*’ as 1, ‘a’ as 2, and ‘done’ as 3, we could have our state go from {1} (read first ‘a’) to {1,2} (read second ‘a’) to {1,3,2} (read third ‘a’) to {1,3,3,2} (read end of string) to {3,3,3}: that is, we can simultaneously be doing multiple things at once.

This, by the way, is the point of the paper: by maintaining multiple “states” at once, or rather, by representing our current match state as a vector instead of by backtracking (that is, rewinding the matcher so we can try an alternate matching strategy), we can solve degenerate matching cases without having the program grind to a halt.
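
Here is a toy Java sketch of that idea for the specific pattern ‘a*a’ (my own simplification of the technique, not the paper’s code): the current match state is a set of states, every surviving alternative advances on each input character, and nothing ever backtracks.

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of "hold several states at once" for the pattern 'a*a':
// state 0 means "still inside a*", state 1 means "consumed the final 'a'"
// (the accepting state). A set of states replaces backtracking.
public class MultiStateSketch {
    public static boolean matchesAStarA(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);                       // start: inside a*

        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                if (state == 0 && c == 'a') {
                    next.add(0);              // the 'a' extends a* ...
                    next.add(1);              // ... or it is the final 'a'
                }
                // state 1 has no outgoing transitions: extra input kills it
            }
            current = next;                   // advance every alternative at once
        }
        return current.contains(1);           // accept if some alternative finished
    }

    public static void main(String[] args) {
        System.out.println(matchesAStarA("aaa"));  // true
        System.out.println(matchesAStarA(""));     // false (needs at least one 'a')
    }
}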


I’m blogging this so I can remember. And what I want to remember is this: the shunting yard algorithm and the state machine are cornerstone design patterns nearly a half century old. For a regular expression, both can be brought to bear on the problem: the shunting yard algorithm can be used to build a parse tree from the regular expression, and the parse tree can then be used to build an optimal state machine for matching strings against it.

And more importantly, they are two forgotten tools in a very old toolbox which are a hell of a lot more powerful than the ‘singleton design pattern’, which, to be honest, barely deserves to be named outside of the halls where computer scientists engage in taxonomy for the purpose of obtaining tenure.