Vector processor

A vector processor (or array processor) is a CPU design that is able to run mathematical operations on a large number of data elements very quickly. The name contrasts with a "scalar processor", a CPU that operates on one data item at a time; scalar designs make up the vast majority of CPUs.

In general terms, CPUs are able to deal with one or two pieces of data at a time, and manipulate them. For instance, every CPU has an instruction that essentially says "add A to B and put the result in C", and the vast majority also have "multiply A by B and put the result in C".

The data for A, B and C is, in theory at least, encoded directly into the instruction. However, things are never that simple. In practice the data is rarely presented by itself, but is most often "pointed to" by passing in the address of a memory location that holds the data. Decoding this address and getting the data out of memory takes some time.

In order to reduce the amount of time this takes, most modern CPUs use a technique known as pipelining, in which the instructions pass through several sub-units in turn. The first reads the addresses and decodes them, the next gets the values, and the next does the math. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in assembly-line fashion. Any particular instruction takes the same amount of time to complete (the latency), but the CPU can process the entire batch much faster than if it handled the instructions one at a time.
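The difference between latency and batch time can be sketched with a little arithmetic. The following is a minimal model, assuming an idealized pipeline in which every stage takes one cycle and nothing ever stalls; the function names and the three-stage figure are illustrative assumptions, not a description of any particular CPU.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """The first instruction fills the pipeline; after that, one completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    stages = 3  # assumed stages: decode addresses, fetch values, do the math
    for n in (1, 10, 1000):
        print(n, unpipelined_cycles(n, stages), pipelined_cycles(n, stages))
```

In this model a single instruction still needs all three cycles, which is the unchanged latency, while a long batch approaches one completed instruction per cycle.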

Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there", although in reality they are handed a starting address and a number of elements.

From that point the CPU can process this array (or vector) of data much faster. Instead of constantly having to decode addresses and wait for the results, it "knows" that the next address will be one larger than the last. This allows for significant savings in decoding time. In addition, the operation in question is always the same (only the data changes), so it is quite common to have several math units working in parallel, each taking a chunk of the problem.
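The idea of handing the processor a starting address and a number of elements can be sketched as follows. This is only a software model under stated assumptions: memory is represented as a flat Python list, and the name vector_add and its parameters are hypothetical, chosen for illustration.

```python
def vector_add(memory, src_a, src_b, dest, count):
    """Model of a single vector instruction.

    Adds `count` consecutive elements starting at address src_a to the
    `count` consecutive elements starting at src_b, storing the sums at dest.
    Only the three starting addresses are supplied; every later address is
    simply the previous one plus one.
    """
    for offset in range(count):
        memory[dest + offset] = memory[src_a + offset] + memory[src_b + offset]


if __name__ == "__main__":
    # Addresses 0-9 hold the first group, 10-19 the second, 20-29 the results.
    memory = list(range(10)) + list(range(100, 110)) + [0] * 10
    vector_add(memory, src_a=0, src_b=10, dest=20, count=10)
    print(memory[20:30])
```

The loop inside the function stands in for what the hardware does on its own; the program itself issues only the one call.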

To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a "loop" that picks up each pair of numbers in turn and adds them. To the CPU, this would look something like this:

    get this number
    get that number
    add them
    put the result here
    get this number
    get that number

etc. Each of these instructions has to flow through the CPU's pipeline before completing, so the entire operation is sped up only by a small amount.

But to a vector processor, this task looks considerably different:

get the 10 numbers here and add them to the numbers there

Although this instruction might take longer to complete than a single addition on the general-purpose CPU, you can see that the overall complexity of the problem is greatly reduced. Not only can the vector processor skip all of those address decodes, it also has only a single command to decode.
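The two views of the task above can be compared by counting the instructions the CPU has to decode. The sketch below is a rough model: the instruction names (load, add, store, vadd) are hypothetical, and the scalar count ignores the extra instructions a real loop needs for its counter and branch.

```python
def scalar_trace(count):
    """Instructions a scalar CPU decodes to add two groups of `count` numbers."""
    trace = []
    for i in range(count):
        trace.append(("load", f"a[{i}]"))   # get this number
        trace.append(("load", f"b[{i}]"))   # get that number
        trace.append(("add",))              # add them
        trace.append(("store", f"c[{i}]"))  # put the result here
    return trace


def vector_trace(count):
    """The same task as one vector instruction: start addresses plus a count."""
    return [("vadd", "a", "b", "c", count)]


if __name__ == "__main__":
    print(len(scalar_trace(10)), "scalar instructions")
    print(len(vector_trace(10)), "vector instruction")
```

For 10 pairs of numbers the scalar version decodes four instructions per pair, while the vector version decodes one instruction in total.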

But more than that, the same CPU likely has some form of superscalar implementation, meaning there isn't one part of the CPU adding up those 10 numbers, but perhaps two or four of them. Since the output of each addition does not rely on the result of any other, two such parts can each add 5 of the numbers and complete the whole operation in half the time.
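Dividing the work between several math units can be sketched as below. This is a sequential software model, not real parallel hardware: the function names are hypothetical, both inputs are assumed to have the same length, and the "time" is simply the size of the largest chunk any one unit receives.

```python
def split_across_units(count, num_units):
    """Return one (start, stop) range per unit, together covering 0..count."""
    base, extra = divmod(count, num_units)
    ranges = []
    start = 0
    for unit in range(num_units):
        size = base + (1 if unit < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def steps_needed(count, num_units):
    """Additions performed by the busiest unit, assuming all units run at once."""
    return max(stop - start for start, stop in split_across_units(count, num_units))


def parallel_add(a, b, num_units):
    """Add a and b element by element, one chunk per unit (run sequentially here)."""
    result = [0] * len(a)
    for start, stop in split_across_units(len(a), num_units):
        # No addition depends on any other, so the chunks could run in any order.
        for i in range(start, stop):
            result[i] = a[i] + b[i]
    return result


if __name__ == "__main__":
    print(split_across_units(10, 2), steps_needed(10, 2))
    print(parallel_add(list(range(10)), [1] * 10, 2))
```

With two units each handles 5 of the 10 additions, so the busiest unit does half the work a single unit would.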

Not all problems can be attacked with this sort of solution. Adding these sorts of instructions adds complexity to the core CPU, which typically suffers in more mundane parts of its performance (i.e., anything other than adding up 10 numbers in a row) as a result. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of more common instructions like "if".

In fact, vector processors work best only when there are large amounts of data to work on. This is why these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were found in places like weather prediction centers and physics labs, where huge amounts of exactly this kind of data are "crunched".

The first successful implementations of a vector processor appear to be the CDC Cyber 200 and the Texas Instruments Advanced Scientific Computer. The Cyber machine was otherwise slower than CDC's own supercomputers like the CDC 7600 (and much smaller too), but at these data-related tasks it could be quite a bit faster. However, the machine also took considerable time decoding the vector instruction and getting ready to run the process, so in practice it typically performed quite poorly.

The technique was only fully exploited in the famous Cray-1. Instead of leaving the data in memory, the Cray design had eight "vector registers" which held 64 64-bit words each. The instructions were applied between registers, which is much faster than talking to main memory. In addition, the design had completely separate pipelines for different instructions, allowing a batch of vector instructions themselves to be pipelined, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 Mflops, but with up to three chains running it could peak at 240 Mflops, a respectable number even today.
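The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; treating the peak rate as the base rate multiplied by the number of running chains is a simplifying assumption that happens to match the 80 and 240 Mflops figures, and the names are hypothetical.

```python
VECTOR_REGISTERS = 8     # vector registers in the Cray-1, per the text
WORDS_PER_REGISTER = 64  # words held by each vector register
BITS_PER_WORD = 64       # width of each word


def vector_register_bytes():
    """Total storage in the vector registers, in bytes."""
    total_bits = VECTOR_REGISTERS * WORDS_PER_REGISTER * BITS_PER_WORD
    return total_bits // 8


def peak_mflops(base_mflops, chains):
    """Assumed model: each running chain contributes the base rate."""
    return base_mflops * chains


if __name__ == "__main__":
    print(vector_register_bytes(), "bytes of vector register storage")
    print(peak_mflops(80, 3), "Mflops with three chains")
```

Under these assumptions the eight vector registers together hold 4096 bytes, and three chains at 80 Mflops each give the quoted 240 Mflops peak.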

Other examples followed. CDC tried once again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. Various Japanese companies (Fujitsu, Hitachi and NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. However, Cray continued to be the performance leader, continually besting the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then the supercomputer market has focussed much more on massively parallel processing rather than better implementations of vector processors.

Today the average computer at home crunches as much data watching a short QuickTime video as did all of the supercomputers up to the 1970s. Vector processor elements have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations the vector processor runs beside the main scalar CPU, and is fed data by programs that know it is there.

The NEC SX-6 supercomputer architecture is a NUMA architecture built out of SMP machines with 8 vector processors each.