The Vector Programming Language
Vector was our team’s final project for COMS 4115: Programming Languages and Translators. Vector is a high-level language for general-purpose GPU computing. Over the course of the semester, our team developed a compiler that compiled Vector into CUDA, which could then be compiled to run on Nvidia GPUs. My teammates were Zachary Newman, Sidharth Shanker, Jonathan Yu, and Harry Lee.
General-purpose computing on the GPU (GPGPU) has become more popular lately with the introduction of CUDA and OpenCL. They have been shown to perform quite well at compute-heavy tasks, such as running scientific simulations. However, writing GPU programs in CUDA or OpenCL require a lot of boilerplate, such as choosing the number of blocks and threads to launch and copying memory to and from the GPU. The reason for this low level of abstraction is that it allows programs to be tuned and optimized for specific GPU hardware. This fine control is necessary for high-performance computing, where it’s important to squeeze out every last drop of performance, but it can be pretty daunting for programmers or researchers investigating GPGPU for the first time. At any rate, even an unoptimized GPU program will handily beat out equivalent CPU implementations at certain tasks. For this reason, we developed Vector to offer a higher level of abstraction than CUDA or OpenCL so that less experienced programmers could quickly write a parallel program and witness the power of the GPU.
Vector provides the following features over CUDA/OpenCL
- Automatic memory management
- Arrays in vector are reference-counted
- Arrays are allocated both on CPU and GPU
- Memory is automatically copied between CPU and GPU when needed
- Run code on GPU without having to define a separate kernel function
- Automatically detects what variables need to be passed in
- Number of blocks and threads/block chosen automatically
- Higher-order functions
- Specifically map and reduce
- Pass in a
__device__function and an array. Kernel will be generated for you.
Vector uses a C-like syntax. So the following is a valid Vector program.
Note that unlike in C, we did not need to include “stdio.h”, and the main
function is named
vec_main instead of main. In this post, I will focus on
the language features that make Vector different from C.
Vector has rudimentary support for type inference. For instance,
i as an int and assigns 15 to it, without having to explicitly
i as an integer. If you’ve ever used Go before,
this syntax should be familiar.
One limitation is that Vector does not do automatic type promotion, so the following will not work.
Instead, you’d have to do the following
Instead of needing explicit init, check, and increment statements like for loops in C, for loops in Vector uses ranges. So the following Vector…
…is equivalent to the following C (well actually, C++).
Everything except the middle argument in a range and the colon preceding is optional, so the following ranges are equivalent.
0:5:1 0:5 :5:1 :5
Vector arrays are like C arrays. However, Vector arrays are multi-dimensional.
Vector arrays are reference counted, so you can safely pass them into and out of functions without having to worry about freeing them.
The built-in function
len(arr) gives the size of the array
arr. If it is a
multi-dimensional array, you can give
len an integer as the second argument
to find the length along that dimension.
You can also use
for loops to iterate over the elements of an array in a
for-each type syntax.
Note that array declarations can also be type-inferenced.
Parallel for (pfor)
Our first GPU construct. the syntax of
pfor is identical to the range
The difference is that
pfor actually runs as a GPU kernel, and each iteration
of the “loop” is run in parallel in a different GPU thread. Arrays can be
read and written to on the GPU. So, for instance, if you wanted to take the
element-wise production of two arrays, you could do the following.
Each scalar multiplication happens in a separate GPU thread.
Map and Reduce
Vector provides two higher-order functions which run on the GPU. You denote
a higher-order function by pre-pending an
map function takes a function of one argument
f and an array
produces an array
b in which
b[i] = f(a[i]). So, for instance, to square
each element of an array, we can do the following.
The function passed into
@map must be declared with the
so that it is compiled for the GPU.
reduce function takes a function of two arguments and an array and applies
the function to pairs of elements in the array, then pairs of the results of
the first pass, and so on until there is a single result. This makes it useful
for, say, taking the sum of an array.
The function passed to reduce must be commutative and associative, otherwise the result of the computation is not likely to be what you would expect.
Here’s a demonstration of what Vector can do. The following is a Mandelbrot set generator that runs on the GPU.
It outputs the image as a portable graymap file on standard output. Here is what the result looks like mapping 0 to black and 255 to blue.
I wrote a similar program in C to run on the CPU.
To see how the GPU performed compared to the CPU, I ran both versions with successively larger images. Here is how the runtime of the programs scaled with image area.
The benchmarks were performed on an Amazon EC2 G2 instance, which has a Quadro K5000 GPU and 2.6 GHz Xeon CPU. These benchmarks are obviously kind of apples and oranges, since we’re benchmarking two different pieces of hardware, but suffice to say that on a decent gaming PC, the GPU could provide a speedup.
You can see the code for the benchmarks on Github.
Update: Some commenters on HN pointed out that the previous benchmark results in this article were incorrect. I have updated the post with new benchmark results.
- Error reporting is more-or-less non-existent.
- No synchronization inside
pforstatements. You will have to synchronize by breaking up your program into multiple
- No way to do code separation.
- Difficult to access the large CUDA standard library. There is a built-in
inlinefunction which allows you to insert inline CUDA code into Vector, but you’d really have to be a compiler developer to understand how to use it.
What future development? It’s a class project! What I actually hope to do is bake some of the features of Vector into a larger, more feature-rich programming language. The Julia language looks pretty attractive for this, given its macro system and focus on scientific computing. I’d also like to compile to OpenCL instead of CUDA, since that would support a larger variety of hardware.