Over the past few days Manuel Serrano has been helping me improve the speed of my n-body integrator by changing the behavior of Bigloo's threading when run in single-threaded mode on OS X. There is a function in runtime/Clib/cthread.c called denv_get which, I assume, gets some representation of the current dynamic environment. Since the cthread.c functions are really stubs for a more complete thread implementation, denv_get is really short: it compiles to about 5 instructions (including prelude and postlude) on my machine. Unfortunately, in the 18Aug beta of 2.7a, denv_get was installed as the function to get the dynamic environment through a function pointer, which prevented the C compiler from inlining it. My n-body code was spending upwards of 50% of its time in denv_get (remember: 5 instructions!). Manuel had me insert #define BGL_DYNAMIC_ENV_ALTERNATE 1 in runtime/Include/bigloo.h from the newest beta of 2.7a (12Sept). This apparently allowed for inlining of denv_get in single-threaded mode, and presto: faster by a factor of two.
I'm mentioning this for two reasons: 1. Manuel is really cool to take so much time to help me with my performance issue. 2. If you use OS X and Bigloo 2.7a beta, you should definitely get the 12Sept version and insert #define BGL_DYNAMIC_ENV_ALTERNATE 1 into runtime/Include/bigloo.h!
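To make the function-pointer issue above concrete, here is a minimal C sketch. It is not Bigloo's code: env_t, get_env_direct, get_env_hook, and the two bump functions are hypothetical names invented for illustration, and I'm only assuming the real denv_get setup looks roughly like this. It shows the same tiny accessor reached two ways, once through a global function pointer (which the compiler generally has to treat as a real call) and once directly (which it is free to inline).

```c
/* Hypothetical stand-in for a dynamic environment; not Bigloo's real type. */
typedef struct env {
    long counter;
} env_t;

static env_t single_thread_env = { 0 };

/* Tiny accessor, analogous in spirit to denv_get: just returns an address. */
static env_t *get_env_direct(void) {
    return &single_thread_env;
}

/* The accessor installed behind a global function pointer. Because the
   pointer could be reassigned from another translation unit, the compiler
   generally cannot inline calls made through it. */
env_t *(*get_env_hook)(void) = get_env_direct;

/* Each loop iteration makes an indirect call through the pointer. */
long bump_via_pointer(long n) {
    single_thread_env.counter = 0;
    for (long i = 0; i < n; i++) {
        get_env_hook()->counter++;
    }
    return get_env_hook()->counter;
}

/* Each loop iteration calls the static accessor directly, so an optimizing
   compiler can inline it down to a plain memory access. */
long bump_direct(long n) {
    single_thread_env.counter = 0;
    for (long i = 0; i < n; i++) {
        get_env_direct()->counter++;
    }
    return get_env_direct()->counter;
}
```

Both functions compute the same result; the only difference is how the accessor is reached. Whether the direct version actually gets inlined depends on the compiler and optimization level, so it is worth comparing the disassembly of the two loops before trusting any timing.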
In other news, Manuel has used my patch fixing a problem with the build script which was preventing the building of the compiler with shared libraries even when the --sharedcompiler=yes option was provided to configure. This makes it much simpler to build a bigloo linked against dynamic libraries on OS X. All you have to do is make sure to provide --sharedcompiler=yes (and, if you like, --sharedbde=yes) to configure. Then edit the Makefile.config, changing the following variables:
- LDFLAGS=-dynamiclib -single_module
Finally, I'd just like to put in a plug for SHARK, Apple's profiler that comes with OS X. You can profile a running process, obtaining a call graph with who-calls-who and elapsed time in each call path, see the memory usage pattern (and page faults), look at the disassembled object code with color-coded hot spots and pipeline stalls, and get helpful advice like using the reciprocal square root estimate assembly instruction rather than the math library sqrt when you don't need super accuracy. I can't imagine the amount of effort it would have been to do the profiling I've done today and yesterday with gprof! Bigloo works quite well with SHARK, since it outputs relatively idiomatic C code (though you might have to de-mangle function names with bigloo-demangle); other compilers like Gambit-C (C used as assembler) or Chicken (C used in CPS) wouldn't give you code that SHARK would help with much.
Eventually I'll get back to Scheme-y things (I haven't written a good macro in days!), but I thought it would be good to get down some of the things I've been doing lately.