I’ve created a simple but hopefully effective heap profiler for Windows C/C++ applications called Heapy.
Heapy requires no modifications to the program being profiled. With a very quick setup it can profile 32- or 64-bit Windows C/C++ applications. Heapy will list the top allocation sites of your application every few seconds – helping you track down memory leaks and giving you better insight into which parts of your program are using memory.
The readme in that zip should contain enough to get you started – there’s more information on the GitHub page and in the rest of this blog post.
If you want to build Heapy yourself you just need to clone it from GitHub and build with Visual Studio 2012 (the Express edition should work).
If we compile the following test application:
// Code for TestApplication.exe
#include <windows.h>
#include <iostream>

void LeakyFunction(){
    malloc(1024*1024*5); // leak 5Mb
}

void NonLeakyFunction(){
    auto p = malloc(1024*1024); // allocate 1Mb
    std::cout << "TestApplication: Sleeping..." << std::endl;
    Sleep(15000);
    free(p); // free the Mb
}

int main()
{
    std::cout << "TestApplication: Creating some leaks..." << std::endl;
    for(int i = 0; i < 5; ++i){
        LeakyFunction();
    }
    NonLeakyFunction();
    std::cout << "TestApplication: Exiting..." << std::endl;
    return 0;
}
We can run Heapy with the command line:
> Heapy TestApplication.exe
Which will generate the following two reports in “Heapy_Profile.txt”:
=======================================
Printing top allocation points.
< Trimmed out very small allocations from std::streams >

Alloc size 1Mb, stack trace:
    NonLeakyFunction e:\sourcedirectory\heapy\testapplication\main.cpp:9 (000000013FEC1D7E)
    main e:\sourcedirectory\heapy\testapplication\main.cpp:22 (000000013FEC1E0D)
    __tmainCRTStartup f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crt0.c:241 (000000013FEC67FC)
    BaseThreadInitThunk (00000000779A652D)
    RtlUserThreadStart (0000000077ADC541)

Alloc size 25Mb, stack trace:
    LeakyFunction e:\sourcedirectory\heapy\testapplication\main.cpp:6 (000000013FEC1D5E)
    main e:\sourcedirectory\heapy\testapplication\main.cpp:20 (000000013FEC1E06)
    __tmainCRTStartup f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crt0.c:241 (000000013FEC67FC)
    BaseThreadInitThunk (00000000779A652D)
    RtlUserThreadStart (0000000077ADC541)

Top 13 allocations: 26.005Mb
Total allocations: 26.005Mb (difference between total and top 13 allocations : 0Mb)

=======================================
Printing top allocation points.
< Trimmed out very small allocations from std::streams >

Alloc size 25Mb, stack trace:
    LeakyFunction e:\sourcedirectory\heapy\testapplication\main.cpp:6 (000000013FEC1D5E)
    main e:\sourcedirectory\heapy\testapplication\main.cpp:20 (000000013FEC1E06)
    __tmainCRTStartup f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crt0.c:241 (000000013FEC67FC)
    BaseThreadInitThunk (00000000779A652D)
    RtlUserThreadStart (0000000077ADC541)

Top 5 allocations: 25.005Mb
Total allocations: 25.005Mb (difference between total and top 5 allocations : 0Mb)
The rest of this post is focused on why and how I constructed Heapy.
Occasionally when developing a piece of software one has a desire to know what parts of a program are using up memory. Sometimes there’s a tricky resource leak, or a need to understand which areas of code legitimately (but perhaps unpredictably) allocate a lot of memory. In Java we have the wonderful VisualVM, which can inspect a dump of the entire heap of an application and do memory profiling as an application runs – I expect similar tools exist for other interpreted or JITted languages. The situation is not as nice for C/C++: you simply can’t walk through the heap, and profiling tools are limited. On Linux we can do pretty nice memory profiling with Gperftools or Valgrind’s Massif. There didn’t seem to be a free, easy-to-use equivalent to Gperftools or Massif for Windows.
I knew that creating a heap profiler for Windows wouldn’t be too tricky so I decided to give it a go myself! Due to its small size I also think Heapy serves as a decent introduction to DLL injection and function hooking, so I’ve used the rest of this blog post to describe it in some detail.
The first decision I made was that the application to be profiled should not have to be modified in order to be profiled. This meant that the only way to go about this would be to inject the profiling code into the application.
After a fair amount of research I settled on DLL (Dynamic-link library) injection and function hooking as the best way to pull this off. DLL injection involves using an “injector” application to “inject” a thread running code from a DLL into a process. Once the DLL code is running it can do anything – it turns out that it’s possible to “hook” functions in a program so that they will call code from our DLL instead of (or in addition to) the original function.
I’ll call our injector application “Heapy.exe” and our injected DLL “HeapyInject.dll”. Here’s a step by step description of how Heapy works:

1. Heapy.exe spawns the target application.
2. Heapy.exe injects HeapyInject.dll into the target process.
3. Once inside, HeapyInject.dll hooks the malloc and free functions in every loaded module.
4. The hooks record each allocation’s size, grouped by its stack trace.
5. A reporting thread periodically writes the top allocation sites to Heapy_Profile.txt.
I expect fairly curious people would not be fully satisfied with the above description.
I was surprised at how easy and “well supported” DLL injection is on Windows. The key things the Win32 API lets us do are create a thread in a different process (using CreateRemoteThread) and allocate and set memory in the virtual address space of a remote process (using VirtualAllocEx and WriteProcessMemory).
To call CreateRemoteThread we have to supply the address of a function which takes a single pointer parameter and returns a DWORD (a.k.a. a THREAD_START_ROUTINE). The magic is that this THREAD_START_ROUTINE is compatible (enough) with the type of the Win32 function LoadLibrary! Piecing all this together we can create a thread in the target process that runs our DLL’s DllMain.
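The sequence can be sketched like this. This is a hedged, Windows-only sketch (not Heapy’s actual code – see Heapy.cpp for that), with most error handling omitted; `injectDll` is a hypothetical name:

```cpp
#include <windows.h>
#include <string>

// Sketch of classic LoadLibrary-based DLL injection. The "remote thread entry
// point" is LoadLibraryA itself; its single argument is the DLL path, which we
// first copy into the target process.
bool injectDll(HANDLE process, const std::string& dllPath) {
    // 1. Allocate memory in the target process for the DLL path string.
    void* remotePath = VirtualAllocEx(process, nullptr, dllPath.size() + 1,
                                      MEM_COMMIT, PAGE_READWRITE);
    if (!remotePath) return false;

    // 2. Write the path into the target's address space.
    if (!WriteProcessMemory(process, remotePath, dllPath.c_str(),
                            dllPath.size() + 1, nullptr)) return false;

    // 3. kernel32.dll is mapped at the same base in every process of the same
    //    bitness, so our local address of LoadLibraryA is valid remotely too.
    auto loadLibrary = reinterpret_cast<LPTHREAD_START_ROUTINE>(
        GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA"));

    // 4. Start a thread in the target that calls LoadLibraryA(remotePath),
    //    which runs the injected DLL's DllMain inside the target process.
    HANDLE thread = CreateRemoteThread(process, nullptr, 0, loadLibrary,
                                       remotePath, 0, nullptr);
    if (!thread) return false;
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    return true;
}
```

Note the bitness caveat in step 3: a 32-bit injector cannot inject into a 64-bit process this way, which is presumably why Heapy ships both 32- and 64-bit builds.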
Take a look at Heapy.cpp for the full process spawning and DLL injection code. There is also a great deal of information about DLL injection elsewhere online.
Function hooking means replacing a function, at runtime, with a different function. To be really useful a function hooking technique needs to provide a way to call the original function. In Heapy the injected thread needs to hook the malloc and free functions in order to profile them. I should say now that hooking malloc and free also catches calls to new and delete (at least in all the target applications compiled with Visual Studio that I tried).
I let MinHook do the function hooking heavy lifting. EasyHook also does hooking and injection – but its hooking of C/C++ functions didn’t seem as good as MinHook’s (I think the core focus of EasyHook is hooking C# or CLR applications, which I’m not yet interested in).
Even with MinHook doing the heavy lifting there is still a little work to be done. I wanted to be able to target any C/C++ application created with pretty much any version of Visual Studio. This means that the malloc and free functions inside my hooking DLL would often not be the same malloc and free actually being used in the target application! Aside: this happens to be one of the main reasons why we can’t blindly use a DLL from one version of Visual Studio in an application built with another – memory allocated by malloc in one DLL must never be freed in another if we want to mix and match compiler versions.
The problem is not insurmountable. We can use the DbgHelp library to enumerate all loaded modules, find the malloc and free functions in each module, and hook them with our profiling malloc and free functions. The gory details are in HeapyInject.cpp.
Once we’ve hooked the allocation functions we need to figure out what information to collect and how to report it. The approach that Gperftools and Massif take is to group all allocations by the call stack at the allocation point. This makes a lot of sense: it’s a nice way to show “where” your program is allocating. If we ignored the full call stack we would probably get useless information such as “90% of your allocations are in some standard container allocation function”. By grouping by stack trace we get more useful information such as: “this chain of function calls allocated hundreds of megabytes in the form of some standard container”.
With this goal in mind, here’s what happens when we hit a hooked malloc in Heapy: we capture the current stack trace, add the allocation size to the running total for that stack trace in a hash-map of allocation points, and record which allocation point the returned pointer belongs to.
When we hit a hooked free call we look up the pointer, subtract its size from the owning allocation point’s total, and forget the pointer.
With the data maintained above we can produce a list of active (that is, not yet freed) allocations at any time to generate reports. With careful use of hash-maps (std::unordered_map) our profiling functions are not too costly. Even with locks for thread safety, the cost of maintaining this information is tiny compared to the cost of capturing the stack traces.
For the reports I went for something very simple: just printing the top 25 allocation points and the amounts allocated every few seconds, and once at application exit. I used the DbgHelp library again to print nice symbols for the stack traces (as long as a .pdb file can be found).
This simple reporting is enough for 90% of use cases. We can see which parts of our application allocate the largest amounts of memory. It lets us catch leaks on exit, and leaks at runtime if they grow large. Extending the reporting to help in particular cases would be very easy. Ideally one day I would like to add a full featured user interface, but for now this simple reporting has proved useful enough.
Well that was a lot of writing about a few hundred lines of code. Hopefully someone will find all of these details interesting! Even if that does not happen I already found Heapy to be a useful tool – perhaps other people will too.
The first step was to extend the expression evaluator of JitCalc to evaluate its function for every pixel of some input images and store the results in an output image. This was achieved by wrapping the generated expression code in a loop over the image.
Since most of the code generation snippets I showed last time were very simple, let’s take a look at how the generate methods differ between JitCalc and Pixslam. Feel free to skip the code and scroll down to the nice images below (there is even a GIF!)
// Generate method from JitCalc.
// Pretty much just the boilerplate for an AsmJit function call and an "eval"
FuncPtrType generate(const Cell &c){
    compiler.newFunc(AsmJit::kX86FuncConvDefault,
        AsmJit::FuncBuilder1<double, const double *>());
    AsmJit::XmmVar retVar = eval(c);
    compiler.ret(retVar);
    compiler.endFunc();
    return reinterpret_cast<FuncPtrType>(compiler.make());
}

// Generate method from Pixslam.
// Deals with image arguments, setting some useful symbols and looping over the image.
// The "AsmJit::XmmVar retVar = eval(c);" remains though!
JitImageFunction::FuncPtrType JitImageFunction::generate(const Cell &c){
    using namespace AsmJit;

    compiler.newFunc(AsmJit::kX86FuncConvDefault,
        AsmJit::FuncBuilder5<void, Arguments, size_t, size_t, size_t, double *>());

    // Bind input array of image pointers to AsmJit vars.
    GpVar pargv = compiler.getGpArg(0);
    for(size_t i = 0; i < argNameToIndex.size(); ++i){
        argv.push_back(compiler.newGpVar());
        compiler.mov(argv.back(), ptr(pargv, i*sizeof(double)));
    }

    // Setup some useful constants.
    zero = compiler.newXmmVar();
    one = compiler.newXmmVar();
    SetXmmVar(compiler, zero, 0.0);
    SetXmmVar(compiler, one, 1.0);

    w = compiler.getGpArg(1);
    h = compiler.getGpArg(2);
    stride = compiler.getGpArg(3);
    out = compiler.getGpArg(4);

    // Convert above into doubles so they can be bound to symbols.
    wd = compiler.newXmmVar();
    hd = compiler.newXmmVar();
    compiler.cvtsi2sd(wd, w);
    compiler.cvtsi2sd(hd, h);
    symbols["w"] = wd;
    symbols["h"] = hd;

    // Prepare loop vars
    n = compiler.newGpVar();
    compiler.mov(n, w);
    compiler.imul(n, h);

    currentIndex = compiler.newGpVar();
    compiler.mov(currentIndex, imm(0));
    currentI = compiler.newGpVar();
    currentJ = compiler.newGpVar();
    compiler.mov(currentI, imm(0));
    compiler.mov(currentJ, imm(0));

    // for i = 0..h
    //   for j = 0..w
    Label startLoop(compiler.newLabel());
    compiler.bind(startLoop);
    {
        compiler.mov(currentIndex, currentI);
        compiler.imul(currentIndex, stride);
        compiler.add(currentIndex, currentJ);

        // im(i,j) = f(x)
        AsmJit::XmmVar retVar = eval(c);
        compiler.movq(ptr(out, currentIndex, kScale8Times), retVar);
    }

    compiler.add(currentJ, imm(1));
    compiler.cmp(currentJ, w);
    compiler.jne(startLoop);

    compiler.mov(currentJ, imm(0));
    compiler.add(currentI, imm(1));
    compiler.cmp(currentI, h);
    compiler.jne(startLoop);

    compiler.endFunc();
    return reinterpret_cast<FuncPtrType>(compiler.make());
}
The results of this are fairly dull, but it is an important stepping stone. With this in place we can do the following:
; Add two images together.
; Command line: pixslam compose.psm lena.png duck.png compose_out.png
((A B)
    ( * 0.5 ; normalise back to [0,1]
        ( + A B)
    )
)
Of course this alone would not allow for much image processing – for that we need to be able to read more than one pixel value from the input images at once. In many image processing algorithms we want to look at a neighbourhood of pixels around the one we are processing. Relative indexing lets us implement these kinds of algorithms neatly.
In Pixslam “relative indexing” works as follows: when processing pixel \((i,j)\), the expression (A y x)
evaluates to pixel \((i+y, j+x)\). Relative indexing was added to the expression evaluator framework simply by having the function handler run a special case when an argument name is used as a symbol. Note we are using row/column indexing – this is actually fairly common when dealing with images (for example, in Matlab and OpenCV). Once we have this we can do some standard image processing – like a simple box blur.
; Normalized 3x3 box filter.
; That is: replace each pixel with the average value of the pixels in 3x3 neighbourhood
((A)
    ( /
        ( +
            (A -1 -1) (A -1 0) (A -1 1)
            (A  0 -1) (A  0 0) (A  0 1)
            (A  1 -1) (A  1 0) (A  1 1)
        )
        9
    )
)
Absolute indexing is important too. It allows us to perform “global” operations like rotating an image. Pixslam provides the following syntax for absolute indexing: (@A i j)
evaluates to the value of the pixel at row \(i\) and column \(j\) in image \(A\). The symbols width and height are bound to the width and height of the image. Again, implementing this was a case of extending the function handler; the special variables \(i\) and \(j\) were dealt with by extending the symbol handler.
; Flip image vertically.
; Demos absolute indexing operator.
((A) (@A (- height i) j))
A more interesting use of absolute indexing is to create images from mathematical expressions on the indices only. In the example below Metaballs are drawn, the input image is only used to specify the size of the result. Have a look at the examples directory to see how this is done.
Now sprinkle in some extra operations: min, max and comparisons. We are now ready to have some fun.
If we allow Pixslam to operate recursively on the same image it is Turing complete. Proof: Here’s Conway’s game of life in Pixslam!
Almost everything above is described in more detail in the project readme. In addition all the images above are automatically generated during the build process – take a look at the examples directory.
In the future I hope to present useful and novel applications of code generation – but for now I just want to demonstrate the simplest thing I think you can call just-in-time (JIT) code generation without blushing.
Skip right to the code on GitHub or continue reading for a full description.
Update: You can see more exciting JIT code generation for image processing in this follow up post.
The key to making native code generation “fun” and simple is using a library to carry some of the load. We want to at least be working with something like assembly – not worrying about generating the actual binary code for the CPU. AsmJit is an excellent C++ library which provides a “run-time assembler” for x86-64 code generation and just enough extra functionality to make code generation a breeze. Here’s a few things AsmJit helps with:
Above the assembler layer AsmJit offers a “Compiler”. The main thing the compiler does is let us allocate as many variables as we want and handle register allocation for us. Using the AsmJit compiler is like writing assembler for a CPU with infinite registers; I find it to be the perfect level at which to work. For the fastest code you may need to drop down to the lower level assembler – but most experiments can be started with the compiler. Check out the examples on the AsmJit wiki for very simple demonstrations of what AsmJit does and the difference between the low level assembler and higher level compiler.
There is not a huge amount written online about AsmJit. Hopefully the example in this blog post will help people get started! I use a recent SVN revision of AsmJit – it looks like the developer has made a fair few changes to the public interface since the last release and intends to make a new release soon.
I thought one of the simplest, just about useful, things that could be made with AsmJit is a mathematical expression evaluator. By that I mean: at run time we can specify a mathematical function – like f(x,y) = x + y*2
and have our program call it with any arguments. Many applications call for such functionality.
Let’s save parsing for another blog post and just use LISP-like S-expressions instead of standard mathematical notation. Here’s a few examples:
(+ x y)              equivalent to: x + y
(+ (/ y 2) (+ x 1))  equivalent to: (y / 2) + (x + 1)
(- (+ y (/ x 2)) 1)  equivalent to: (y + (x / 2)) - 1
I looked at this wonderful blog post as a starting point for parsing and representing S-expressions in C++. With some of the code from there we can parse an S-expression from a string into the following C++ struct:
// S-Expression structure.
struct Cell{
    enum Type {Symbol, Number, List};
    typedef Cell (*proc_type)(const std::vector<Cell> &);
    typedef std::vector<Cell>::const_iterator iter;
    Type type;
    std::string val;
    std::vector<Cell> list;
    Cell(Type type = Symbol) : type(type) {}
    Cell(Type type, const std::string & val) : type(type), val(val) {}
};
We’re going to have to “visit” entire S-Expressions and apply certain operations when we encounter a number, symbol or function call. Here’s a helper class for that:
// Generic templated visitor base class.
template <typename EvalReturn> class Visitor{
public:
    typedef std::map<std::string,
        std::function<EvalReturn (const std::vector<EvalReturn> &)>> FunctionMap;
    typedef std::function<EvalReturn (const std::string &symbol)> SymbolHandler;
    typedef std::function<EvalReturn (const std::string &number)> NumberHandler;

protected:
    FunctionMap functionMap;
    NumberHandler numberHandler;
    SymbolHandler symbolHandler;

public:
    Visitor(){
    }

    EvalReturn eval(const Cell &c){
        switch(c.type){
            case Cell::Number:{
                return numberHandler(c.val.c_str());
            }case Cell::List:{
                std::vector<EvalReturn> evalArgs(c.list.size()-1);

                // eval each argument
                std::transform(c.list.begin()+1, c.list.end(), evalArgs.begin(),
                    [=](const Cell &c) -> EvalReturn{
                        return this->eval(c);
                    }
                );

                if(functionMap.find(c.list[0].val) == functionMap.end())
                    throw std::runtime_error("Could not handle procedure: " + c.list[0].val);

                // call function specified by symbol map with evaled arguments
                return functionMap.at(c.list[0].val)(evalArgs);
            }case Cell::Symbol:{
                if(symbolHandler)
                    return symbolHandler(c.val);
                else
                    throw std::runtime_error("Cannot handle symbol: " + c.val);
            }
        }
        throw std::runtime_error("Should never get here.");
    }
};
From there we can very quickly get to an interpreted calculator. We will do the JIT stuff after we “master” the interpreted approach!
// Interpreted calculator without variables (no symbolHandler!)
class Calculator : public Visitor<double>{
public:
    Calculator(){
        // standard functions
        functionMap["+"] = [](const std::vector<double> &d){return d[0] + d[1];};
        functionMap["-"] = [](const std::vector<double> &d){return d[0] - d[1];};
        functionMap["/"] = [](const std::vector<double> &d){return d[0] / d[1];};
        functionMap["*"] = [](const std::vector<double> &d){return d[0] * d[1];};

        numberHandler = [](const std::string &number){
            return std::atof(number.c_str());
        };
    }
};
Calling the eval method of the above class on an expression (with just numbers, no symbols) will return the evaluation of that expression. In other words, we have a simple calculator.
Calculator().eval("(+ 1 (* 2 3))"); // will return 7!
From there it’s very simple to extend this into a function object which will accept a LISP-like representation of a mathematical function with an arbitrary number of parameters.
// Extend calculator above into function evaluator.
class CalculatorFunction : public Calculator{
private:
    std::map<std::string, int> argNameToIndex;
    Cell cell;
public:
    CalculatorFunction(const std::vector<std::string> &names, const Cell &c) : cell(c){
        for(size_t i = 0; i < names.size(); ++i)
            argNameToIndex[names[i]] = i;
    }

    double operator()(const std::vector<double> &args){
        symbolHandler = [&](const std::string &name) -> double{
            return args[this->argNameToIndex[name]];
        };
        return eval(cell);
    }
};
We can now do the following:
std::vector<std::string> argNames = {"x", "y"};
CalculatorFunction f(argNames, "(+ x y)"); // create the function f(x,y) = x + y

// Call our function with x = 2 and y = 3
std::vector<double> args = {2.0, 3.0};
double z = f(args); // z = f(x,y) = x + y = 2 + 3 = 5!
std::cout << z; // prints 5
Now we have seen how to do things in the interpreted fashion, let’s create a JIT compiled function evaluator! We’ll use the AsmJit compiler to make our lives easy.
// JIT version of CalculatorFunction class.
// Expressions return AsmJit SSE "registers"/variables.
class CodeGenCalculatorFunction : public Visitor<AsmJit::XmmVar>{
private:
    AsmJit::X86Compiler compiler;
    std::map<std::string, int> argNameToIndex;
    typedef double (*FuncPtrType)(const double * args);
    FuncPtrType generatedFunction;
public:
    CodeGenCalculatorFunction(const std::vector<std::string> &names, const Cell &cell){
        using namespace AsmJit;

        // Map operators to assembly instructions
        functionMap["+"] = [&](const std::vector<XmmVar> &args) -> XmmVar{
            compiler.addsd(args[0], args[1]);
            return args[0];
        };

        functionMap["-"] = [&](const std::vector<XmmVar> &args) -> XmmVar{
            compiler.subsd(args[0], args[1]);
            return args[0];
        };

        functionMap["*"] = [&](const std::vector<XmmVar> &args) -> XmmVar{
            compiler.mulsd(args[0], args[1]);
            return args[0];
        };

        functionMap["/"] = [&](const std::vector<XmmVar> &args) -> XmmVar{
            compiler.divsd(args[0], args[1]);
            return args[0];
        };

        // Convert numbers into AsmJit vars.
        numberHandler = [&](const std::string &number) -> XmmVar{
            double x = std::atof(number.c_str());
            XmmVar xVar(compiler.newXmmVar());
            SetXmmVar(compiler, xVar, x);
            return xVar;
        };

        for(size_t i = 0; i < names.size(); ++i)
            argNameToIndex[names[i]] = i;

        symbolHandler = [&](const std::string &name) -> XmmVar{
            // Lookup name in args and return AsmJit variable
            // with the arg loaded in.
            // TODO: this could be more efficient - could
            // create one list of XmmVars and use that.
            GpVar ptr(compiler.getGpArg(0));
            XmmVar v(compiler.newXmmVar());
            int offset = argNameToIndex.at(name)*sizeof(double);
            compiler.movsd(v, Mem(ptr, offset));
            return v;
        };

        generatedFunction = generate(cell);
    }

    FuncPtrType generate(const Cell &c){
        compiler.newFunc(AsmJit::kX86FuncConvDefault,
            AsmJit::FuncBuilder1<double, const double *>());
        AsmJit::XmmVar retVar = eval(c);
        compiler.ret(retVar);
        compiler.endFunc();
        return reinterpret_cast<FuncPtrType>(compiler.make());
    }

    double operator()(const std::vector<double> &args) const {
        return generatedFunction(&args[0]);
    }

    ~CodeGenCalculatorFunction(){
        AsmJit::MemoryManager::getGlobal()->free((void*)generatedFunction);
    }

private:
    void SetXmmVar(AsmJit::X86Compiler &c, AsmJit::XmmVar &v, double d){
        using namespace AsmJit;
        // No immediates for SSE regs/doubles. So put into a general purpose reg
        // and then move into SSE - we could do better than this.
        GpVar gpreg(c.newGpVar());
        uint64_t *i = reinterpret_cast<uint64_t*>(&d);
        c.mov(gpreg, i[0]);
        c.movq(v, gpreg);
        c.unuse(gpreg);
    }
};
That wasn’t so hard. The key thing to notice is that when we “evaluate” now we are simply visiting the expression and returning AsmJit variables. As we evaluate we push the relevant instructions to perform the arithmetic operations on our operands (which are AsmJit variables) into the AsmJit compiler. SSE registers and instructions were used (all that XmmVar stuff) since it turns out to be easier. Observe how AsmJit freed us from having to worry about register allocation. See also how we call the generated code just like any C++ function pointer: generatedFunction(&args[0]).
We can now replace any instance of the CalculatorFunction class with an instance of CodeGenCalculatorFunction. Construction will be slower (it’s generating code!) but evaluation should be much faster. We have turned our S-expression into native code!
Full code can be found on GitHub along with some documentation.
The project contains a simple command-line interface for testing and benchmarking our CalculatorFunction classes. The JIT code runs about 100 times faster than the interpreted equivalent – not bad for less than 100 extra lines!
$ ./jitcalc -benchmark "((x y) (+ (* (+ x 20) y) (/ x (+ y 1))))" 15.5 20
Interpreted output: 710.738
Code gen output: 710.738
Benchmarking...
Duration for 10000000 repeated evaluations:
 - Interpreted: 5732ms
 - JIT: 52ms
Well that’s all for now. I intend to present slightly more useful examples of code generation in the future!
One of my courseworks at university was to implement ant colony optimization for the traveling salesman problem.
My solution was one of the more self-contained and interesting pieces of my work at uni, so I put it up on GitHub.
I was amazed at how much faster this algorithm was compared to the earlier genetic algorithm based approaches I tried during the course. The only non-vanilla part of the implementation is an approximation of Java’s Math.pow function, which sped the whole program up by a decent amount.
// Approximate power function, Math.pow is quite slow and we don't need accuracy.
// See:
// http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
// Important facts:
// - >25 times faster
// - Extreme cases can lead to error of 25% - but usually less.
// - Does not harm results -- not surprising for a stochastic algorithm.
public static double pow(final double a, final double b) {
    final int x = (int) (Double.doubleToLongBits(a) >> 32);
    final int y = (int) (b * (x - 1072632447) + 1072632447);
    return Double.longBitsToDouble(((long) y) << 32);
}
C/C++ projects often compile painfully slowly. A large cause of this problem is “#include” statements. One included header drags in others, which drag in yet more – one combinatorial explosion later you’re left twiddling your thumbs waiting for a project to compile (since opening and compiling hundreds of headers can take a while).
One way to stop headers dragging in too many others is to use forward declarations and dynamic allocation. This allows us to remove some include directives from headers. Removing an unnecessary include from a popular header can really help compile times.
What would be nice is to find “costly” include directives in headers automatically – one could then think about using forward declarations or other refactoring to remove them. First we need a good definition of cost.
We want a definition of the “cost” of an include directive which formalises the number of “file open” operations avoided during compilation of the entire project if we omit that include directive. A formal definition capturing this notion makes things much simpler later on. With a little inspection hopefully you can see that the following does the trick:
Let \(S\) be our set of source files and \(H\) be our set of headers. Let an include graph for \(S\) and \(H\) be a directed graph \(G=(V,E)\) where \(V = S \cup H \) and $$E=\{(u,v) \in V \times V : \text{ file } u \text{ includes file } v \}.$$
An include is an edge in our graph, that is \((u,v) \in E\) (which implies file u has an “#include v” statement.)
Let the set of reachable files from \(w \in V \) be
$$R(G,w)=\{x \in V : \text{ there is a path from } w \text{ to } x \text{ in } G\}.$$
Let there be an include \((u,v) \in E\) and a file \( w \in V \). Let \(G'\) be the include graph \(G\) but with include \((u,v)\) removed. Then the partial cost of include \((u,v)\) w.r.t. file \(w\) is
$$C_p((u,v),w) = \left| R(G,w) \right| - \left| R(G',w) \right|.$$
That is, the partial cost of an include with respect to a file is the number of files no longer reachable from that file if the include is removed.
The cost of include \((u,v)\) is
$$ C(u,v) = \sum_{w \in S} C_p((u,v),w).$$
That is, the cost of an include is the sum of the partial costs of that include over all source files.
Some example costs for a particular include graph should make the definition more clear.
For this include graph:
We get the following include costs:
Cost: 6, from: ("e.h","f.h")
Cost: 5, from: ("src/b.cpp","g.h")
Cost: 4, from: ("src/d.cpp","e.h")
Cost: 4, from: ("src/a.cpp","e.h")
Cost: 4, from: ("f.h","j.h")
Cost: 4, from: ("e.h","g.h")
Cost: 3, from: ("src/c.cpp","e.h")
Cost: 3, from: ("g.h","e.h")
Cost: 2, from: ("g.h","i.h")
Cost: 0, from: ("src/d.cpp","i.h")
Cost: 0, from: ("src/c.cpp","f.h")
Cost: 0, from: ("src/a.cpp","i.h")
Cost: 0, from: ("h.h","j.h")
The above cost output was actually generated with include-wrangler – let’s have a quick look at how.
Here is a short description of the core code of include-wrangler. You can see the full code on the GitHub page.
We can represent an include graph in Haskell with the following datatype:
-- An includes graph is just a map from verts to list of verts.
-- (Use list instead of Set since we want to preserve include order!)
data IncludesGraph v = IGraph (Map.Map v [v])
We want to be able to do a depth first search on a graph to find reachable files.
edgeMap (IGraph em) = em
edgesFrom graph v = (edgeMap graph) Map.! v

-- Depth first search on an includes graph from v.
-- Follows the same "search" order that C++ preprocessor would. Avoids cycles by
-- recording visited list - so assumes every include is guarded by ifdefs/pragma once.
-- Returns a set of vertices of the include graph.
dfs' graph v visited = next follow
    where follow = filter (\v -> not $ Set.member v visited) $ edgesFrom graph v
          descend u = dfs' graph u (Set.insert u visited)
          next [] = visited
          next (u:_) = dfs' graph v $ descend u

dfs graph v = dfs' graph v (Set.fromList [v])
Now we can express the above definitions of the cost of an include statement easily.
-- Remove edge (v,u) from an include graph.
removeEdge (IGraph em) (v,u) = IGraph $ Map.adjust (filter ((/=) u)) v em

-- Remove node v from an include graph.
removeNode (IGraph em) v = IGraph $ Map.map (filter ((/=) v)) $ Map.delete v em

-- The "cost" of an include directive w.r.t. file w.
icost' graph (u,v) w = (Set.size $ dfs graph w) - (Set.size $ dfs (removeEdge graph (u,v)) w)

-- The "cost" of an include directive w.r.t. a list of files s (i.e. list of .cpp files)
icost graph s (u,v) = sum $ map (icost' graph (u,v)) $ s
The rest of the include-wrangler code is plumbing around this core, and the tool has some other features beyond the cost report.
There are probably many other useful pieces of information that could be extracted by analysing the include graph – I may add things as and when I have a need.
Head over to the github page to download include-wrangler and for instructions on how to build and run the application on your own projects.
Tools to do what include-wrangler does already existed, but none of them were quite right for me. The commercial tools I found that could provide similar functionality all had some of the following problems:
I had no luck finding any open source software that would do the job (please let me know in the comments if there is something out there!)
Include-wrangler was created to “scratch an itch”. Once it solved the particular problem I was having I regarded it as “done”, so there are a few rough edges: it doesn’t fail particularly gracefully if a file or directory does not exist (although it tells you enough to fix things), it has no options or fancy user interface, and it does not run as fast as it could. That said, I have found it quite useful on a “real world” large codebase, and so has one of my coworkers.
Here it is. A modern browser will be required.
I’m not sure what it is. It looks nice, but strange.
Technical details:
I used Processing to create it and then Processing.js to put it on a web page. It’s driven by a system of ODEs based on a particle/spring system with oscillating spring lengths and “magnetic walls”. A common fourth-order Runge–Kutta method is used to solve the ODEs.
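A classical fourth-order Runge–Kutta step is only a few lines. Here’s a generic sketch of my own in Python, with a plain harmonic oscillator standing in for the actual particle/spring system:

```python
import math

def rk4_step(f, t, y, h):
    """One fourth-order Runge-Kutta step for the first-order system y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h/2, [yi + h/2*ki for yi, ki in zip(y, k1)])
    k3 = f(t + h/2, [yi + h/2*ki for yi, ki in zip(y, k2)])
    k4 = f(t + h,   [yi + h*ki   for yi, ki in zip(y, k3)])
    return [yi + h/6*(a + 2*b + 2*c + d)
            for yi, (a, b, c, d) in zip(y, zip(k1, k2, k3, k4))]

def oscillator(t, y):
    pos, vel = y
    return [vel, -pos]              # y'' = -y, i.e. a unit spring

# Integrate roughly one full period (2*pi) in 628 steps of h = 0.01.
y, t, h = [1.0, 0.0], 0.0, 0.01
for _ in range(628):
    y = rk4_step(oscillator, t, y, h)
    t += h
```

After one period the state returns very close to the initial condition, which is the accuracy that makes RK4 a sensible default for this kind of simulation.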
On Linux, Matlab R2011b annoyingly requires the fairly old gcc 4.3.4 to compile C/C++ “mex” functions. This is a problem for recent versions of 64-bit Ubuntu. If you attempt to use a more recent version of gcc you get errors like “libstdc++.so.6: version `GLIBCXX_3.4.11′ not found” when trying to run the mex files. This is because Matlab ships its own libc/libstdc++, and only includes versions for older compilers.
The fix is to install gcc 4.3.4 and point Matlab to it by editing mexopts.sh. Annoyingly there is no package for gcc 4.3 on Ubuntu 11.10, so you have to build it yourself. The reason for this blog post is that the build isn’t quite “textbook” and it took me a while to get it to work. Hopefully this guide will save some googling soul the hassle, and I’ll be able to point co-workers to it when they upgrade.
Here are the commands to build and install gcc 4.3.4. The line which sets “LIBRARY_PATH” is what took a little while to track down: recent Ubuntu/Debian releases have reorganised /usr/lib, which unsurprisingly breaks lots of builds. If you don’t set it you’ll get an error saying “/usr/bin/ld: cannot find crti.o: No such file or directory” during the build.
# download gcc 4.3.4
wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.3.4/gcc-4.3.4.tar.bz2
# extract, then make and cd into a separate build dir
tar -xvf gcc-4.3.4.tar.bz2
mkdir gcc-4.3.4-build
cd gcc-4.3.4-build/
# Install build dependencies, this should get all of them
sudo apt-get build-dep gcc-4.5
# the "fix"
export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
# configure, set prefix so install will only touch the /opt/gcc-4.3.4/ directory
../gcc-4.3.4/configure --prefix=/opt/gcc-4.3.4/
# build, specify number of cpu cores after -j
make -j<number of cpu cores>
# install
sudo make install
Now you just need to point Matlab to this version of gcc.
Run “mex -setup” from matlab to ensure you have a mexopts.sh file – it should tell you where it is. Edit the mexopts.sh (mine was in $HOME/.matlab/R2011b/mexopts.sh):
Now you should be able to compile and run Matlab mex files.
EDIT: If you’re not running 64-bit Ubuntu this guide might still work, but you don’t need the “export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu” command before compiling or added to mexopts.sh. (Or maybe you do, but without the “_64” part.)
Before I show the code I’ll have to very briefly introduce the discrete cosine transform (DCT). We should be able to ignore the maths and implementation of the DCT and treat it as a magic box which comes with Matlab or Octave. If you’re interested in the details (and they are interesting) this book is a great place to start if you want more depth than Wikipedia offers.
An audio sample is a sequence of real numbers \( X = \{x_1, \ldots, x_N\} \). The DCT of this sample is the sequence \( DCT(X) = Y = \{y_1, \ldots, y_N \} \) such that
$$ x_n = \sum_{k=1}^{N} y_k w(k) \cos\left( \frac{\pi(2n-1)(k-1)}{2N} \right) $$
where
$$ w(k) = \begin{cases} \frac{1}{\sqrt{N}}, & k=1 \\ \sqrt{\frac{2}{N}}, & \text{otherwise.} \end{cases} $$
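To make the definitions concrete, here’s a small pure-Python sketch of my own (not from the post) that implements the matching forward transform and the reconstruction sum above term by term, and checks that they are inverses:

```python
import math

def w(k, N):
    """The weighting function w(k) from the definition above."""
    return 1/math.sqrt(N) if k == 1 else math.sqrt(2/N)

def dct(x):
    """Orthonormal DCT-II: y_k = w(k) * sum_n x_n cos(pi (2n-1)(k-1) / 2N)."""
    N = len(x)
    return [w(k, N) * sum(x[n-1] * math.cos(math.pi*(2*n - 1)*(k - 1)/(2*N))
                          for n in range(1, N + 1))
            for k in range(1, N + 1)]

def idct(y):
    """The reconstruction sum from the post, computed directly."""
    N = len(y)
    return [sum(y[k-1] * w(k, N) * math.cos(math.pi*(2*n - 1)*(k - 1)/(2*N))
                for k in range(1, N + 1))
            for n in range(1, N + 1)]

x = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, 0.9, -0.4]
assert all(abs(a - b) < 1e-9 for a, b in zip(x, idct(dct(x))))
```

This naive O(N²) version is only for checking the formulas; real implementations use the FFT-based algorithms mentioned below.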
Don’t worry too much about that expression. We just need note that the DCT represents the original signal as a sum of cosines, and that the coefficients specify the amplitude of these cosines.
If we have the DCT coefficients we can transform them back to the original sequence with the inverse discrete cosine transform (IDCT). This could be calculated with the above expression but more efficient algorithms exist for both the DCT and IDCT (these algorithms are based on the fast Fourier transform, which is again an interesting topic that I won’t get into).
So what does this have to do with audio compression? The coefficients of the DCT are amplitudes of cosines that are “within” the original signal. Small coefficients correspond to cosines with small amplitudes, which we are less likely to hear. So instead of storing the original sample we could take the DCT of the sample, discard the small coefficients, and keep the rest. We would store fewer numbers and so compress the audio data.
The decompression algorithm would be simple: we would take the IDCT of whatever we stored and play that back. We will be missing some of the signal, but one of the properties of DCTs is that a few of the larger coefficients account for a large amount of the power in the original signal. Also, the coefficients we discard will usually be from quiet, high-frequency parts of the sound, which we hear less. These are some of the reasons why the DCT is often used in compression.
There are a few details we are missing. When compressing with DCTs you typically compress small slices (windows) of the audio at a time. This is partly so that seeking through the compressed stream is easier, but mostly because we want the coefficients in our window to represent frequencies we can hear (with a large window the majority of the coefficients would represent frequencies well outside the human hearing range).
In addition we need to consider the binary format of the data. We could store the results of the DCT as floating point values, but that would be 32 bits per coefficient – which seems a little high given that .wav files store samples as 16 bit integers. So let’s instead linearly map the range of the DCT coefficients to 16 bit integers and store those instead.
We’ll have to store not just the coefficients but their indices too; let’s store them as 16 bit integers as well. It may seem inefficient to do this since a few bits of our integers will never be used. This is somewhat offset by Matlab/Octave saving files with gzip compression, which will compress those runs of zeros caused by using an overly large integer fairly well. This is a bit of a kludge, but we are keeping things simple, so using non-Matlab data types would be out of the question. After some testing I realised that I could map the actual coefficients to an \(n\) bit range (with \( n < 16 \)), store them in 16 bit integers, and still get a saving in space which was nearly as good as using real \(n\) bit integers!

I think that pretty much covers it! Here's the code for compression:
% Simple DCT compression.
% Works in matlab with signal processing toolbox or octave.
% X : (audio) samples, vector with each element in [-1,1]
% window : window size, length(X) must be divisible by this.
% num_components : number of DCT components to store per window.
% coeff_bits: number of bits to use to store each coefficient.
function result = compress_dct(X, window, num_components, coeff_bits)
    num_win = length(X)/window;
    X = reshape(X, window, num_win); % reshape so each window is a column
    Y = dct(X); % applies dct to each column
    % find top components and their indices
    [a, I] = sort(abs(Y), 'descend');
    I = I(1:num_components, :);
    % build struct
    result.coeffs = int16(zeros(num_components, num_win));
    result.ind = int16(I);
    result.window = window;
    result.coeff_bits = coeff_bits;
    for i = 1:num_win
        % store each coefficient (in [-1,1]) as an integer mapped to range
        % (-2^(coeff_bits-1), 2^(coeff_bits-1))
        result.coeffs(:,i) = int16(Y(I(:,i), i)*2^(coeff_bits-1));
    end
end
Here’s the decompression function:
function X = decompress_dct(data)
    num_win = size(data.coeffs, 2);
    coeffs = double(data.coeffs)/(2^(data.coeff_bits-1)); % Rescale coeffs to [-1,1]
    % Construct full DCT windows from sparse.
    Y = zeros(data.window, num_win);
    for i = 1:num_win
        Y(data.ind(:,i),i) = coeffs(:,i);
    end
    % Inverse DCT each window.
    X = idct(Y);
    % Stitch windows into one long vector.
    X = reshape(X, num_win*data.window, 1);
end
And here’s an example of using the compression and decompression functions from Octave.
window_size = 2048;
n_coeffs_keep = 100;
coeff_n_bits = 10;
% Load wav file, must be mono, number of samples divisible by window size.
X = wavread('bach_clip.wav');
% Compress wav
comp = compress_dct(X, window_size, n_coeffs_keep, coeff_n_bits);
% Save comp structure in a binary format with extra gzip compression
% so we can see how big it really is.
save -binary -z bach.mat comp
% Decompress and write back to wav for comparison.
Xdecomp = decompress_dct(comp);
wavwrite(Xdecomp, 44100, 16, 'bach_decomp.wav');
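For readers without Matlab or Octave, the same pipeline can be sketched in plain Python. This is my own stand-in, not the post’s code: it uses a naive O(N²) DCT (so it is only sensible for tiny windows), the function names are made up, and `int()` truncates where Matlab’s `int16()` rounds.

```python
import math

def w(k, N):
    return 1/math.sqrt(N) if k == 1 else math.sqrt(2/N)

def dct(x):
    """Naive orthonormal DCT-II (same convention as Matlab's dct)."""
    N = len(x)
    return [w(k, N) * sum(x[n-1] * math.cos(math.pi*(2*n - 1)*(k - 1)/(2*N))
                          for n in range(1, N + 1))
            for k in range(1, N + 1)]

def idct(y):
    """Naive inverse DCT."""
    N = len(y)
    return [sum(y[k-1] * w(k, N) * math.cos(math.pi*(2*n - 1)*(k - 1)/(2*N))
                for k in range(1, N + 1))
            for n in range(1, N + 1)]

def compress(x, window, keep, bits):
    """Per window: DCT, keep the `keep` largest coefficients, quantise to `bits` bits."""
    out = []
    for s in range(0, len(x), window):
        y = dct(x[s:s + window])
        top = sorted(range(window), key=lambda k: -abs(y[k]))[:keep]
        out.append([(k, int(y[k] * 2**(bits - 1))) for k in top])
    return out

def decompress(comp, window, bits):
    x = []
    for coeffs in comp:
        y = [0.0] * window
        for k, q in coeffs:
            y[k] = q / 2**(bits - 1)  # rescale the quantised coefficient
        x.extend(idct(y))
    return x
```

On a signal that really is sparse in the DCT domain, the round trip reproduces the input up to quantisation error.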
Here are some examples of a compressed piece of audio at various settings.
Simple audio compression demonstration by Luke Dodd
I realise that SoundCloud will stream in mp3 which could obscure the results, but the compression artefacts are large enough to hear through mp3, and you can download the .wavs through SoundCloud if you want. EDIT: The SoundCloud player seems to be very noisy for some reason – I suggest you click download and listen to the .wavs.
The most annoying artifact, which occurs even in the higher bit rate example, is a slight “clicking” noise. I think this is caused by the windowing – the sample is not forced to be “continuous” over the boundaries of windows, so you hear small clicks on windows where it does not line up. Aside from that, the highest bitrate version is not totally awful to listen to, although even on a fairly poor set of headphones I can hear “garbling”. The cool thing is that even the mid bit-rate streams are fairly intelligible (e.g. you could probably understand speech), which is impressive considering the level of compression achieved and the simplicity of the code. The lowest bit rate stream is really bad, but it’s a good example of what very drastic lossy audio compression sounds like.
In conclusion I think it’s rather impressive how far you can get with lossy audio compression by only using the DCT and some generic lossless compression. A core part of MP3 audio compression is the DCT, but MP3 goes well beyond this to achieve much better results.
Solving sudoku by reduction to CNF-SAT is hardly a new idea – I’m sure a quick google would show various approaches. If the mathematical notation below looks a little scary perhaps try reading the code further down; it really is quite simple if you can read propositional formulas.
A sudoku puzzle is a 9×9 grid of cells, split into 9 non-overlapping 3×3 “boxes”. Some of these will be labelled with a digit from 1 to 9, others will be blank. The aim is to label the remaining cells so that every row, column and box contains the digits 1 through 9.
Perhaps the most well known satisfiability problem is propositional satisfiability: given a formula in propositional logic can we find assignments for all the variables which makes the formula true? A subset of this problem is satisfiability on formulas in conjunctive normal form (CNF-SAT). A formula is in conjunctive normal form if it is a conjunction of clauses, where a clause is a disjunction of literals (a literal is just a variable or its negation), e.g
\[ (a \vee \neg b) \wedge (\neg a \vee c \vee d) \wedge (\neg b \vee \neg d) \] \[ a \wedge (\neg a \vee \neg b) \]
Our aim is, given an unsolved sudoku grid, to identify a set of variables and a CNF formula upon them which is satisfiable if and only if the grid has a solution. Additionally we want the assignment of variables (the model) to unambiguously describe the solution to the puzzle. We’ll proceed by first finding a way to represent the grid with boolean variables, then finding formulas which ensure the grid satisfies the sudoku constraints and matches the given puzzle grid.
Let us begin by identifying a way to describe sudoku grids with a set of boolean variables which would be easy to enforce sudoku constraints upon with a propositional formula. Let our boolean variables be \( \{x_{i,j,k} : i,j,k \in \{1,\ldots,9\} \}\). The interpretation is that if \( x_{i,j,k} \) is true then cell \( (i,j) \) is labelled with k.
Of course this representation does not enforce that every cell has exactly one label! We’ll have to include a term in the reduction formula that ensures this is the case. If a cell has more than one label then it will have some pair of distinct labels. So given a cell \( (i,j) \) we can check it has exactly one label with the CNF formula
\[ l_{i,j} := (x_{i,j,1} \vee \ldots \vee x_{i,j,9}) \wedge \bigwedge_{\substack{k,l \in \{1,\ldots,9\} \\ k \neq l}} (\neg x_{i,j,k} \vee \neg x_{i,j,l}).\]
So to ensure the model describes a labelling we just need the formula
\[ \sigma := \bigwedge_{i,j \in \{ 1,\ldots,9 \}} l_{i,j} \] to be true. Note this is in CNF.
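As a quick sanity check, the exactly-one constraint is easy to write out in Python. This is a sketch of my own, not code from the post; the variable numbering mirrors the kind of bijection used in the Haskell code below.

```python
from itertools import combinations

def var(i, j, k):
    """Encode x_{i,j,k} (i, j, k all in 1..9) as a distinct positive integer."""
    return (k - 1) * 81 + (i - 1) * 9 + j

def exactly_one_label(i, j):
    """Clauses for l_{i,j}: at least one label, and no pair of distinct labels."""
    at_least_one = [[var(i, j, k) for k in range(1, 10)]]
    at_most_one = [[-var(i, j, k), -var(i, j, l)]
                   for k, l in combinations(range(1, 10), 2)]
    return at_least_one + at_most_one

clauses = exactly_one_label(1, 1)
```

The full \( \sigma \) is then just the union of these 37 clauses (one big disjunction plus \( \binom{9}{2} = 36 \) pairwise clauses) over all 81 cells.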
Now we need to include the sudoku constraints: each row, column and box needs to contain the labels 1 through 9. Let a group be a set of 9 cells that must contain all the digits. Let \( \mathcal{G} \) be the set containing all the groups on a sudoku grid, i.e. the groups for the rows, columns and \(3 \times 3\) boxes. We can now express the sudoku constraint with the following CNF formula:
\[ \phi := \bigwedge_{G \in \mathcal{G}} \; \underbrace{\bigwedge_{k \in \{1,\ldots,9\}}}_{\textrm{for each digit}} \underbrace{\left( \bigvee_{(i,j)\in G} x_{i,j,k} \right)}_{\textrm{some cell in the group is labelled with it}} \]
Once again this is a CNF.
Now we just need to include constraints that restrict cells that were labelled in the puzzle grid to keep that label. Let
\[ \mathcal{C}:=\{(i,j,k):\textrm{cell $(i,j)$ has label $k$ in puzzle grid}\}. \]
Then the formula
\[ \psi := \bigwedge_{(i,j,k) \in \mathcal{C}}x_{i,j,k}\]
ensures that the solution matches the original puzzle (again \( \psi \) is in CNF).
So given a puzzle the formula
\[ \sigma \wedge \phi \wedge \psi \] is satisfied if and only if the variables \( \{x_{i,j,k} : i,j,k \in \{1,\ldots,9\} \}\) describe a solution to the puzzle. Cell \( (i,j) \) is labelled \( k \) if and only if \( x_{i,j,k} \) is true.
It’s easy enough to code this up in Haskell. The following code extract constructs the formula from a sudoku grid. A variable in MiniSat is a positive integer, so I had to create a bijection from labelled cells (triples) to integers. A CNF formula in the code below is just a list of lists of integers.
-- cartesian product of a list with itself
cross list = [(x,y) | x <- list, y <- list]

-- Sudoku cells values are represented by triples where the first and second
-- entry specify row and column respectively (zero indexed) and the third
-- specifies labelling (1-9).
-- Define a bijection between cell value triples and natural numbers that will
-- serve as boolean variable names.
cellToVar (i,j,k) = fromIntegral $ (k-1)*81 + i*9 + j + 1
varToCell x = ((i `mod` 81) `div` 9, i `mod` 9, (i `div` 81)+1)
    where i = (fromIntegral x)-1

-- List of clauses that ensures a given cell is labeled with exactly one value.
-- Checks for every pair of labels that the cell is NOT labeled by both
-- and that the cell is labeled with at least one value.
oneLabel (i,j) = atLeastOne : lessThan2
    where notBoth (c1,c2) = [- cellToVar (i,j,c1), - cellToVar (i,j,c2)]
          lessThan2 = map notBoth $ [(i,j) | (i,j) <- cross [1..9], i /= j]
          atLeastOne = map cellToVar [(i,j,k) | k <- [1..9]]

-- List of clauses that ensures every cell has exactly one label.
validLabeling = foldr ((++).oneLabel) [] $ cross [0..8]

-- Definition: A group of cells is a set of cells that must contain
-- one of all the labels. i.e. One of the columns, rows or 3x3 squares.

-- List of the square groups of cells.
squareGroups = [quadrent i j | (i,j) <- cross [0..2]]
    where quadrent x y = [(x*3+i,y*3+j) | (i,j) <- cross [0..2]]

-- List of rows, list of cols.
rows = [[(i,j) | i <- [0..8]] | j <- [0..8]]
cols = [[(i,j) | j <- [0..8]] | i <- [0..8]]

-- Formula that ensures a group of cells contains at least one of all labels [1-9].
groupGood group = foldr ((:).label) [] [1..9]
    where label k = map cellToVar [(i,j,k) | (i,j) <- group ]

-- Formula ensuring a labeling is good.
-- A labeling is "good" if it satisfies the sudoku constraints, that is every
-- square, row and column contains one of each label.
goodLabeling = foldr ((++).groupGood) [] (squareGroups ++ rows ++ cols)

-- Produce a formula for a set of sudoku constraints - filled in cells,
-- for which a model describes a sudoku solution.
sudokuForm cells = validLabeling ++ goodLabeling ++ (map consClause cells)
    where consClause cell = [cellToVar cell]
To see the full code, which includes IO, have a look at the github.
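To double-check the reduction itself, here’s a Python sketch of my own (names are mine, not from the post) that builds the full clause set and verifies that the assignment read off a known solved grid satisfies every clause, without calling a SAT solver at all:

```python
from itertools import combinations

def var(i, j, k):
    """x_{i,j,k} as a positive integer; i, j in 0..8, k in 1..9."""
    return (k - 1) * 81 + i * 9 + j + 1

def sudoku_cnf(givens):
    """CNF for a puzzle as a list of clauses (lists of signed ints).
    `givens` lists (i, j, k) triples for the pre-filled cells."""
    clauses = []
    cells = [(i, j) for i in range(9) for j in range(9)]
    # sigma: every cell carries exactly one label
    for i, j in cells:
        clauses.append([var(i, j, k) for k in range(1, 10)])
        clauses += [[-var(i, j, k), -var(i, j, l)]
                    for k, l in combinations(range(1, 10), 2)]
    # phi: every row, column and box contains every digit
    rows = [[(i, j) for j in range(9)] for i in range(9)]
    cols = [[(i, j) for i in range(9)] for j in range(9)]
    boxes = [[(bi*3 + i, bj*3 + j) for i in range(3) for j in range(3)]
             for bi in range(3) for bj in range(3)]
    for group in rows + cols + boxes:
        for k in range(1, 10):
            clauses.append([var(i, j, k) for i, j in group])
    # psi: the pre-filled cells keep their labels
    clauses += [[var(i, j, k)] for i, j, k in givens]
    return clauses

# A classic construction of a valid solved grid, used as a test model.
grid = [[(3*(r % 3) + r//3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
truth = {var(i, j, grid[i][j]) for i in range(9) for j in range(9)}

def satisfied(clause):
    return any((lit > 0) == (abs(lit) in truth) for lit in clause)

cnf = sudoku_cnf([(0, 0, grid[0][0])])
assert all(satisfied(c) for c in cnf)   # the solved grid is a model of the formula
```

Feeding the same clause list to a DIMACS-speaking solver such as MiniSat is then just a matter of printing one clause per line.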
Considering that we’ve at no point actually thought about how to solve sudokus, this technique works remarkably well. The solver can reliably solve “hard” instances quickly, while simple naïve solvers can take a long time on them. The github repository contains some harder puzzles, some coming from this article.
Of course specialist algorithms will do even better, but we’ve pretty much solved the problem with no effort. In a future post I hope to describe some harder problems I’ve tackled with boolean satisfiability; in those cases there were no existing algorithms to solve the problem, and SAT (but not CNF-SAT) reduction turned out to be the best of all the techniques I tried!