It’s been a while since my last post. I would have probably had more time for extra curricular projects if I didn’t spend so much of my life waiting for C++ applications to compile! In fact I’ve created a tool to help with exactly that…
The problem
C/C++ projects often compile painfully slowly. A large cause of this problem is “#include” statements. One included headers drags in others which drag in yet more – one combinatorial explosion later you’re left twiddling your thumbs waiting a project to compile (since hundreds of headers can take a while to open and compile).
One way to avoid headers dragging into too many other headers with them is to use forward declaration and dynamic allocation. This allows us to remove some include directives from headers. Removing an unnecessary include from a popular header car really help compile times.
Define costly
What would be nice is to find “costly” include directives in headers automatically – one could then think about using forward declarations or other refactoring to remove them. We will first need a good definition of cost.
We want a definition of “cost” of an include directive which formalises the notion of the number of “file open” operations avoided during compilation of the entire project if we omit that include directive. It makes things much simpler later on to create a formal definition capturing this notion. With a little inspection hopefully you can see that this does the trick:
Definitions
Let \(S\) be our set of source files and \(H\) be our set of headers. Let an include graph for \(S\) and \(H\) be a directed graph \(G=(V,E)\) where \(V = S \cup H \) and $$E=\{(u,v) \in V \times V : \text{ file } u \text{ includes file } v \}.$$
An include is an edge in our graph, that is \((u,v) \in E\) (which implies file u has an “#include v” statement.)
Let the set of reachable files from \(w \in V \) be
$$R(G,w)=\{x \in V : \text{ there is a path from } w \text{ to } x \text{ in } G\}.$$
Let there be an include \((u,v) \in E\) and a file \( w \in V \). Let \(G’\) be the include graph \(G\) but with include \((u,v)\) removed. Then the partial cost of include \((u,v)\) w.r.t file \(w\) is
$$C_p((u,v),w) = \left R(G,w) \right – \left R(G’,w) \right.$$
That is the partial cost from an include with respect to a file is the number of files no longer reachable from that file if the include is removed.
The cost of include \((u,v)\) is
$$ C(u,v) = \Sigma_{w \in S} C_p((u,v),w).$$
That is the cost of an include is the sum of the partial costs of that include over all source files.
Example
Some example costs for a particular include graph should make the definition more clear.
For this include graph:
We get the following include costs:
Cost: 6, from: ("e.h","f.h") Cost: 5, from: ("src/b.cpp","g.h") Cost: 4, from: ("src/d.cpp","e.h") Cost: 4, from: ("src/a.cpp","e.h") Cost: 4, from: ("f.h","j.h") Cost: 4, from: ("e.h","g.h") Cost: 3, from: ("src/c.cpp","e.h") Cost: 3, from: ("g.h","e.h") Cost: 2, from: ("g.h","i.h") Cost: 0, from: ("src/d.cpp","i.h") Cost: 0, from: ("src/c.cpp","f.h") Cost: 0, from: ("src/a.cpp","i.h") Cost: 0, from: ("h.h","j.h")
The above cost output was actually generated with incudewrangler – let’s have a quick look at how.
Some Haskell
Here is a short description of the core code of includewrangler. You can see the full code on the github page.
We can represent an include graph in Haskell with the following datatype:
 An includes graph is just a map from verts to list of verts.  (Use list instead of Set since we wat to preserve include order!) data IncludesGraph v = IGraph (Map.Map v [v]) 
We want to be able to do a depth first search on a graph to find reachable files.
edgeMap (IGraph em) = em edgesFrom graph v = (edgeMap graph) Map.! v  Depth first search on an includes graph from v.  Follows the same "search" order that C++ preprocessor would. Avoids cycles by  recording visited list  so assumes every include is guarded by ifdefs/pragma once.  Returns a set of vertices of the include graph. dfs' graph v visited = next follow where follow = filter (\v > not $ Set.member v visited) $ edgesFrom graph v descend u = dfs' graph u (Set.insert u visited) next [] = visited next (u:_) = dfs' graph v $ descend u dfs graph v = dfs' graph v (Set.fromList [v]) 
Now we can express the above definitions of the cost of an include statement easily.
 Remove edge (v,u) from an include graph. removeEdge (IGraph em) (v,u) = IGraph $ Map.adjust (filter ((/=) u)) v em  Remove node b from an include graph. removeNode (IGraph em) v = IGraph $ Map.map (filter ((/=) v)) $ Map.delete v em  The "cost" of an include directive w.r.t file w. icost' graph (u,v) w = (Set.size $ dfs graph w)  (Set.size $ dfs (removeEdge graph (u,v)) w)  The "cost" of an include directive w.r.t. to a list of files s  (i.e. list of .cpp files) icost graph s (u,v) = sum $ map (icost' graph (u,v)) $ s 
The rest of the include wrangler code deals with:
 Reading user input (for source files and include directories).
 Constructing the include graph from source files.
 And finally calculating and outputting the cost of every include directive in the project using the “icost” function.
Other features
Include wrangler has some other features.

It outputs costs for entire header files using a similar definition for “cost” as for includes. That is the cost of a header file is the number of file open operations we can avoid during compilation if we remove that header from the project.
Expensive header files could be a good for candidates for precompiled headers or general refactoring to reduce dependencies.  It outputs the include graph in graphvis format. This means you can produce nice images of your graph like the one above, or perform further analysis on the include graph.
There are probably many other useful pieces of information that could be extracted by analysing the include graph – I may add things as and when I have a need.
How to use it
Head over to the github page to download includewrangler and for instructions on how to build and run the application on your own projects.
Final thoughts
Tools to do what include wrangler does already existed, but none of them were quite right for me. The commercial tools I found which could provide similar functionality all had some of the following problems:
 Expensive.
 Operated only as visual studio plugins.
 Offered a ton of features that would get in the way.
I had no luck finding any open source software that would do the job (please let me know in the comments if there is something out there!)
Includewrangler was created to “scratch an itch”. Once it solved the particular problem I was having I regarded it as “done” so there are a few rough edges: It doesn’t fail particularly gracefully if a file/directory does not exist (although it tells you enough to fix things) has no “options” or fancy user interface and does not run as fast as it could. That said I have found it quite useful on “real world” large codebase, and so has one of my coworkers.