Code generation in .NET with Roslyn (part 1)

28 June 2012
This post was originally published on the Softwire blog

Introduction

This post is the first in a series on code generation on the .NET platform. This is a huge subject area, and there is a bewildering number of libraries and tools that can be used for it. I will make an effort to mention the most important third-party options (particularly where they’ve pre-empted Microsoft’s efforts), and may revisit some of them in future posts. However, for this series I will focus on those libraries and tools that are part of the .NET framework itself, or will be included in future versions. In particular, I’ll explore building code generation tools using Microsoft’s in-development compiler-as-a-service project, code-named “Roslyn”.

In this post I’ll briefly discuss what we mean by code generation, before going through some of the current tools and libraries for code generation in .NET, along with their shortcomings. I’ll also introduce Roslyn and explain why it initially caught my interest as a potentially useful library for code generation. Subsequent posts in this series will cover what I learned about using Roslyn and creating Visual Studio extensions, by looking at a specific WCF-related code generation scenario.

What is Roslyn?

Roslyn is the codename for a Microsoft project to develop a C# compiler implemented in managed (i.e. .NET) code, which exposes an API allowing you to hook into various steps of the compilation process. It is currently available as a CTP, which stands for Community Technology Preview (Microsoft’s terminology for a beta version). This is an opportunity for developers to get a sneak preview of the features that will be available in future releases of .NET & Visual Studio, and for Microsoft to gather early feedback on the direction they’re taking. The initial CTP was released back in October 2011. Since then, Microsoft have been incorporating feedback and adding new functionality, culminating in a second CTP released this month.

What is code generation?

Code generation, at least for the purposes of this series, is any automated process for writing source code. It isn’t what happens at compile-time (which is more to do with making human-readable code machine-readable), nor is it dynamic meta-programming at runtime (e.g. using reflection). Instead, it happens at a stage before either of these, sometimes referred to as design-time. In fact, if you’re a .NET programmer you’ve probably already used code generation in one way or another if you’ve used any of the Visual Studio ‘designer’ tools (such as those for web forms, XML datasets or Entity Framework models). These tools generate source code for you from a higher-level definition, for example generating classes from an XSD schema.

Choosing a code generation process

When performing code generation, you will always have three things: some input, a model for working with that input, and a model for generating your output code. The input for code generation can be almost anything, but is often either XML (e.g. a WSDL service definition or a dataset schema) or some existing .NET types (anything from an individual interface declaration to an entire class library). It’s the latter case that I’m particularly interested in and will be considering for the rest of this section.

Choosing an input model

The first decision is whether you want the input to your code generator to be raw source code, or a compiled assembly. The latter allows you to make use of powerful (and perhaps more familiar) APIs to work with the input, such as the .NET framework’s System.Reflection namespace. The main downside to this approach is that your code generator can’t run until the source types have been compiled, which can make it difficult to fit the required code generation step into your build process. This can be particularly problematic if the types that you’re outputting are closely related to the input types (which will often be the case) and conceptually belong in the same library.
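To make the compiled-assembly option concrete, here is a minimal sketch of reading an input type with System.Reflection. The ICustomerService interface and the DescribeMethods helper are hypothetical names invented for this illustration; in a real generator the type would more likely be loaded from an already-built assembly.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical input type, standing in for a type loaded from a compiled assembly.
public interface ICustomerService
{
    string GetCustomerName(int customerId);
    void DeleteCustomer(int customerId);
}

public static class ReflectionInputExample
{
    // Describes each method on the interface as "ReturnType Name(ParamType paramName, ...)",
    // sorted by method name so the result is stable.
    public static string[] DescribeMethods(Type interfaceType)
    {
        return interfaceType
            .GetMethods()
            .OrderBy(m => m.Name, StringComparer.Ordinal)
            .Select(m => string.Format(
                "{0} {1}({2})",
                m.ReturnType.Name,
                m.Name,
                string.Join(", ", m.GetParameters()
                    .Select(p => p.ParameterType.Name + " " + p.Name))))
            .ToArray();
    }

    public static void Main()
    {
        foreach (string line in DescribeMethods(typeof(ICustomerService)))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that reflection reports CLR type names (Int32, Void) rather than C# keywords, so a real generator would need to map these back before emitting source.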

If you take source code as your input model then the options are rather limited. In some very specific cases it might make sense to work on source code as if it’s plain text, but this is obviously not feasible for anything other than the most trivial input code. Another option is EnvDTE, an assembly-wrapped COM library for Visual Studio automation that allows you to access Visual Studio’s model of the source code in your project. This is a very powerful approach, but COM automation libraries are often a bit painful to work with.

Choosing an output model

The options here are also fairly limited. The simplest and most obvious option is to treat your generated code as just a bunch of text, writing to a StringBuilder or TextWriter. This is a more reasonable approach than in the case of the input model, but it’s still error-prone, and the lack of strong typing can make your code generation logic difficult to maintain. There’s a .NET library called CodeDOM that attempts to address these issues.
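As a rough illustration of the plain-text approach, the sketch below writes a class to a StringBuilder. The GenerateClass helper and its (type, name) pair input are assumptions made up for this example, not part of any library.

```csharp
using System;
using System.Text;

public static class TextOutputExample
{
    // Emits a class with one auto-property per (type, name) pair.
    public static string GenerateClass(string className, string[][] properties)
    {
        var sb = new StringBuilder();
        sb.AppendLine("public class " + className);
        sb.AppendLine("{");
        foreach (string[] property in properties)
        {
            // Nothing checks that these strings form valid C#;
            // a typo only surfaces when the generated output is compiled.
            sb.AppendLine("    public " + property[0] + " " + property[1] + " { get; set; }");
        }
        sb.AppendLine("}");
        return sb.ToString();
    }

    public static void Main()
    {
        Console.Write(GenerateClass("Customer", new[]
        {
            new[] { "string", "Name" },
            new[] { "int", "Age" }
        }));
    }
}
```

Even in an example this small, the indentation, braces and semicolons are all the generator author’s responsibility, which is where the maintenance pain tends to come from.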

CodeDOM

.NET includes the System.CodeDom namespace (usually referred to as CodeDOM) for building and manipulating source code through a strongly-typed Document Object Model (much as you might manipulate an HTML DOM using JavaScript). The CodeDOM API is reasonably discoverable and contains classes to represent everything from whole assemblies (i.e. CodeCompileUnit) down to individual expressions (e.g. CodeObjectCreateExpression, CodeVariableReferenceExpression). It’s also quite well documented on MSDN (see “Using the CodeDOM” for a good starting point). It works well for generating code with lots of structure and not much logic (making it quite suitable for the example discussed above of generating classes for an XSD schema).
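To give a feel for the API, here is a minimal sketch that builds a small class as a CodeDOM graph and renders it as C#. The Customer class, the Generated namespace and the GenerateCustomerClass helper are hypothetical names chosen for illustration. I’m assuming the full .NET Framework here, where these types ship in System.dll; on newer runtimes they come from the System.CodeDom NuGet package.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class CodeDomExample
{
    public static string GenerateCustomerClass()
    {
        var customerClass = new CodeTypeDeclaration("Customer");

        // private string name;
        customerClass.Members.Add(new CodeMemberField(typeof(string), "name")
        {
            Attributes = MemberAttributes.Private
        });

        // public string Name { get { return this.name; } }
        // (Final stops the generator from marking the property virtual.)
        var nameProperty = new CodeMemberProperty
        {
            Name = "Name",
            Type = new CodeTypeReference(typeof(string)),
            Attributes = MemberAttributes.Public | MemberAttributes.Final
        };
        nameProperty.GetStatements.Add(
            new CodeMethodReturnStatement(
                new CodeFieldReferenceExpression(new CodeThisReferenceExpression(), "name")));
        customerClass.Members.Add(nameProperty);

        // Wrap the class in a namespace and a compile unit.
        var generatedNamespace = new CodeNamespace("Generated");
        generatedNamespace.Types.Add(customerClass);
        var unit = new CodeCompileUnit();
        unit.Namespaces.Add(generatedNamespace);

        // Render the language-agnostic graph as C# source.
        using (var provider = new CSharpCodeProvider())
        using (var writer = new StringWriter())
        {
            var options = new CodeGeneratorOptions { BracingStyle = "C" };
            provider.GenerateCodeFromCompileUnit(unit, writer, options);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(GenerateCustomerClass());
    }
}
```

That is around twenty lines of generator code for a single read-only property, which gives a fair impression of how verbose the API is.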

Some drawbacks of CodeDOM are:

It’s extremely verbose (as you might expect) and while quite good for generating structure it can be pretty painful for generating logic
It hasn’t really moved with the times and doesn’t have any way of expressing some newer language features (it hasn’t seen any significant updates since .NET 2.0 and the corresponding versions of C# and VB)
It’s intentionally language agnostic, meaning it doesn’t have a strongly-typed way of expressing some of the more useful language-specific features. Even allowing for that, it has some bizarre omissions like unary operators (e.g. boolean inversion), breaks, continues, switches, while loops etc. (see “Language Features which can’t be expressed using CodeDOM” for a list of common grievances)

Where you need to output something CodeDOM has no way of expressing, you can use a CodeSnippetExpression, which just takes a string that will be output exactly as provided. This obviously ties your CodeDOM model to a specific language, although that is unlikely to be a problem for a given project. However, if you use CodeSnippetExpressions extensively, you lose most of the benefits of CodeDOM over a plain text model.
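As a small sketch of that escape hatch, the example below uses a CodeSnippetExpression to supply a boolean inversion as the condition of an if statement. The isValid variable and the GenerateGuard helper are hypothetical, and the snippet text is only meaningful when the graph is rendered as C#.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class SnippetExample
{
    public static string GenerateGuard()
    {
        // CodeDOM has no unary "not" operator, so the condition is passed through as raw C# text.
        var condition = new CodeSnippetExpression("!isValid");

        // if (<condition>) { throw new System.InvalidOperationException(); }
        var guard = new CodeConditionStatement(
            condition,
            new CodeThrowExceptionStatement(
                new CodeObjectCreateExpression(typeof(InvalidOperationException))));

        using (var provider = new CSharpCodeProvider())
        using (var writer = new StringWriter())
        {
            provider.GenerateCodeFromStatement(guard, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(GenerateGuard());
    }
}
```

A language-neutral alternative is to compare against false with a CodeBinaryOperatorExpression, at the cost of slightly less natural output.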

There are a few third-party projects that attempt to make CodeDOM a bit nicer to work with, by providing more succinct wrappers around the API. The most promising of these is Refly. Expressions to CodeDOM looks interesting too, although I haven’t used it myself.

Introducing Roslyn

We will discuss Roslyn in much more detail in the next post, but I’ll briefly explain why it’s interesting for code generation. As a compiler it obviously includes a parser, and the API allows you to generate a ‘syntax tree’ from your source code (a syntax tree being a strongly-typed structure of objects representing your source code). You can also go in the opposite direction, generating source code from a syntax tree that you have built (either from scratch or, more likely, based on another tree you generated from source code). This is particularly appealing as it allows you to work with the same, strongly-typed, model for both input and output. This is something that isn’t quite possible with any of the options discussed above.
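The round trip described above might look something like the following sketch. One loud caveat: this uses the API as it later shipped in the Microsoft.CodeAnalysis.CSharp NuGet package, and the namespaces and method names in the CTP differ. The GenerateStubFor helper and the "Stub" naming convention are assumptions made up for this illustration.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class RoslynRoundTripExample
{
    // Parses source text, finds the first interface declaration, and emits
    // the skeleton of a class that names that interface as its base type.
    // Members are deliberately left out, so for a non-empty interface the
    // output would still need method bodies before it compiles.
    public static string GenerateStubFor(string source)
    {
        // Input model: source text -> syntax tree.
        SyntaxTree tree = CSharpSyntaxTree.ParseText(source);
        InterfaceDeclarationSyntax firstInterface = tree.GetRoot()
            .DescendantNodes()
            .OfType<InterfaceDeclarationSyntax>()
            .First();

        string interfaceName = firstInterface.Identifier.Text;

        // Output model: build a new syntax node using the same object model.
        ClassDeclarationSyntax stub = SyntaxFactory.ClassDeclaration(interfaceName + "Stub")
            .AddModifiers(SyntaxFactory.Token(SyntaxKind.PublicKeyword))
            .AddBaseListTypes(
                SyntaxFactory.SimpleBaseType(SyntaxFactory.ParseTypeName(interfaceName)));

        // Syntax tree -> source text.
        return stub.NormalizeWhitespace().ToFullString();
    }

    public static void Main()
    {
        Console.WriteLine(GenerateStubFor("public interface ICustomerService { }"));
    }
}
```

The point to notice is that the interface found in the input and the class built for the output are both ordinary syntax nodes, so there is no translation between an input model and a separate output model.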

In the following post, I’ll discuss how I got on with using Roslyn to implement a fairly basic code generation scenario.