Serilog for Microservice Architecture
September 29, 2019

Introduction
Below you will find some valuable practices I introduced at my current employer over the last year. If you already use Serilog and want to improve your logging process, be my guest.
1. Aggregate all the log streams in one system
That may be a no-brainer for experienced developers of distributed systems, but it is still worth mentioning explicitly. You should have one place where all the logs are stored. I have been using Seq, but lots of other products are available.
Having one dedicated system makes it so much easier to extract information from the data. It is also more efficient in terms of computing resources. Last, but not least, it is easier to configure and maintain one system.
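For illustration, here is a minimal sketch of pointing a service at a central server, assuming Seq and the Serilog.Sinks.Seq package (the URL is a placeholder):

using Serilog;

// All services write to the same central log server.
Log.Logger = new LoggerConfiguration()
    .WriteTo.Seq("http://logs.example.internal:5341") // placeholder URL
    .CreateLogger();

Log.Information("Service started");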
2. Apply properties to identify each service
After all log streams are forwarded into one system, you will need a flexible way to filter them. One way to distinguish events from various services is to enrich events with a Service property at the code level.
loggerConfiguration
    .Enrich.FromLogContext()
    .Enrich.WithProperty("Service", "OrderService");
That way, however, we are hardcoding deployment-side configuration. A much more flexible approach is to use server-side property enrichment. Seq can apply properties to events at the time of their arrival; API keys are used to differentiate the source systems.
After a few attempts to structure those properties, I found that a two-level hierarchy is the most useful. The first level is for the Application (for example "Shop"), whereas the second level represents a Service within the application. Of course, general services that live outside any particular application are enriched only with the Service property.
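For illustration, a minimal sketch of the same two-level hierarchy applied in code ("Shop" and "OrderService" are placeholder names; as noted above, applying these server-side via API keys is the more flexible option):

var logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    // First level: the application the service belongs to.
    .Enrich.WithProperty("Application", "Shop")
    // Second level: the service within the application.
    .Enrich.WithProperty("Service", "OrderService")
    .CreateLogger();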
3. Correlate log events related to a request end-to-end
When any kind of request arrives at the edges of your system, it should immediately get a globally unique identifier, which should be attached to all the log events generated in response. The identifier should be persisted through time and space.
A typical scenario that illustrates the lifetime of the correlation id:
- User clicks some button in your web/mobile app.
- The application generates a GUID and attaches it as an HTTP header.
- The backend server recognizes the header and enriches all subsequent log events with it.
- The backend server decides it needs to send an asynchronous message via the service bus. The message metadata is filled with the correlation id.
- Another service receives the message, recognizes the correlation id and again uses .PushProperty to enrich its log events.
- In the end, the service decides to prepare an email to be sent later. The correlation id is persisted alongside, to be used when the time comes to send the email.
This example shows how the correlation id should stay alive for as long as any consequence of the initial request is in progress. Practice shows that a company-wide correlation id is a crucial part of a microservices architecture.
using (LogContext.PushProperty("CorrelationId", correlationId))
{
    await next(context);
}
Note that every system in the chain should expect a correlation id to be passed, but generate its own if none is received. That ensures that as big an area as possible is covered.
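Putting the two together, here is a minimal ASP.NET Core middleware sketch; the "X-Correlation-Id" header name is an assumption, use whatever convention your organization agrees on:

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Serilog.Context;

public class CorrelationIdMiddleware
{
    private readonly RequestDelegate next;

    public CorrelationIdMiddleware(RequestDelegate next) => this.next = next;

    public async Task Invoke(HttpContext context)
    {
        // Reuse the incoming id when present; otherwise generate our own.
        var correlationId = context.Request.Headers.TryGetValue("X-Correlation-Id", out var values)
            ? values.ToString()
            : Guid.NewGuid().ToString();

        using (LogContext.PushProperty("CorrelationId", correlationId))
        {
            await next(context);
        }
    }
}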
4. Install Serilog.Exceptions package
Serilog.Exceptions is a deceptively simple but indispensable tool. Without it, your exceptions are going to be logged using .ToString(), which almost always skips some important pieces of information. The package will destructure your exceptions (including lists, dictionaries, nested objects and so on) and attach a rich representation to your log events.
If you are using Entity Framework Core, there is an additional must-have package, Serilog.Exceptions.EntityFrameworkCore.
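Wiring the enricher up is a one-liner on the logger configuration. A minimal sketch (the console sink is just a stand-in):

using Serilog;
using Serilog.Exceptions;

Log.Logger = new LoggerConfiguration()
    // Attaches a destructured ExceptionDetail property to events that carry an exception.
    .Enrich.WithExceptionDetails()
    .WriteTo.Console()
    .CreateLogger();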
BTW, I am the maintainer of the Serilog.Exceptions package. Visit the GitHub page for details.
5. Install SerilogAnalyzer package
Just do it. It uses the Roslyn code analyzer infrastructure to pinpoint common pitfalls even before the code is compiled. There is no reason to refuse free help :-).
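As an example of the kind of pitfall it catches, the analyzer warns about non-constant message templates like the first line below:

// Flagged: string concatenation instead of a message template.
Log.Information("User " + userName + " logged in");

// Preferred: a constant template with a named property.
Log.Information("User {UserName} logged in", userName);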
6. Use Fatal and Error only for events that need human attention
That point may be controversial, but bear with me for a sec. In Serilog, we have six levels of log events: Fatal, Error, Warning, Information, Debug and Verbose. There are no logging semantics attached to any level; it is your choice as a developer how to treat them. You could log events with exceptions at the Debug level or use the Fatal level to notify that a user has provided an incorrect password.
On the other hand, levels serve us best when some rules govern them, so that the person analyzing the logs can make sense of them. I found that if events related to known and anticipated situations, such as incomplete data in an HTTP request (for example, a user not providing a mandatory last name), are logged at the Error level, that level is no longer an indicator of a real problem. When only unexpected exceptions and errors are logged at the Error level, we can glance at the logs and find the "real problems" with our services.
Some of the expected situations I moved to the Information or Warning levels (a sketch follows the list):
- Transient errors from systems like databases or event buses. They will happen, even more so in a cloud environment.
- User request validation errors. Assuming that your UI guides the user, there is nothing erroneous about this situation.
- Rejected requests from a system under maintenance. Again, they are expected, so there is nothing to do about them.
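For instance, a sketch of treating a transient failure as a Warning (ordersRepository and order are hypothetical names):

try
{
    await ordersRepository.SaveAsync(order);
}
catch (TimeoutException e)
{
    // Anticipated transient failure: Warning keeps the Error level meaningful.
    Log.Warning(e, "Transient timeout while saving order {OrderId}", order.Id);
    throw; // let a retry policy higher up the stack deal with it
}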
7. Configure alerts for errors
The more services your organization has, the harder it is to keep up with and analyze all the logs. Instead of polling your logging server once in a while, I advise you to set up an automated process that monitors a few carefully chosen metrics and alerts the development team when they are exceeded.
Typical metrics to be monitored are "errors/minute", "percentage of failed requests" or "90th percentile of request processing time above a threshold". However, no metric makes sense for everybody, so you will need to select those that work for you.
8. Correlate on various levels
In one of the previous points, we discussed the correlation id related to an external request. This kind of correlation is the most important one, but not the only useful one.
You can correlate log events related to a particular run of a batch process. If the batch process is transforming files, you can then recursively correlate events related to a particular file. ASP.NET Core automatically correlates events that happened within a connection by enriching them with a ConnectionId property.
You will thank yourself for specifying lots of those "correlation scopes". When you have a problem and need to understand what happened based on the logs, they will be your best friend.
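A minimal sketch of nesting such scopes in a file-transforming batch job (the property names are illustrative, not something Serilog prescribes):

using (LogContext.PushProperty("BatchRunId", Guid.NewGuid()))
{
    foreach (var file in inputFiles)
    {
        using (LogContext.PushProperty("FileName", file.Name))
        {
            // Every event emitted here carries both BatchRunId and FileName.
            Log.Information("Transformation started");
            // ... transform the file ...
        }
    }
}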
9. Each state change should emit an event at the Information level
Even the smallest actions matter when you are trying to analyze a particular scenario from the logs. Adopt the rule of leaving at least one log event for each state change.
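For example (the order and status names are hypothetical):

// One Information-level event per state change, with old and new states as properties.
Log.Information("Order {OrderId} changed status from {OldStatus} to {NewStatus}",
    order.Id, oldStatus, newStatus);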
10. Don't forget to .CloseAndFlush
By default, Serilog sends log events asynchronously. That means there is a chance that your application could end before all the events are actually sent over the network. This is especially important for batch applications, which start and stop on a regular basis.
public static async Task Main(string[] args)
{
    try
    {
        var host = hostBuilder.Build();
        await host.RunAsync();
    }
    catch (Exception e)
    {
        Log.Fatal(e, "Unexpected exception caught, shutting down");
        throw;
    }
    finally
    {
        // Flush any events still buffered by the sinks before the process exits.
        Log.CloseAndFlush();
    }
}
Explicitly flushing the buffer is not necessary for non-static loggers that are disposed of when they are done.