Error Handling in Eventide Services

One word best sums up error handling in services: Don’t.

To make the “don’t rescue errors” rule even more precise: Don’t rescue errors whose occurrence isn’t explicitly expected as a matter of course.

The point of errors is that they signal a circumstance so extraordinary and exceptional that there’s no way to recover. The term exception is used interchangeably with error for this reason. In the case of such an exceptional state, with so many unknowns, the best thing to do is to stop doing anything. In other words: terminate the process.

Except in rare cases, there’s no safe way to be certain that an error state can be handled and recovered from. An exception can be caused by network failures, database failures, server hardware failures, or coding mistakes. In addition, an exception that is raised might be a side effect of another exception that was raised earlier, and attempting to remediate the secondary exception may result in reacting to symptoms rather than causes.

In short, when an error is raised, don’t try to get in the way. Just let it crash the service, and let whatever process monitoring tools you use at the operating system level notify operators, or make the decision as to whether to restart the service.

All that said, there are ways to handle errors in Eventide services, and as is the case with concurrency errors, there are indeed some rare circumstances under which errors can or should be handled.

Handling Errors from Consumers

The consumer provides a mechanism to implement a generalized error handler. For more on consumers, see:

Note: The consumer’s error handler shouldn’t be used to record an error using an error reporting or application monitoring tool like AirBrake, RayGun, etc. There is a specific affordance in the component host for recording errors.

There are very few reasons to want to handle an error. However, where intercepting and reacting to errors is absolutely necessary, an error handler can be defined in a consumer class.

When an error is raised in any of the consumer’s handlers, it will be passed to the consumer’s error_raised method, along with the message_data instance that the handler was processing when the error occurred.

class Consumer
  # ...

  def error_raised(error, message_data)
    # Do something with the error
    raise error
  end
end

When the error is re-raised, it will cause the service to suspend the other consumers that it hosts, and then the service will safely terminate.

It should also be noted that if a consumer’s error handler does not explicitly re-raise an error, the error will not be able to terminate the service’s process. Great care should be taken with consumer error handlers to ensure that errors still serve their natural purpose of causing a service to terminate. The chief purpose of handling an error at the level of a consumer is to effect a retry in a generalized way. The specifics of retries are beyond the scope of this article.
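The difference between re-raising and swallowing an error can be seen in a minimal, self-contained sketch. The SketchDispatch module below is a hypothetical stand-in for the way a consumer passes handler errors to error_raised. It is not Eventide’s implementation, and the class names and the failing handler exist only for illustration.

```ruby
# Hypothetical stand-in for a consumer's dispatching (not Eventide's code).
# Any error raised by the handler is passed to error_raised, along with
# the message data that was being processed.
module SketchDispatch
  def dispatch(message_data)
    handle(message_data)
  rescue StandardError => error
    error_raised(error, message_data)
  end

  # A handler that always fails, for the sake of the illustration
  def handle(message_data)
    raise "Something went wrong processing #{message_data}"
  end
end

class ReraisingConsumer
  include SketchDispatch

  def error_raised(error, message_data)
    # Do something with the error, then let it continue on its way
    raise error
  end
end

class SwallowingConsumer
  include SketchDispatch

  def error_raised(error, message_data)
    # The error is not re-raised, so it stops here and the process keeps running
    :swallowed
  end
end

# The re-raised error escapes the consumer and would terminate the process
terminated = begin
  ReraisingConsumer.new.dispatch("some_message")
  false
rescue RuntimeError
  true
end

# The swallowed error never leaves the consumer
swallowed = SwallowingConsumer.new.dispatch("some_message")
```

In this sketch, only the re-raising consumer lets the error escape. The swallowing consumer returns normally, which is exactly the situation that keeps a service running in an unknown state.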

There’s one glaring exception to the “don’t rescue errors” rule: When protecting against undesirable side effects of concurrent writes to the same stream using optimistic locking with a sequence number, it’s expected that the message store will raise a concurrency error (in Eventide’s case, it’s MessageStore::ExpectedVersion::Error).

It’s common practice to handle concurrency errors when using messaging patterns like the Reservation Pattern, when using multiple nodes and retries to provide a hot failover configuration, or when writing to event streams from naturally concurrent environments, like web apps.
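As a rough illustration of rescuing an error that is expected as a matter of course, the sketch below simulates two writers competing for the same stream. The SketchStream class and its write method are hypothetical and are not Eventide’s writer API. The error class is redefined here under the same name only so that the sketch runs without the Eventide libraries loaded.

```ruby
# Stand-in definition so that the sketch runs without the Eventide libraries.
# In a real service, the error class is provided by the message store library.
module MessageStore
  module ExpectedVersion
    Error = Class.new(RuntimeError)
  end
end

# Hypothetical in-memory stream that enforces optimistic locking
class SketchStream
  attr_reader :messages

  def initialize
    @messages = []
  end

  # Position of the last message written, or -1 when the stream is empty
  def version
    messages.length - 1
  end

  def write(message, expected_version:)
    unless expected_version == version
      raise MessageStore::ExpectedVersion::Error,
        "Wrong expected version: #{expected_version} (Stream version: #{version})"
    end

    messages << message
  end
end

stream = SketchStream.new

# Two writers read the same version before either one writes
version_seen_by_first = stream.version
version_seen_by_second = stream.version

stream.write("reserved by first", expected_version: version_seen_by_first)

outcome = begin
  stream.write("reserved by second", expected_version: version_seen_by_second)
  :written
rescue MessageStore::ExpectedVersion::Error
  # Expected as a matter of course: another writer got there first
  :rejected
end
```

Only the specific concurrency error is rescued. Any other error raised by the write would still propagate and terminate the process.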

Recording Errors from the Component Host

The component host provides a mechanism for recording errors, but not handling errors.

For more on the component host, see:

A component host that has received an error is in the process of terminating. The error cannot be intercepted. It can only be recorded.

When an error is raised in any of the host’s consumers, it will be passed to the block given to the host’s record_error method. The error can then be sent to an error reporting tool.

ComponentHost.start(component_name) do |host|
  # ...

  host.record_error do |error|
    # Send the error to an error reporting tool
  end
end

The invocation of the component host’s error recorder is the last thing done by the component host before it terminates.
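The ordering described here, record first and terminate afterward, can be sketched with a small stand-in host. SketchHost is hypothetical and is not Eventide’s ComponentHost. Its register method and the in-memory reported list are assumptions made only so the example is self-contained.

```ruby
# Hypothetical stand-in for a component host (not Eventide's ComponentHost).
# It only illustrates the ordering: record the error, then terminate.
class SketchHost
  def self.start
    host = new
    yield host
    host.run
  end

  def initialize
    @consumers = []
    @recorder = nil
  end

  def register(&consumer)
    @consumers << consumer
  end

  def record_error(&recorder)
    @recorder = recorder
  end

  def run
    @consumers.each(&:call)
  rescue StandardError => error
    # Recording is the last thing done before the error terminates the host
    @recorder&.call(error)
    raise error
  end
end

reported = []

terminated = begin
  SketchHost.start do |host|
    host.register { raise "Consumer failed" }

    host.record_error do |error|
      # A real recorder would send the error to an error reporting tool
      reported << error.message
    end
  end
  false
rescue RuntimeError
  true
end
```

The recorder sees the error, but it has no way to stop the termination: the stand-in host re-raises regardless of what the recorder does.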

The combination of the consumer’s error handler and the component host’s error recorder provides a comprehensive solution for dealing with both unexpected and expected errors. These tools both play a role in the safe operation of services.

It’s critical to take great care when handling errors so that they aren’t prevented from terminating a service. And when it’s absolutely necessary, as is the case when retrying recoverable errors, a consumer’s error handler also contributes to a service’s resiliency and robustness.