If at First You Don’t Succeed: Retries in Eventide Microservices

Allowing a service to terminate is the most common course of action to take when an unexpected error is encountered. The key word here is unexpected.

Some errors are unexpected, and they should cause a service to terminate. But there are some errors that are expected, and they should be handled appropriately—typically by retrying whatever logic was in-progress when the error was raised.

If it’s expected that an error will occur in the course of handling some message, and if it’s expected that the error is recoverable, then care should be taken to implement the handler logic inside a retry mechanism, like Ruby’s own retry keyword, or a library that implements retry logic.

For example, if a handler invokes an HTTP API, then allowances should be made for the inevitability of intermittent and momentary network failures. Any operations that reach outside of the current process and leverage I/O—especially network I/O—are susceptible to these kinds of predictable and recoverable failures.

Because these kinds of failures are often momentary, it’s usually enough to just attempt to make the HTTP request again. If the failure is indeed a transient failure, then the network problem will clear itself up, and the second attempt (or third, etc) at invoking the API will often work.

However, if the API is experiencing some kind of extended outage, then subsequent attempts will also fail. At that point, retrying should be abandoned after some (usually small) number of attempts, and the error that results from the network failure should be allowed to terminate the process.

Since it’s impractical to know with absolute certainty whether a network failure is a momentary glitch or an extended outage, it’s best to tune the retry logic to make a fixed number of attempts, and then abandon retrying and allow the error to be raised so that it can terminate the service.

Basic Example: Retry After an HTTP Error

In the following example, the invocation of the HTTP API will be retried if an HTTPError is raised. A maximum of three attempts will be made, with a delay of 100 milliseconds between the first and second attempts, and a delay of 200 milliseconds between the second and third attempts.

handle SomeMessage do |some_message|
  post_data = PostData.build(some_message)

  Retry.(HTTPError, millisecond_intervals: [100, 200]) do
    SomeHttpAPI.post(post_data)
  end
end

The Retry library that ships with the Eventide stack is used in the example, but it’s not strictly necessary to use this library. There are a number of retry implementations available in the Ruby ecosystem. Ruby’s own retry keyword can be used as well, provided that it’s a good fit for the circumstances.

Be careful to avoid an infinite loop of retries when implementing retry logic using Ruby’s raw begin/rescue/retry building blocks. An infinite loop of retries will be difficult to exit from safely, and may force an operator to brute-force kill a process rather than allowing it to exit gracefully.
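As an illustration, the earlier HTTP example can be approximated with those raw building blocks and an explicit attempt counter. This is only a sketch: the HTTPError, PostData, and SomeHttpAPI names are carried over from the example above, and the attempt limit and sleep-based delays are assumptions made for the sake of illustration.

handle SomeMessage do |some_message|
  post_data = PostData.build(some_message)

  attempts = 0

  begin
    SomeHttpAPI.post(post_data)
  rescue HTTPError
    attempts += 1

    # Give up after the third failed attempt and allow the error to
    # terminate the service
    raise if attempts >= 3

    # Delay 100ms after the first failure, 200ms after the second
    sleep(0.1 * attempts)

    retry
  end
end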

Retrying Concurrency Errors

The most common use of retry logic in handlers is dealing with expected concurrency errors when services are run in parallel in a hot failover configuration.

Because handler logic must be implemented with idempotence and concurrency protections irrespective of whether services are run in parallel, retrying handler logic and reprocessing a message is a safe operation.

Note: A detailed discussion of idempotence and concurrency is beyond the scope of this article.

The following example is a more realistic representation of real handler logic. The handler processes a Deposit command for an Account. It uses the expected version mechanism to protect against concurrent writes to the same stream from two different instances of the handler processing the same input commands at the exact same time. It also implements idempotence protection using the command’s sequence number as an idempotence key.

handle Deposit do |deposit|
  account_id = deposit.account_id
  sequence = deposit.metadata.global_position

  # Retry once if an expected version error is raised by the write
  Retry.(MessageStore::ExpectedVersion::Error) do
    account, version = store.fetch(account_id, include: :version)

    # Idempotence protection using sequence numbers
    unless sequence > account.sequence
      logger.info(tag: :ignored) { "Command ignored (Command: #{deposit.message_type}, Account ID: #{account_id}, Account Sequence: #{account.sequence}, Deposit Sequence: #{sequence})" }
      return
    end

    time = clock.iso8601

    deposited = Deposited.follow(deposit)
    deposited.processed_time = time
    deposited.sequence = sequence

    stream_name = stream_name(account_id)

    # Write with concurrency protection using expected version
    write.(deposited, stream_name, expected_version: version)
  end
end

Again, the implementation illustrated above is only useful when running more than one instance of a service concurrently consuming the exact same input messages. It’s not a solution for horizontal scale parallelization, as that kind of parallelization requires multiple instances of a service to not consume the exact same messages. That kind of parallelization is achieved by partitioning the streams that feed into a service so that each service instance receives different messages.
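As a point of reference only, the following sketch shows what that kind of partitioning might look like when using Eventide’s consumer group options. The consumer class and category name are hypothetical, and the group_member and group_size arguments are assumed to be available on the consumer’s start method, as described in the Eventide consumer documentation.

# Hypothetical: two instances of the same consumer, each assigned a
# disjoint partition of the streams in the account:command category

# First instance (e.g., in one process or on one host)
AccountCommandConsumer.start('account:command', group_member: 0, group_size: 2)

# Second instance (e.g., in another process or on another host)
AccountCommandConsumer.start('account:command', group_member: 1, group_size: 2)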

Retries at the Consumer Level

The error handling mechanism built into consumers offers another option for retries. However, it’s an entirely generalized solution, and doesn’t offer any flexibility to specify varying delay intervals or the specific errors being handled. Doing retries directly in handler logic remains the best way to have precise control over retry behavior.

Nonetheless, it is possible to implement a generalized retry in a consumer’s error_raised method.

class Consumer
  include Consumer::Postgres

  handler SomeHandler
  handler SomeOtherHandler

  def error_raised(error, message_data)
    if error.instance_of?(MessageStore::ExpectedVersion::Error)
      # Retry the message by dispatching it to the consumer's handlers again
      self.(message_data)
    else
      # Re-raise any other error so that it can terminate the process
      raise error
    end
  end
end

It isn’t really practical to use a library like Retry when effecting retries in a consumer, since the original attempt, plus the retry library’s own attempt and retry, would result in three attempts at handling a message rather than two.

If implementing generalized handling of something like MessageStore::ExpectedVersion::Error, a more primitive mechanism, as shown above, should be used. As is usual when implementing the error_raised method, it’s critical that the error passed to the method be re-raised in order to ensure that it can terminate the process as an exceptional condition.

System-Level Process Monitoring as a Retry Mechanism

System-level process monitoring, like systemd, provides an even more primitive means of retrying.

If the process monitor is configured to restart a process on failure, then an error that causes the process to terminate will effectively be retried when the process restarts and the message that was in-process when the error occurred is reprocessed.
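For example, with systemd, restart-on-failure behavior is configured in a service’s unit file. The following fragment is only a minimal sketch; the unit name, the ExecStart path, and the one-second restart delay are assumptions made for illustration.

# account-component.service (hypothetical unit file)
[Service]
ExecStart=/usr/local/bin/start-account-component
Restart=on-failure
RestartSec=1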

Because it’s assumed that handlers are implemented with explicit idempotence protection, it’s usually safe to restart a handler even after a service has terminated due to a fatal error.

At the system level, basic process monitoring provides limited control over process restarts, and necessarily so. Unless a more elaborate, third-party process monitoring solution is being used, typical process monitoring allows only for the basics, like restarting on failure.

When using a system-level process monitor’s restart as a retry mechanism, the process can be put into an infinite loop of restarts. It’s usually a benign condition, but it’s something to be aware of.

It’s All About the Restarts

Every retry is a restart. A retry in a handler restarts the logic in the handler. A retry in a consumer retries the processing of a message by its handlers. A retry of a service is a restart of the service’s process at the system level.

Retrying and restarting are essential capabilities of service implementation. Arguably, service logic that cannot be restarted safely is service logic that isn’t yet ready to be released and used in an operational environment. Irrespective of why a restart is effected, it’s critical to be absolutely certain that service logic is implemented so that it can be restarted safely.

If there’s ever any doubt as to whether any of a service’s handlers can be safely restarted, then restarts shouldn’t be effected at any level, whether at the handler level, the consumer level, or the system process level.