Scaling to Infinity: Innovating the Worker Queue

When building applications, it's common to coordinate frontends, databases, worker queues, and APIs. Traditionally, queues and APIs are kept separate: queues handle long-running or CPU-intensive tasks, while APIs serve quick responses to keep applications snappy.

Here at Paragon, we were faced with the question: how do we build a performant API for a queue that handles all our customers' tasks?

For those not familiar with Paragon, Paragon provides a visual builder for creating APIs and workflows. Users can build cron jobs and API endpoints in minutes that connect to databases, third-party APIs and services, and logic or custom code for routing requests and transforming data. With that context, we had to build a worker queue that could support the following use cases:
- users can run arbitrary code (steps) on the server
- the system should scale to execute zero, thousands, or millions of steps in parallel
- a series of steps (a workflow) can be triggered by an API request or a scheduled event
- if triggered by an API request, the output of the workflow should be returned in the API response nearly instantaneously

Sounds like a Herculean feat, right? Spoiler alert: it was. This blurs the line between a worker queue and an API, and there were no common engineering paradigms to draw from. As you can imagine, the security and performance implications of these product requirements kept our engineering team busy for some time.

Introducing: Hermes + Hercules

Due to the complexity, performance, and security requirements of our platform, we've had to innovate on the API + worker queue construct a bit. We created Hermes (our API messaging service) and Hercules (our workflow executor) to solve these problems.
Hermes accepts API requests, sends them to Hercules for execution, waits for the specified workflow and its steps to complete (or fail), then sends a response back to any awaiting clients. They're entirely separate services, but they communicate to receive, schedule, and execute work.
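To make the handshake concrete, here's a minimal sketch of the pattern in Python's asyncio. All names here (`hermes_handle_request`, `hercules_worker`, `run_workflow`) are hypothetical stand-ins; the real services are separate processes communicating over the network, but the shape is the same: the API side parks each request on a future, the executor side resolves it when the workflow finishes.

```python
import asyncio
import uuid

# Hypothetical in-process stand-ins for the Hermes -> Hercules handoff.
pending: dict[str, asyncio.Future] = {}  # request id -> awaiting client
queue: asyncio.Queue = asyncio.Queue()   # jobs handed off to the executor


async def run_workflow(payload):
    # Placeholder for executing the workflow's steps.
    return {"echo": payload}


async def hermes_handle_request(payload):
    """Accept an API request, enqueue it, and wait for the workflow result."""
    job_id = str(uuid.uuid4())
    fut = asyncio.get_running_loop().create_future()
    pending[job_id] = fut
    await queue.put((job_id, payload))
    result = await fut          # block this request, not the event loop
    del pending[job_id]
    return result               # becomes the API response body


async def hercules_worker():
    """Pull jobs off the queue, run the workflow, resolve the waiting future."""
    while True:
        job_id, payload = await queue.get()
        result = await run_workflow(payload)
        pending[job_id].set_result(result)
```

The key design choice is that waiting is cheap: an awaiting request holds a future, not a thread, so the API side can park thousands of in-flight workflows without blocking.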

One might expect the added complexity and latency of submitting jobs to a worker queue and waiting for a response to slow down the API. We were quite pleased to find the opposite: our APIs got much faster, particularly when processing large arrays of data.

Thanks to Hercules' ability to self-monitor and autoscale, we can distribute work across processes and run it in parallel. Additionally, if a branch of steps fails, the other branches can continue to run successfully without terminating the request, adding consistency and reliability to workflows.
