On building a queue system (build vs buy)

Buy instead of build has become an accepted “best practice” in the tech industry in recent years. The argument usually goes: you should only care about the core of your product. In other words, focus on what makes your beer taste better.

This is what I had in mind when I started considering the use of a queuing system for a few tasks in our application. Our basic needs could be summarized as:

  1. We want to enqueue a few tasks
  2. We need to give each task a type
  3. We need to include metadata
  4. We need to execute each task only once
  5. There should be a way to determine the uniqueness of a task. This is to account for the possible creation of many queue items that are all covered by one execution, e.g. many updates to a record that should trigger a single notification.

However, context matters, and we need to accept or reject the common knowledge based on that context.

That brings us to the unavoidable question: why would you build a queuing system instead of using one of the million full-featured services available?

Why build

Let me get this out of the way: we could have used an existing solution.

Would that be better? I would imagine it would be just as effective.

But making it work is only the beginning. It needs to be maintained over time.

Here are the points I considered when making the decision:

  1. The cost we would incur should we decide to build what we needed.
  2. Functional requirements: all the ones listed above.
  3. The cost of adding another tool to maintain in our stack.
  4. The cost of having to learn the new tool.
  5. The cost of having to maintain dependencies.
  6. The cost of having to test/understand what it does and what it doesn’t.

To cut a long story short: our needs were very simple and we assessed that we could go live with a simple implementation that covered strictly what we needed. The whole implementation, including testing, took less than three days of work for a single engineer.

This outweighed the other costs listed above.

How we built it (architecture and decisions)

The core of the queuing system is the data structure in which it resides. The structure consists of:

  • a topic: which we use to determine how to handle the item.
  • a reference id: if the queue item is related to a specific record, we reference it here. E.g. you want to enqueue a notification for a specific user; the user id could go here, or an order id, or anything else that would aid the processing.
  • data: a JSON field where we can add further info that will be used when processing the queue item.

That is it really. Below is the migration used to create a table in our PostgreSQL database:

CREATE TABLE "Queues" (
    "id" TEXT NOT NULL,
    "topic" TEXT NOT NULL,
    "referenceId" TEXT NOT NULL,
    "referenceType" TEXT,
    "data" JSONB,
    "createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "updatedAt" TIMESTAMP(3),

    CONSTRAINT "Queues_pkey" PRIMARY KEY ("id")
);
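
In the Go service, a row from that table maps to a small struct. A minimal sketch (the field names and types here are illustrative rather than the exact code):

// QueueItem mirrors a row in the "Queues" table.
type QueueItem struct {
	ID            string            // uuid stored as text
	Topic         string            // determines how the item is handled
	ReferenceID   string            // id of the related record, if any
	ReferenceType string            // what kind of record the reference id points to
	Data          map[string]string // decoded from the JSONB "data" column
	CreatedAt     time.Time
	UpdatedAt     *time.Time // nullable in the table
}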

After settling on the structure of the queue, we just needed to decide how we’d manage it, and for that we simply created a tiny set of APIs (endpoints) in Go.

We decided to do it in Go for a few reasons:

  1. Easy to write a small service in
  2. Easy to deploy
  3. Very fast to cold start (almost instant)

The cold start aspect was particularly important since our usage comes in bursts, which means that most of the time the service can scale down to zero instances.

Since we are storing the queue in the same database as the rest of the application, a separate service was not a necessity, but we built one anyway to keep the services separated and to leverage the aspects of Go cited above.
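
The endpoints themselves are thin wrappers around the model. Below is a rough sketch of how the create endpoint could be wired up with the standard library’s ServeMux; the route, the request body shape and the Go 1.22+ method patterns are illustrative assumptions, not the exact implementation:

func newServer(queues *QueuesModel) http.Handler {
	mux := http.NewServeMux()

	// POST /queues creates a new queue item from a small JSON body.
	// The "POST /queues" method pattern assumes Go 1.22+.
	mux.HandleFunc("POST /queues", func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Topic         string            `json:"topic"`
			ReferenceId   string            `json:"referenceId"`
			ReferenceType string            `json:"referenceType"`
			Data          map[string]string `json:"data"`
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "invalid body", http.StatusBadRequest)
			return
		}
		if err := queues.Create(body.Topic, body.ReferenceId, body.ReferenceType, body.Data); err != nil {
			http.Error(w, "could not enqueue", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})

	return mux
}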

The features of the system were:

  • Create a queue item: a query that generates a new item with topic, data, etc.
  • Delete a queue item: deletes an item by its id.
  • Delete all for a topic/reference id combination: clears the queue based on a combination of topic and reference id, e.g. deletes all items for send-email for user id example-user-id.
  • Fetch queue items (by various criteria)

The simple design allowed the operations to consist of very simple SQL queries and limited business logic. See an example of the model that creates a new queue item:

func (m *QueuesModel) Create(topic, referenceId, referenceType string, data map[string]string) error {
	q := `
	INSERT INTO "Queues" (
	"id",
	"topic",
	"referenceId",
	"referenceType",
	"data",
	"createdAt",
	"updatedAt"
	) VALUES ($1, $2, $3, $4, $5, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);
	`

	// Every item gets a generated uuid as its id.
	id := uuid.New().String()

	// Serialize the optional metadata; an empty map is stored as NULL.
	var dataJSON interface{}
	if len(data) > 0 {
		jsonData, err := json.Marshal(data)
		if err != nil {
			return err
		}
		dataJSON = string(jsonData)
	}

	_, err := m.DB.Exec(context.Background(), q, id, topic, referenceId, referenceType, dataJSON)
	return err
}
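
The other operations are just as small. As an illustration, deleting everything for a topic/reference id combination boils down to a single statement (the method name here is hypothetical):

func (m *QueuesModel) DeleteAllForReference(topic, referenceId string) error {
	q := `
	DELETE FROM "Queues"
	WHERE "topic" = $1 AND "referenceId" = $2;
	`

	// Clearing the whole combination is what lets many queue items be
	// covered by a single execution, e.g. several updates to a record
	// that should only trigger one notification.
	_, err := m.DB.Exec(context.Background(), q, topic, referenceId)
	return err
}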

We specifically decided not to build a way to mark items as “checked out”, which is a common feature to avoid multiple workers processing the same item.

Instead we process items sequentially: the workers that handle a topic run in a single process and work through the items one at a time. If we don’t have multiple workers, we can’t have multiple workers processing the same item at the same time. ;D

Additionally, we run the workers on a CRON schedule and make sure to have a single schedule per task.
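
Put together, a worker is just a function that the schedule invokes for its topic, works through the current items one by one, and deletes what it handled. A rough sketch, assuming a fetch-by-topic helper and a delete-by-id helper on the model (both names are illustrative):

// processTopic runs once per CRON invocation. Since there is only one
// schedule per task, items for a topic are never processed concurrently.
// FetchByTopic and Delete are hypothetical helpers on the model.
func processTopic(queues *QueuesModel, topic string, handle func(QueueItem) error) error {
	items, err := queues.FetchByTopic(topic)
	if err != nil {
		return err
	}
	for _, item := range items {
		if err := handle(item); err != nil {
			// Leave the item in place so the next scheduled run retries it.
			continue
		}
		if err := queues.Delete(item.ID); err != nil {
			return err
		}
	}
	return nil
}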

This is a part of the current design that I expect we’ll need to change in the future as we scale and need more complex processing. However, we aim not to build preemptively.

In closing

We have now used this system in production for a few months and it has proven to work as expected.

It has performed extremely well in handling our survey periods at Zoios where we processed hundreds of thousands of survey answers, calculations and more.

Having built it as a separate service written in Go allowed us to leverage the ease of building, the ease of deployment, and the startup speed that Go has become known for.

Drop me a line if you are curious about any other details.