Kraken: Our Home-Grown, Cutting-Edge Operations Platform

Serving our clients better by developing a robust and scalable task execution engine  

To serve our clients well, our system must run certain tasks on their behalf on a daily basis. When we were a younger startup, we used a legacy system to run these tasks. However, as the number of our clients and tasks grew rapidly, we needed a more robust and scalable solution that would allow us to improve throughput while reducing the required human intervention to a minimum.

We have multiple types of tasks, each represented by a workflow graph. The vertices represent the actions to perform, while the edges represent ordering constraints between them: an action may start only after all of its predecessors have completed, so any valid execution follows a topological order of the graph.
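
To make the graph idea concrete, here is a minimal, illustrative sketch (not Kraken’s actual code) of a workflow modeled as a directed graph, with a topological sort that produces an execution order satisfying every constraint; the action names are hypothetical.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative workflow model: vertices are actions, edges are ordering constraints.
    class Workflow
    {
        // Adjacency list: action -> actions that must run after it.
        private readonly Dictionary<string, List<string>> _edges =
            new Dictionary<string, List<string>>();

        public void AddAction(string action)
        {
            if (!_edges.ContainsKey(action))
                _edges[action] = new List<string>();
        }

        public void AddConstraint(string before, string after)
        {
            AddAction(before);
            AddAction(after);
            _edges[before].Add(after);
        }

        // Kahn's algorithm: returns the actions in an order that respects every edge.
        public IEnumerable<string> TopologicalOrder()
        {
            var inDegree = _edges.Keys.ToDictionary(a => a, a => 0);
            foreach (var targets in _edges.Values)
                foreach (var target in targets)
                    inDegree[target]++;

            var ready = new Queue<string>(
                inDegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));

            while (ready.Count > 0)
            {
                var action = ready.Dequeue();
                yield return action;
                foreach (var next in _edges[action])
                    if (--inDegree[next] == 0)
                        ready.Enqueue(next);
            }
        }
    }

    class Demo
    {
        static void Main()
        {
            var workflow = new Workflow();
            workflow.AddConstraint("ImportData", "ComputeSegments");
            workflow.AddConstraint("ComputeSegments", "SendCampaigns");
            // Prints: ImportData -> ComputeSegments -> SendCampaigns
            Console.WriteLine(string.Join(" -> ", workflow.TopologicalOrder()));
        }
    }

Actions with no constraint between them do not depend on each other’s position in this order, which is what later allows them to run in parallel.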

Our requirements were specific enough that there wasn’t any ready-made product we could use. So we built our own system, which we named Kraken, to address them:

  • Centralized control panel: a dashboard that displays the status of all tasks running on all servers, across all regions, down to the level of individual actions, and that allows us to edit tasks
  • High scalability and reliability: a distributed system with no single point of failure, monitored with New Relic
  • Faster execution: task actions are parallelized wherever the workflow allows, shortening overall execution time
  • Multiple versions: every client instance can run on a different version of our subsystems
  • Multiple regions: agents are deployed to multiple different regions, addressing all relevant regulatory, performance and security requirements
  • Multiple platforms: we mainly use .NET, but we also have applications written in node.js (such as the Optimove Realtime Engine) that are also able to process task actions
  • Retry policy: every action is associated with its own retry policy, so transient errors are retried automatically instead of requiring manual intervention (see the sketch after this list)
  • Centralized reporting: every event in the system (e.g., completed tasks and actions) is recorded for BI analysis
  • Cross-system communication: other Optimove subsystems can subscribe to events of interest in order to take action when they occur
  • Security: comprehensive authentication and authorization controls to provide granular user permissions for viewing/editing tasks
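
Regarding the retry-policy requirement, the following is a minimal, hypothetical sketch of a per-action retry policy with exponential backoff; the class and property names are ours for illustration and not Kraken’s.

    using System;
    using System.Threading.Tasks;

    // Illustrative per-action retry policy: each action type declares how many
    // attempts it allows and how long to wait between them.
    class RetryPolicy
    {
        public int MaxAttempts { get; set; } = 5;
        public TimeSpan InitialDelay { get; set; } = TimeSpan.FromSeconds(2);

        public async Task ExecuteAsync(Func<Task> action)
        {
            var delay = InitialDelay;
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    await action();
                    return; // success
                }
                catch (Exception) when (attempt < MaxAttempts)
                {
                    // Transient failure: back off and try again.
                    await Task.Delay(delay);
                    delay = TimeSpan.FromTicks(delay.Ticks * 2);
                }
            }
        }
    }

In a setup like this, each action type can carry its own attempt count and delay, and an action that exhausts its attempts surfaces as a failure instead of being retried forever.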

The Kraken Solution

In order to achieve all of the above, we built Kraken as a distributed system. The components of the system are:

  • Agent – execution unit that fetches commands, along with relevant parameters, and executes them
  • Process Manager – decides when to run each task, distributes workloads across regions (based on each task’s specific type of workflow), receives progress reports from agents and maintains the current execution state of each agent
  • Coordinator – the system’s central UI that allows users to add new tasks, edit task parameters and monitor currently running tasks
  • Scheduler – triggers tasks at their due time, using cron expressions (via the open-source Quartz.NET project) to support highly granular schedules (see the sketch below)
  • Authorization Server – built on the IdentityServer project, an OpenID Connect provider and OAuth 2.0 authorization server framework
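
To illustrate the Scheduler’s role, the sketch below registers a recurring job with a cron trigger using a Quartz.NET 2.x-style API; the job class, identifiers and cron expression are invented for the example, and where this sketch writes to the console, the real Scheduler would publish a ‘Task is Due’ event to the bus.

    using System;
    using Quartz;
    using Quartz.Impl;

    // Illustrative job: in Kraken's terms this is where a 'Task is Due' event
    // would be published; here we only log.
    public class TaskIsDueJob : IJob
    {
        public void Execute(IJobExecutionContext context)
        {
            var taskId = context.JobDetail.JobDataMap.GetString("taskId");
            Console.WriteLine("Task {0} is due at {1:u}", taskId, DateTimeOffset.UtcNow);
        }
    }

    class SchedulerDemo
    {
        static void Main()
        {
            var scheduler = StdSchedulerFactory.GetDefaultScheduler();
            scheduler.Start();

            var job = JobBuilder.Create<TaskIsDueJob>()
                .WithIdentity("daily-import", "kraken")
                .UsingJobData("taskId", "client-42-daily-import")
                .Build();

            // Quartz cron fields: seconds, minutes, hours, day-of-month, month, day-of-week.
            // This trigger fires every day at 02:30.
            var trigger = TriggerBuilder.Create()
                .WithIdentity("daily-import-trigger", "kraken")
                .WithCronSchedule("0 30 2 * * ?")
                .Build();

            scheduler.ScheduleJob(job, trigger);
            Console.ReadLine(); // keep the process alive so the trigger can fire
        }
    }

Quartz.NET can also persist its jobs and triggers to a database, which suits a scheduler that has to survive restarts.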

Event-Driven Architecture

We took an event-driven approach, in which all system components communicate with each other using RabbitMQ; a minimal publish sketch follows the list below. The benefits of this approach include:

  • The inherent benefits of using RabbitMQ are higher reliability, loose coupling, better scaling options and support for multi-platform programming.
  • Optimove subsystems dependent on the progress of certain tasks can subscribe to receive notifications of specific task-related events.
  • Upgrading individual subsystems is easier thanks to temporal decoupling, and is further improved by a continuous-delivery approach.
  • Individual system components can be horizontally scaled to support increased loads.
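
As a minimal illustration of this messaging layer, the sketch below publishes a task-related event to a durable RabbitMQ topic exchange using the older synchronous .NET client API; the exchange name, routing key and payload are invented for the example.

    using System;
    using System.Text;
    using RabbitMQ.Client;

    class EventPublisherDemo
    {
        static void Main()
        {
            var factory = new ConnectionFactory { HostName = "localhost" };
            using (var connection = factory.CreateConnection())
            using (var channel = connection.CreateModel())
            {
                // Durable topic exchange: subscribers bind queues with patterns
                // such as "task.*.completed" to receive only the events they care about.
                channel.ExchangeDeclare("kraken.events", ExchangeType.Topic, durable: true);

                var body = Encoding.UTF8.GetBytes(
                    "{\"taskId\":\"client-42-daily-import\",\"status\":\"Completed\"}");

                var properties = channel.CreateBasicProperties();
                properties.Persistent = true; // survive a broker restart

                channel.BasicPublish("kraken.events",               // exchange
                                     "task.daily-import.completed", // routing key
                                     properties,
                                     body);
            }
        }
    }

Subscribers declare their own queues, bind them to the exchange with the routing patterns they care about, and acknowledge messages only after processing them; that is where much of the reliability and loose coupling comes from.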

The Specifics

For readers interested in additional technical details, read on…

  • Agent
    • The agent monitors the work queue and fetches tasks to run.
    • It limits the number of tasks that it works on concurrently based on current capacity constraints.
    • It publishes progress update messages about tasks to subscribed subsystems.
    • It monitors a control queue for command requests, such as to kill a current task or to gracefully shut down (complete processing of all existing tasks and then shut down – allows for rolling upgrades of agents).
    • An agent can run multiple versions of our applications concurrently. Normally, running multiple versions of a DLL requires multiple processes, but we do it inside a single process using .NET’s AppDomain feature, which provides a layer of isolation within the process (a sketch appears after this list).
  • Process Manager (see diagram below)
    • The process manager tracks the status of all tasks and is based on the “saga” concept (as described by enterprise development expert Udi Dahan in this video).
    • It is essentially a state machine that subscribes to and publishes events on the bus in order to drive each task’s workflow to completion (a minimal sketch appears after this list).
    • Every event that changes the state of a task is persisted to a database.
    • The task’s lifecycle is controlled by the events to which the saga is subscribed:
      • Creation of a new task in the Coordinator initializes the task in the process manager and scheduler – the task is now ready for execution (1).
      • When the task’s scheduled run time arrives, the scheduler publishes a ‘Task is Due’ event (2). The process manager receives it, loads and examines the task’s current state, and sends the relevant commands to the work queue (3).
      • Agents pick up the commands from the work queue (4) and begin executing the action(s). They announce the completion of actions (5), causing the process manager to send subsequent actions to the work queue.
      • When a task’s final action is completed, an event update is sent to all subscribed subsystems (e.g., the Coordinator, which marks it as completed in the grid). Additionally, an email containing an execution summary report is sent to our Operations department to examine any non-critical faults which may have occurred.
  • Coordinator
    • Streams events to the control panel using SignalR, .NET’s real-time messaging library, which uses WebSockets where available (see the sketch after this list).
    • Updates include completed-task events that reach all the way to the client.
  • The façade layer
    • Kraken is decoupled from the underlying code that it runs. We continue adding more tasks and more task-related actions in every version.
    • The façade layer is a single interface through which we communicate with any Optimove subsystem.
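
A few of the points above lend themselves to short sketches. First, the Agent’s multi-version trick: below is a minimal example of loading one version of a worker assembly into its own AppDomain. This relies on the .NET Framework, where custom AppDomains are available, and the interface, directory and type names are hypothetical, not Kraken’s actual contracts.

    using System;

    // Hypothetical contract implemented by the versioned worker assemblies.
    public interface ITaskAction
    {
        void Run(string parameters);
    }

    public static class VersionedLoader
    {
        // Loads one version of a worker DLL into its own isolated AppDomain,
        // so several versions can coexist inside a single agent process.
        public static ITaskAction Load(string versionDirectory, string assemblyName, string typeName)
        {
            var setup = new AppDomainSetup { ApplicationBase = versionDirectory };
            var domain = AppDomain.CreateDomain("worker-" + assemblyName, null, setup);

            // The concrete type must derive from MarshalByRefObject so that what we
            // get back here is a proxy, while the real code stays in its own domain.
            return (ITaskAction)domain.CreateInstanceAndUnwrap(assemblyName, typeName);
        }
    }

Calling Load twice with two different version directories yields two independent copies of the same types, which is exactly what running client instances on different subsystem versions requires.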
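
Second, the process manager’s saga can be pictured as a small, persisted state machine that reacts to bus events. The framework-free sketch below (the message and interface names are assumptions, not Kraken’s actual contracts) mirrors steps (2)–(5): a ‘Task is Due’ event triggers the first ready actions, and each completed action unblocks the next ones until the task is done.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Illustrative message contracts.
    class TaskIsDue       { public Guid TaskId; }
    class ActionCompleted { public Guid TaskId; public string Action; }
    class RunAction       { public Guid TaskId; public string Action; }

    interface IBus
    {
        void Send(RunAction command);                     // to the work queue
        void Publish(string routingKey, object payload);  // to subscribed subsystems
    }

    // One saga instance per task. It remembers which actions have completed and,
    // after every event, sends the newly unblocked actions to the work queue.
    // In a real system the completed/dispatched sets would be persisted after each event.
    class TaskSaga
    {
        private readonly Guid _taskId;
        private readonly IBus _bus;
        // Every action appears as a key; the value lists its prerequisite actions.
        private readonly Dictionary<string, string[]> _prerequisites;
        private readonly HashSet<string> _dispatched = new HashSet<string>();
        private readonly HashSet<string> _completed = new HashSet<string>();

        public TaskSaga(Guid taskId, Dictionary<string, string[]> prerequisites, IBus bus)
        {
            _taskId = taskId;
            _prerequisites = prerequisites;
            _bus = bus;
        }

        // Step (2): the scheduler says the task is due, so dispatch the first ready actions.
        public void Handle(TaskIsDue evt)
        {
            DispatchReadyActions();
        }

        // Step (5): an agent finished an action; record it and dispatch whatever it unblocked.
        public void Handle(ActionCompleted evt)
        {
            _completed.Add(evt.Action);
            if (_completed.Count == _prerequisites.Count)
                _bus.Publish("task.completed", new { TaskId = _taskId });  // final event
            else
                DispatchReadyActions();
        }

        // Step (3): send every action whose prerequisites are all completed to the work queue.
        private void DispatchReadyActions()
        {
            var ready = _prerequisites
                .Where(kv => !_completed.Contains(kv.Key) && !_dispatched.Contains(kv.Key))
                .Where(kv => kv.Value.All(p => _completed.Contains(p)))
                .Select(kv => kv.Key)
                .ToList();

            foreach (var action in ready)
            {
                _dispatched.Add(action);
                _bus.Send(new RunAction { TaskId = _taskId, Action = action });
            }
        }
    }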
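
Finally, pushing updates from the Coordinator to the dashboard with classic ASP.NET SignalR looks roughly like this; the hub and method names are illustrative, and the browser-side JavaScript would handle the ‘taskCompleted’ call.

    using Microsoft.AspNet.SignalR;

    // Hub the dashboards connect to; it needs no server-side methods
    // if the server only pushes updates.
    public class TasksHub : Hub
    {
    }

    public static class DashboardNotifier
    {
        // Called by the Coordinator's event handlers, e.g. when a
        // "task completed" message arrives from the bus.
        public static void TaskCompleted(string taskId)
        {
            var context = GlobalHost.ConnectionManager.GetHubContext<TasksHub>();
            context.Clients.All.taskCompleted(taskId); // invokes the client-side handler
        }
    }

SignalR falls back to Server-Sent Events or long polling when WebSockets are not available, so the dashboard stays live across browsers.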

Conclusion

Designing and implementing a system of such complexity and scale is always a challenge. Introducing a distributed, message-based, event-driven application required us to expand our technology arsenal and to make the changes needed to support it.

With this project we’ve prepared the ground for further important technological advances. The infrastructure we’ve built during this project will serve our future development efforts and will further increase our competitive edge.
