
Low-Latency Event-Driven Workflow Orchestration for LLM Systems

December 2025
Topics: Distributed Systems, Workflow Orchestration, DAG Execution, Rate Limit Management

Abstract

This paper describes a workflow orchestration engine designed for low-latency, event-driven job execution in LLM systems. The architecture addresses the unique challenges of executing AI workflows that depend on rate-limited external API providers, where managing requests-per-second (RPS) and tokens-per-second (TPS) limits across distributed workers is critical to preventing mass rate limit failures.

The system implements a multi-queue architecture supporting DAG (Directed Acyclic Graph) workflows with dependency management, priority-based scheduling, and a two-stage resource allocation model. Virtual resource pools enforce global quotas (API rate limits, token budgets) separately from physical worker assignment (CPU, memory).

Because scheduling and job execution are fully event-driven, the system introduces no unnecessary delay between stages — from workflow submission to resource allocation to job delivery, each transition fires immediately. The event-driven architecture is designed for end-to-end orchestration overhead on the order of milliseconds, with the dominant latency being the actual job execution time itself.

Workflow definitions use a subset of GitHub Actions YAML syntax, enabling AI systems to author, validate, and execute complex multi-step workflows programmatically.

Note: Formal benchmarks with detailed methodology and hardware specifications are in progress and will be published on our GitHub. The latency characteristics described in this paper reflect design properties and goals of the event-driven architecture; published measurements will follow.

1. The Problem

Distributed job execution of LLM workflows depending on external API providers presents several interconnected challenges:

Rate Limit Management at Scale

External LLM providers impose strict rate limits at multiple levels:

  • Requests per second (RPS): Maximum API calls within a time window
  • Tokens per second (TPS): Maximum tokens processed per unit time
  • Concurrent request limits: Maximum simultaneous in-flight requests
  • Per-model and per-provider limits: Different quotas for different models

When 50 workers simultaneously attempt API calls against a 50 RPS limit, many of them can receive rate limit errors because their request timing is uncoordinated within the provider's measurement window.

DAG Workflow Dependencies

Complex LLM workflows involve multiple interdependent steps: data extraction, transformation, multiple LLM calls, result aggregation, and post-processing. Job B cannot start until Job A completes successfully.

Low-Latency Requirements

Because these workflows power real-time user interfaces, the orchestration layer must add minimal overhead. Every unnecessary delay between job completion and the next job's start compounds into perceptible user latency. The scheduling system itself should never be the bottleneck — given available resources, a job should begin executing as fast as the system can deliver it.

LLM-Writable Workflow Definitions

AI agents must author workflows dynamically. A subset of GitHub Actions YAML syntax provides a format with extensive training data, clear semantics, and proven LLM writability.

2. Why Build a Custom Solution?

We were unable to find an existing workflow orchestration system that brings together the specific combination of capabilities required for this use case. The key requirements that drove a custom implementation:

What We Needed

  • Event-driven scheduling - No polling-based scheduling loops. When a job completes or resources become available, the next scheduling decision should fire immediately via event callbacks, not on a timer
  • Built-in rate limit management - First-class support for LLM provider quotas (RPS, TPS, concurrent requests) as virtual resource pools that gate job scheduling, not just retry-after-failure
  • Two-stage resource allocation - Separate enforcement of global business constraints (API quotas) from physical compute assignment (CPU, memory, GPU), because these are fundamentally different resource types
  • DAG workflow execution - Native support for job dependency graphs where downstream jobs execute immediately when their dependencies complete
  • LLM-writable workflow format - A workflow definition format that AI agents can reliably generate, with extensive training data and clear semantics
  • Centralized quota enforcement - Rate limits must be enforced at a central point, not distributed across workers where coordination is impractical

Gap Analysis

Existing tools excel in their respective domains — general-purpose task queues, batch DAG schedulers, visual workflow builders, and durable workflow engines each solve important problems. However, few if any combine event-driven scheduling with built-in virtual resource pools for API rate limits as a first-class scheduling primitive. Most workflow systems treat external API quotas as an application-level concern (retry on 429 errors) rather than a scheduling-level concern (prevent over-scheduling in the first place).

The core gap: virtual resource allocation — managing rate-limited third-party API integrations as a first-class scheduling constraint rather than an application-level retry problem.

3. Design Influences

GitHub Actions: YAML Workflow Syntax

  • Developer Familiarity: runs-on, needs, and steps keywords are immediately recognizable
  • LLM-Writability: Extensive training data means LLMs generate valid workflows
  • Declarative Simplicity: YAML's readability reduces errors

Kubernetes Scheduler

  • Label-Based Matching: runs-on mirrors node selectors
  • Allocation Strategies: BinPacking and Spread placement strategies are implemented

DAG Theory

  • Kahn's Algorithm: Used for cycle detection and topological sorting (see the sketch after this list)
  • Transitive Reduction: Track only unsatisfied dependencies
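
For concreteness, here is a minimal sketch of Kahn's algorithm applied to a job dependency map. The deps representation and function name are illustrative only, not the engine's actual data structures.

import "fmt"

// topoSort returns a topological order of job IDs, or an error if the
// dependency graph contains a cycle (Kahn's algorithm).
// deps maps each job ID to the IDs it depends on (its "needs" list).
func topoSort(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}        // job -> number of unsatisfied dependencies
	dependents := map[string][]string{} // dependency -> jobs waiting on it

	for job, needs := range deps {
		if _, ok := indegree[job]; !ok {
			indegree[job] = 0
		}
		for _, d := range needs {
			indegree[job]++
			dependents[d] = append(dependents[d], job)
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
		}
	}

	// Seed the queue with jobs whose dependencies are already satisfied.
	var queue, order []string
	for job, n := range indegree {
		if n == 0 {
			queue = append(queue, job)
		}
	}

	for len(queue) > 0 {
		job := queue[0]
		queue = queue[1:]
		order = append(order, job)
		for _, waiting := range dependents[job] {
			indegree[waiting]--
			if indegree[waiting] == 0 {
				queue = append(queue, waiting)
			}
		}
	}

	// Any job that never reached indegree zero is part of a cycle.
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("workflow contains a dependency cycle")
	}
	return order, nil
}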

4. Architecture Overview

+--------------------------------------------------------------+
|                    QUEUE MANAGER SERVICE                      |
+--------------------------------------------------------------+
|  +--------------------------------------------------------+  |
|  |                   ORCHESTRATOR                          |  |
|  |  * Workflow lifecycle management                        |  |
|  |  * Job state machine                                    |  |
|  |  * Dependency tracking                                  |  |
|  +--------------------------------------------------------+  |
|         |                    |                    |           |
|  +------------+    +--------------+    +----------------+     |
|  | SCHEDULER  |    | EXEC COORD   |    | HEALTH MONITOR |     |
|  | * Priority |    | * Timeouts   |    | * Heartbeats   |     |
|  | * Ordering |    | * Callbacks  |    | * Worker mgmt  |     |
|  +------------+    +--------------+    +----------------+     |
|         |                                                     |
|  +------------------------------------------------------+     |
|  |              RESOURCE MANAGERS                        |     |
|  |  +-------------------+    +------------------------+  |     |
|  |  | Virtual (Quotas)  |    | Physical (Workers)     |  |     |
|  |  | * OPENAI_RPS: 50  |    | * CPU: 16000m          |  |     |
|  |  | * ANTHROPIC_TPS:  |    | * Memory: 64GB         |  |     |
|  |  |   100000          |    | * GPU: 4               |  |     |
|  |  +-------------------+    +------------------------+  |     |
|  +------------------------------------------------------+     |
+--------------------------------------------------------------+

Worker Labels

Workers register with labels describing their capabilities:

  • Hardware labels: gpu, high-memory, nvme-storage
  • Network labels: internet-access, vpc-internal-only
  • Software labels: python3.11, cuda-12, ffmpeg
  • Environment labels: production, staging, development
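
As a rough illustration of label-based matching, the helper below checks whether a worker's registered labels satisfy a job's runs-on list. The treatment of any as a wildcard is an assumption based on the example workflow later in this paper, not a documented rule.

// matchesLabels reports whether a worker advertises every label a job
// requests via runs-on. The "any" label is treated here as a wildcard
// that matches every worker (an assumption for this sketch).
func matchesLabels(runsOn []string, workerLabels map[string]bool) bool {
	for _, label := range runsOn {
		if label == "any" {
			continue
		}
		if !workerLabels[label] {
			return false
		}
	}
	return true
}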

Job State Machine

Submitted -> Pending --(deps satisfied)--> Ready
                                             |  (priority ordering)
                                             |  (virtual resources reserved)
                                             |  (physical worker assigned)
                                             v
                                  Assigned -> Running
                                                 |
                         +-------------+---------+---+-------------+
                         |             |             |             |
                     Completed      Failed       Timeout      Cancelled

When a job fails or times out, all downstream dependent jobs are automatically cancelled (cascade cancellation). This fail-fast behavior prevents wasted compute on workflows that cannot complete.

5. Technical Implementation

Two-Stage Resource Allocation

The core architectural pattern separates global quota enforcement from physical worker assignment:

Stage 1: Virtual Resource Reservation

Job Ready (passed priority ordering)
    |
    v
Check virtual resource pools:
    +-- OPENAI_RPS: need 5, pool has 50 available
    +-- ANTHROPIC_TPS: need 1000, pool has 50000 available
    |
    v
Reserve from ALL required pools atomically
    |
    +-- If ANY pool insufficient: Move to virtual resource wait queue
    +-- If ALL pools sufficient: Proceed to Stage 2

Stage 2: Physical Worker Assignment

Virtual resources reserved
    |
    v
Find workers matching labels (runs-on: ["gpu", "internet-access"])
    |
    v
For each matching worker, check:
    +-- CPU: need 2000m, worker has 4000m available
    +-- Memory: need 8GB, worker has 16GB available
    |
    v
Apply strategy (least-loaded or best-fit)
    |
    +-- If no worker available: Release virtual resources, move to physical wait queue
    +-- If worker found: Assign job, deliver via gRPC stream

Why Separation Matters: An API quota pool of 50 RPS prevents scheduling 51+ concurrent API-calling jobs even if 200 worker slots are available.
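
To make the virtual pool concept concrete, here is a minimal counter-based sketch of a quota pool supporting reservation and release. Names and structure are illustrative, not the production implementation.

import (
	"fmt"
	"sync"
)

// VirtualPool models a global quota (e.g., OPENAI_RPS: 50) as reservable capacity.
type VirtualPool struct {
	mu       sync.RWMutex
	name     string
	capacity int64
	reserved int64
}

// Reserve takes n units from the pool, failing without side effects
// if the remaining capacity is insufficient.
func (p *VirtualPool) Reserve(n int64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.reserved+n > p.capacity {
		return fmt.Errorf("pool %s: need %d, only %d available",
			p.name, n, p.capacity-p.reserved)
	}
	p.reserved += n
	return nil
}

// Release returns n units when a job completes or its reservation is rolled back.
func (p *VirtualPool) Release(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.reserved -= n
	if p.reserved < 0 {
		p.reserved = 0
	}
}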

Rate Limit Modeling: Capacity Slots vs. True Rate Limiting

The virtual resource pool model treats RPS and TPS limits as reservable capacity slots — a job declares it needs 5 RPS, and the pool tracks 50 total available. This is an approximation: RPS is fundamentally a rate over time, not a concurrent capacity count. A job holding a "5 RPS" reservation might burst above that rate momentarily or might make no requests during idle periods. The pool model works well as a concurrency-gating mechanism but is not equivalent to true rate limiting (e.g., sliding window or token bucket algorithms).

In practice, the system supports two enforcement strategies:

  • Estimation-based reservation: The job declares its expected resource consumption at submission time, either specified explicitly or derived from historical execution data for that job type. The scheduler reserves capacity accordingly. This approach is simple and fast, but carries a risk of slight over-utilization if the job exceeds its estimate, or under-utilization if it consumes less than reserved.
  • Enforcement-based reservation: The reservation parameters are passed down to the job execution runtime, which actively enforces the rate limit during execution. If the job would exceed its reserved rate, it applies an artificial slowdown to stay within bounds. This approach prevents over-utilization at the cost of slightly more complex job execution logic.

Over-utilization is the more critical concern because exceeding provider rate limits results in rejected requests that waste both time and compute. Under-utilization reduces throughput efficiency but does not cause failures, making it the less urgent problem to optimize.
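
The enforcement-based strategy can be sketched with a simple pacer that the job runtime calls before each API request. This hand-rolled version is purely illustrative (a token bucket or sliding window would serve equally well), and the type and constructor names are hypothetical.

import (
	"context"
	"time"
)

// RatePacer spaces out calls so a job never exceeds its reserved
// requests-per-second. One pacer per job; not safe for concurrent use.
type RatePacer struct {
	interval time.Duration // minimum gap between successive calls
	next     time.Time     // earliest time the next call may fire
}

func NewRatePacer(reservedRPS float64) *RatePacer {
	return &RatePacer{interval: time.Duration(float64(time.Second) / reservedRPS)}
}

// Wait blocks until the next call is allowed, applying the "artificial
// slowdown" described above when the job would otherwise burst past its reservation.
func (p *RatePacer) Wait(ctx context.Context) error {
	now := time.Now()
	if p.next.After(now) {
		select {
		case <-time.After(p.next.Sub(now)):
			now = p.next
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	if p.next.Before(now) {
		p.next = now
	}
	p.next = p.next.Add(p.interval)
	return nil
}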

Atomic Multi-Pool Reservation

Atomicity across multiple virtual resource pools is achieved through a scheduler mutex that protects the two-stage allocation decision for each job. When a job requires resources from multiple pools (e.g., both OPENAI_RPS and OPENAI_TPS), the scheduler reserves from each pool in sequence under a single lock, then assigns a physical worker. If any pool has insufficient capacity or no suitable worker is available, all prior reservations for that job are rolled back before the lock is released. This guarantees that partial reservations never leak.
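
A simplified sketch of this rollback pattern follows. The quotaPool interface matches the VirtualPool sketch from the previous subsection, and the Scheduler type shown is a placeholder; the real allocation path also assigns a physical worker under the same lock.

import "sync"

// quotaPool is the minimal behavior the scheduler needs from a virtual pool.
type quotaPool interface {
	Reserve(n int64) error
	Release(n int64)
}

type Scheduler struct {
	mu sync.Mutex // the scheduler mutex protecting the allocation decision
}

// reserveAll takes capacity from every required pool under the scheduler
// mutex. If any pool is short, reservations taken so far are rolled back,
// so partial reservations never leak.
func (s *Scheduler) reserveAll(requirements map[quotaPool]int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	var taken []quotaPool
	for pool, amount := range requirements {
		if err := pool.Reserve(amount); err != nil {
			for _, p := range taken {
				p.Release(requirements[p])
			}
			return err // caller moves the job to the virtual resource wait queue
		}
		taken = append(taken, pool)
	}
	return nil
}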

Importantly, the scheduler mutex serializes only the resource allocation decision itself — it does not serialize the entire system. The architecture runs multiple specialized goroutines concurrently:

  • Scheduling goroutine: Processes jobs from the priority queue and makes allocation decisions. Triggered by condition variable signals rather than polling, so it wakes immediately when new work is available.
  • Drain goroutine: Monitors virtual and physical resource wait queues and moves jobs back to the ready queue when resources become available.
  • Health monitoring goroutine: Tracks worker liveness via heartbeats on a fixed interval.
  • Event processing goroutine: Handles job lifecycle events (started, completed, failed, timeout) through a buffered channel, decoupling event ingestion from processing.
  • Per-job execution goroutines: Each scheduled job runs in its own goroutine, so job execution does not block scheduling.

The primary serialization point is the scheduler mutex during two-stage allocation. Each virtual resource pool maintains its own read-write lock, and each worker maintains independent state protected by its own lock. Under high concurrency, the scheduler mutex becomes the throughput bottleneck — a known tradeoff that favors correctness and simplicity over maximum scheduling throughput. For the target workload (LLM API calls where individual job execution takes seconds to minutes), the scheduling overhead per job is negligible relative to execution time. Potential future optimizations include batch scheduling (allocating multiple jobs per lock acquisition) and finer-grained locking that separates the virtual and physical allocation phases.

Event-Driven Scheduling

Scheduling and job execution are fully event-driven — no polling loops determine when the next job should run. Every state transition triggers immediate re-evaluation:

  • Job completion: Dependent jobs are evaluated and enqueued immediately via a condition variable signal
  • Resource release: Jobs waiting for virtual or physical resources are drained back to the ready queue and reconsidered
  • Worker communication: Jobs are delivered to workers via persistent gRPC streams, eliminating request-response overhead

Periodic health monitoring (heartbeats, worker liveness) does use interval-based checks, as disconnect detection cannot always rely on event-driven mechanisms alone. However, the scheduling and execution path — from job submission through resource allocation to worker delivery — contains no polling.
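
The condition-variable pattern behind the scheduling goroutine can be sketched as follows. The real system uses priority heaps per job state rather than the plain slice shown here, and Job is a placeholder type.

import "sync"

type Job struct{ ID string } // placeholder for the engine's job record

// ReadyQueue wakes the scheduling goroutine the moment work arrives,
// instead of relying on a polling interval.
type ReadyQueue struct {
	mu   sync.Mutex
	cond *sync.Cond
	jobs []*Job // the real system keeps a priority heap per state
}

func NewReadyQueue() *ReadyQueue {
	q := &ReadyQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Push is called from event handlers (job completed, resources released).
func (q *ReadyQueue) Push(j *Job) {
	q.mu.Lock()
	q.jobs = append(q.jobs, j)
	q.mu.Unlock()
	q.cond.Signal() // wake the scheduler immediately
}

// Pop blocks until a job is available, then hands it to the scheduler.
func (q *ReadyQueue) Pop() *Job {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.jobs) == 0 {
		q.cond.Wait()
	}
	j := q.jobs[0]
	q.jobs = q.jobs[1:]
	return j
}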

Token Requirement Estimation

TPS-based virtual resource pools require knowing a job's approximate token consumption before scheduling. Pre-execution token estimation for LLM calls is inherently imprecise — it depends on prompt length, maximum output token settings, and model-specific tokenization. The system supports two mechanisms for determining token requirements:

  • Explicit declaration: The workflow submission specifies token requirements per job. This is appropriate when the submitting service has sufficient context to estimate consumption (e.g., known prompt templates with predictable lengths).
  • Historical profiling: The system monitors token consumption from previous executions of each job type and computes a rolling estimate biased toward the high end (closer to observed maximums than averages). This approach improves over time as execution history accumulates and avoids the need for callers to manually estimate token usage.
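
A rough sketch of the historical-profiling approach: keep a bounded window of observed token counts per job type and estimate near the high end. The 95th-percentile choice, window size, and type names are assumptions for illustration, not documented parameters.

import "sort"

// TokenProfiler keeps recent token consumption samples per job type and
// returns an estimate biased toward observed maximums.
type TokenProfiler struct {
	window  int
	samples map[string][]int // jobType -> recent observed token counts
}

func NewTokenProfiler(window int) *TokenProfiler {
	return &TokenProfiler{window: window, samples: map[string][]int{}}
}

// Observe records the token consumption of a finished job.
func (t *TokenProfiler) Observe(jobType string, tokens int) {
	s := append(t.samples[jobType], tokens)
	if len(s) > t.window {
		s = s[len(s)-t.window:] // keep only the most recent samples
	}
	t.samples[jobType] = s
}

// Estimate returns a high-end estimate (95th percentile of recent samples),
// falling back to a caller-provided default when no history exists.
func (t *TokenProfiler) Estimate(jobType string, fallback int) int {
	s := t.samples[jobType]
	if len(s) == 0 {
		return fallback
	}
	sorted := append([]int(nil), s...)
	sort.Ints(sorted)
	idx := (len(sorted)*95 + 99) / 100 // ceil(0.95 * n)
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}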

6. Workflow Definition Syntax

Workflow definitions use a subset of GitHub Actions YAML syntax. This choice is deliberate: GitHub Actions YAML is well-represented in LLM training data, making it reliably writable by AI agents.

Example Workflow

name: content-processing-pipeline
env:
  SOURCE_ID: "src-abc123"

jobs:
  extract:
    runs-on: any
    steps:
      - id: fetch-content
        name: Fetch source content
        run: fetch-content --source $SOURCE_ID --output /tmp/raw.json
    outputs:
      content:
        value: ${{ steps.fetch-content.outputs.result }}

  analyze:
    runs-on: any
    needs: extract
    env:
      INPUT: ${{ jobs.extract.outputs.content }}
    steps:
      - id: run-analysis
        name: Run LLM analysis
        run: analyze --input "$INPUT" --model gpt-4
        timeout-minutes: 10
    outputs:
      analysis:
        value: ${{ steps.run-analysis.outputs.result }}
        storage: s3

  summarize:
    runs-on: any
    needs: extract
    steps:
      - name: Generate summary
        run: summarize --input "${{ jobs.extract.outputs.content }}"

  aggregate:
    runs-on: any
    needs: [analyze, summarize]
    steps:
      - name: Combine results
        run: aggregate --analysis "${{ jobs.analyze.outputs.analysis }}"

In this example, analyze and summarize run in parallel after extract completes, and aggregate waits for both to finish before executing.

Supported GitHub Actions Subset

Feature                                          Status
jobs.<id>.needs (dependencies)                   Supported
jobs.<id>.runs-on (worker labels)                Supported
jobs.<id>.steps[*] (run, shell, env, if)         Supported
jobs.<id>.outputs (inter-job data passing)       Supported
jobs.<id>.strategy.matrix (parallel variants)    Supported
jobs.<id>.if (conditional execution)             Supported
env (workflow, job, step scopes)                 Supported
Expression syntax (${{ }})                       Supported
jobs.<id>.timeout-minutes                        Supported
jobs.<id>.continue-on-error                      Supported

Custom Extensions

The system extends GitHub Actions syntax in two areas:

  • Output storage: Job outputs can specify storage: s3 to enable durable storage of large results, or default to local to use in-memory passing. This is not part of the GitHub Actions specification
  • Resource requirements: Jobs can declare resources with cpu (millicores), memory (MB), and tags used in physical allocation. This replaces GitHub Actions' runner-group concept with explicit resource requests

What We Removed

The following GitHub Actions features are intentionally excluded:

  • Event triggers (on: push, on: schedule) - Workflows are submitted programmatically via gRPC, not triggered by repository events
  • Container configuration (jobs.<id>.container) - Execution targets are worker processes, not containers
  • Reusable workflows (jobs.<id>.uses — workflow references) - Currently each workflow is self-contained
  • Actions marketplace (steps[*].uses — third-party actions) - Steps execute shell commands directly

7. Failure Handling and Resilience

Design Philosophy: Ephemeral Jobs

The system is designed around the principle that workflow jobs are primarily ephemeral. A job either succeeds and produces an output, or fails and can be retried or resubmitted. The information a workflow needs is either contained in its YAML definition or referenced by IDs that point to durable storage (databases, object stores). Jobs do not maintain long-lived local state.

The target workloads for this system are LLM-centric: calling language model APIs, collecting and processing results, and saving data into storage systems. Workflows are not expected to perform transactional side effects such as sending emails, processing payments, or making irreversible external state changes. That responsibility belongs to the service that submits the workflow — it receives the workflow results and then performs any transactional operations with its own guarantees. This separation means that replaying or retrying workflow jobs does not risk duplicating critical side effects.

Retry Semantics

Failed jobs support automatic retries with configurable backoff, defined per-job in the workflow definition. When a job fails due to a transient error (e.g., a rate limit rejection or temporary network issue), the system can retry the individual job without re-running the entire workflow. The retry policy includes:

  • Max retries: Configurable per job (e.g., retry up to 3 times)
  • Backoff strategy: Configurable delay between retries, supporting fixed and exponential backoff
  • Scope: Retries apply to the individual failed job, not the entire workflow. If job 10 in a 10-step workflow fails due to a rate limit, only job 10 is retried — the preceding 9 jobs do not re-execute
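
The backoff computation itself is small; the sketch below shows the fixed and exponential variants with hypothetical field names (the actual configuration lives in the per-job workflow definition).

import "time"

// RetryPolicy mirrors the per-job retry settings described above.
type RetryPolicy struct {
	MaxRetries  int
	BaseDelay   time.Duration
	Exponential bool
}

// NextDelay returns the wait before retry attempt n (1-based),
// or false if the retry budget is exhausted.
func (r RetryPolicy) NextDelay(attempt int) (time.Duration, bool) {
	if attempt > r.MaxRetries {
		return 0, false
	}
	if !r.Exponential {
		return r.BaseDelay, true // fixed backoff
	}
	return time.Duration(1<<(attempt-1)) * r.BaseDelay, true // 1x, 2x, 4x, ...
}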

When a job exhausts its retry budget, the standard failure path applies: cascade cancellation of downstream dependents, or the entire workflow can be resubmitted by the calling service. Full workflow resubmission (re-executing all steps from the beginning) is only necessary in cases of complete state loss, such as a queue manager restart — an event expected to be rare in production.

Worker Health Monitoring

Workers send periodic heartbeats via their gRPC connection. The health monitor checks worker liveness on a fixed interval:

  • Heartbeat timeout: If no heartbeat is received within the configured threshold, the worker is marked unhealthy
  • Offline detection: Workers exceeding the offline threshold are removed from scheduling consideration
  • Graceful drain: Workers can signal a drain, allowing in-flight jobs to complete before the worker is removed
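
A minimal sketch of the interval-based liveness classification; the thresholds, states, and field names are illustrative, not the actual implementation.

import (
	"sync"
	"time"
)

type WorkerState int

const (
	Healthy WorkerState = iota
	Unhealthy
	Offline
)

// WorkerHealth tracks the most recent heartbeat per worker ID.
type WorkerHealth struct {
	mu            sync.Mutex
	lastHeartbeat map[string]time.Time
}

// Check classifies each worker by how long ago its last heartbeat arrived.
// It is intended to be called on a fixed interval by the health monitor.
func (h *WorkerHealth) Check(unhealthyAfter, offlineAfter time.Duration) map[string]WorkerState {
	h.mu.Lock()
	defer h.mu.Unlock()
	now := time.Now()
	states := map[string]WorkerState{}
	for id, last := range h.lastHeartbeat {
		switch age := now.Sub(last); {
		case age > offlineAfter:
			states[id] = Offline // removed from scheduling consideration
		case age > unhealthyAfter:
			states[id] = Unhealthy
		default:
			states[id] = Healthy
		}
	}
	return states
}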

Job Failure and Resource Reclamation

When a job fails, times out, or its worker is lost, a deterministic cleanup sequence runs:

  1. The job is marked with its terminal state (Failed, Timeout, or Cancelled)
  2. Physical resources on the worker are released
  3. Virtual resource reservations across all pools are released
  4. The scheduling loop is signaled, which drains waiting jobs back to the ready queue to reconsider now-available resources
  5. All downstream dependent jobs are cascade-cancelled (unless the failed job is configured with continue-on-error)

In-Memory State: A Deliberate Tradeoff

Limitation: The orchestrator maintains all workflow and job state in memory. If the queue manager process restarts, all in-flight workflow state is lost. For a system powering real-time user interfaces, this is the most significant production risk in the current architecture.

This is a deliberate design choice: database-backed state persistence (writing transactions on every state change, maintaining WAL logs) would add latency to every scheduling decision, directly contradicting the low-latency design goal. The system intentionally avoids persistent storage in the scheduling hot path.

The risk is mitigated through a service responsibility model: each service that submits workflows to the queue manager is responsible for tracking its own submitted work. If a workflow exceeds an expected SLA or is no longer found in the queue manager (e.g., after a restart), the submitting service is responsible for detecting this and resubmitting the workflow. The queue manager is designed to be ephemeral by nature — it is the execution engine, not the source of truth for what work needs to be done.

Workers detect queue manager disconnection via heartbeat timeout and reconnect automatically when the service recovers.

High-Availability Roadmap

The strategy under exploration for high availability is consensus-based in-memory replication (e.g., a Raft-style protocol) that keeps state in memory across multiple nodes, potentially spanning regions. This approach would provide failover without introducing persistent storage latency into the scheduling path. Combined with cluster sharding (see Section 8), this would allow the system to tolerate node failures while maintaining its low-latency characteristics.

8. Scalability

Current Architecture

The queue manager is currently a single-process system. All scheduling decisions, resource accounting, and workflow state management happen within one process. This simplifies atomicity guarantees (the scheduler mutex protects multi-pool reservations) and eliminates distributed coordination overhead, but it means the queue manager is a single point of failure. As discussed in Section 7, this is mitigated by the service responsibility model — submitting services track their own workflows and can resubmit after a queue manager recovery.

Horizontal Scaling Path

Two complementary approaches address scaling beyond a single node:

  • Cluster sharding: Partition workloads by type (e.g., analytical workflows vs. real-time user-facing workflows) or by submitting service, with each shard running its own queue manager instance. This avoids mixing workloads with different latency and throughput characteristics and provides natural fault isolation — a failure in one shard does not affect others
  • Consensus-based replication: For high-availability within a single workload type, implementing a consensus protocol (such as Raft) would allow the queue manager state to be replicated in-memory across multiple nodes with automatic leader election and failover. This approach preserves the low-latency properties of in-memory state while providing resilience against individual node failures. Multi-region replication is also under consideration for geographic redundancy

Note: Formal throughput benchmarks (workflows/sec, jobs/sec) with detailed methodology are in progress and will be published on our GitHub.

9. Observability

Production workflow orchestration requires comprehensive observability into job execution, resource utilization, and failure patterns. The system implements a custom observability pipeline rather than relying on off-the-shelf APM solutions, driven by the high volume of jobs and workflows that would be cost-prohibitive with commercial per-event pricing.

Metrics Pipeline

Workers emit execution metrics directly into a ClickHouse instance, providing low-latency analytical queries over high-cardinality data. The administrative panel and scheduling service consume these metrics for real-time dashboards and historical analysis. Key metrics include job execution duration, resource utilization per worker, queue depth over time, and rate limit headroom per virtual resource pool.

Tracing and Correlation

All jobs carry their workflow ID, and individual job IDs can be attached to downstream LLM API requests. This enables end-to-end tracing from workflow submission through individual job execution to specific LLM provider calls. Job outputs, LLM requests, and LLM responses are logged into a centralized log system for post-hoc investigation and debugging.

A dedicated technical paper on our observability architecture and techniques is forthcoming.

10. Key Insights

  1. Separate virtual from physical resources: Quotas are global business constraints; workers are local compute capacity. Conflating them leads to either under-utilization or rate limit violations
  2. Multi-heap by state: Maintaining separate priority queues for each job state (ready, waiting-virtual, waiting-physical) provides O(1) access to the highest-priority actionable job
  3. Event-driven scheduling eliminates idle time: When resources become available, the next job begins immediately rather than waiting for a polling interval
  4. LLM-friendly formats matter: GitHub Actions YAML enables reliable AI authoring because LLMs have extensive training data for this format
  5. Centralized rate limit enforcement: Distributed workers cannot coordinate rate limits peer-to-peer without introducing the very coordination overhead the system is designed to avoid
  6. Worker labels enable heterogeneity: Route specialized workloads (GPU inference, high-memory aggregation) to capable workers without over-provisioning
  7. Ephemeral job design simplifies recovery: When jobs don't maintain local state, failure handling reduces to resubmission rather than complex checkpoint-and-resume

11. When to Use This Approach

Suitable Use Cases

  • LLM workflows with external API rate limit constraints
  • Complex job dependencies (DAG workflows)
  • Multi-tenant resource quotas
  • Low-latency scheduling requirements where polling overhead is unacceptable
  • AI-generated workflow definitions

When Other Solutions May Be Better

  • Simple queue-based processing without dependencies
  • Long-running workflows with strict durability and exactly-once requirements
  • Human-authored workflows with visual editing preference

References

  1. GitHub Actions. "Workflow syntax for GitHub Actions" - https://docs.github.com/en/actions/
  2. Kubernetes. "Scheduling" - https://kubernetes.io/docs/concepts/scheduling-eviction/
  3. Cormen et al. "Introduction to Algorithms" - Chapter 22: Graph Algorithms
  4. Dean & Barroso. "The Tail at Scale" - Communications of the ACM, 2013
  5. Ongaro, D. & Ousterhout, J. "In Search of an Understandable Consensus Algorithm (Raft)" - USENIX ATC, 2014