Managing Complexity in Actor Systems

Introduction

This article describes practical approaches to structuring large actor systems, based on patterns that work well in real enterprise ecosystems. The Actor Model is one of the most powerful abstractions for building concurrent and distributed systems. Actors encapsulate state, communicate through messages, and eliminate most shared-state concurrency problems. However, real systems quickly grow beyond the simplicity of the initial design.

Problem Statement

The Actor Model solves many problems: each actor has a single responsibility, state is isolated, communication is asynchronous, and concurrency is handled by the runtime.

However, it introduces a different challenge: managing system complexity and organization.

Actor Proliferation - over time, the system accumulates many actors with complex relationships between them
Message Flow Complexity - individual messages are simple, but message chains can become heavy, and behavior emerges from interactions between actors rather than from a single piece of code
Lifecycle Management - actors must be created, monitored, and eventually terminated; the system must guard against orphan actors, resource leaks, and uncontrolled actor growth
Debugging and Observability - instead of reading stack traces, developers must understand message flows, actor interactions, and distributed execution paths

Without proper instrumentation, diagnosing problems becomes extremely difficult.

Options and Approaches

Several strategies can be used to manage complexity in actor systems. Each has strengths and tradeoffs.

Domain Partitioning

One approach is organizing actors by business domain or bounded context.

Examples:

User actors manage authentication and profiles
Order actors manage order lifecycle
Inventory actors track stock levels

This mirrors Domain-Driven Design and keeps responsibilities separated.
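
As a rough illustration of domain partitioning, the sketch below gives each bounded context its own actor and mailbox, with a router that only knows domain names. It uses plain `asyncio` rather than any actor framework, and the `Message` fields, the `DomainActor` class, and the domain names are assumptions made for this example.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    domain: str    # hypothetical routing key, e.g. "orders"
    kind: str      # e.g. "create"
    payload: dict


class DomainActor:
    """One mailbox per bounded context; state never leaves the actor."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self.handled: list = []        # private state, only touched in run()

    async def run(self) -> None:
        while True:
            msg = await self.mailbox.get()
            if msg is None:            # sentinel: stop the actor
                break
            self.handled.append(msg.kind)


async def demo() -> dict:
    actors = {name: DomainActor(name) for name in ("users", "orders", "inventory")}
    tasks = [asyncio.create_task(a.run()) for a in actors.values()]

    # The router only knows domain names, not what happens inside a domain.
    for msg in (
        Message("users", "login", {"user": "alice"}),
        Message("orders", "create", {"sku": "A-1"}),
        Message("inventory", "reserve", {"sku": "A-1"}),
        Message("orders", "cancel", {"sku": "A-1"}),
    ):
        await actors[msg.domain].mailbox.put(msg)

    for a in actors.values():
        await a.mailbox.put(None)
    await asyncio.gather(*tasks)
    return {name: a.handled for name, a in actors.items()}
```

Calling `asyncio.run(demo())` returns the message kinds each domain handled. Cross-domain traffic is visible in one place (the routing loop), which is the main benefit this layout is meant to show.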

Pros

aligns with business boundaries
reduces cross-domain dependencies
easier for teams to reason about ownership

Cons

does not control actor growth inside a domain
message flow between domains can still become complex
lacks clear execution structure for workflows

Domain partitioning is useful but typically not sufficient on its own.

Hierarchical Actor Supervision

Another approach organizes actors into hierarchical trees using supervisors.

Supervisors create and manage groups of actors and define failure handling strategies.

Typical supervision strategies include:

restart actors on failure
resume processing
stop the actor
escalate failures to parent supervisors
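
The four strategies above can be sketched as a small supervisor loop. This is a simplified, single-child illustration in plain `asyncio`, not the API of any particular actor framework; the `Directive` enum, the `decide` policy, and the `Counter` child are hypothetical names chosen for the example.

```python
import asyncio
from enum import Enum


class Directive(Enum):
    RESTART = "restart"    # discard state, start a fresh instance
    RESUME = "resume"      # keep state, continue with the next message
    STOP = "stop"          # terminate the child permanently
    ESCALATE = "escalate"  # let the parent supervisor decide


def decide(error: Exception) -> Directive:
    """Hypothetical policy mapping failure types to directives."""
    if isinstance(error, ValueError):
        return Directive.RESUME      # bad message, state is still valid
    if isinstance(error, ConnectionError):
        return Directive.RESTART     # state may be stale, start fresh
    if isinstance(error, PermissionError):
        return Directive.STOP
    return Directive.ESCALATE


class Counter:
    """Child behavior: sums integers, fails on anything else."""

    def __init__(self) -> None:
        self.total = 0

    def receive(self, msg) -> None:
        if isinstance(msg, Exception):
            raise msg                # simulate an infrastructure failure
        self.total += int(msg)       # int("oops") raises ValueError


async def supervise(mailbox: asyncio.Queue, log: list) -> int:
    child = Counter()
    while True:
        msg = await mailbox.get()
        if msg is None:              # sentinel: normal shutdown
            return child.total
        try:
            child.receive(msg)
        except Exception as err:
            directive = decide(err)
            log.append(directive)
            if directive is Directive.RESTART:
                child = Counter()
            elif directive is Directive.STOP:
                return child.total
            elif directive is Directive.ESCALATE:
                raise
            # RESUME: keep the current child and its state


async def demo():
    mailbox: asyncio.Queue = asyncio.Queue()
    log: list = []
    for msg in (1, 2, "oops", 4, ConnectionError("db down"), 5, None):
        mailbox.put_nowait(msg)
    total = await supervise(mailbox, log)
    return total, log
```

In this run the malformed message is skipped with the running total preserved, while the connection failure replaces the child, so only messages after the restart count toward the final total.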

Pros

provides structured lifecycle management
isolates failures
simplifies resource cleanup

Cons

hierarchy alone does not model business workflows
message flow across the hierarchy can still become complex

Supervision hierarchies help with reliability but do not fully address system organization.

Workflow-Oriented Actors (FSM Pattern)

Another common pattern is implementing business logic using Finite State Machines (FSMs).

Each workflow is represented by an actor that moves through states as it processes a request.

Example:

Receive Request → Validate → Fetch Data → Process → Respond
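
A minimal sketch of this flow as an explicit state machine might look like the following. The state names mirror the example above; the `RequestWorkflow` class, the `user_id` field, and the hard-coded "fetched" data are assumptions standing in for real validation rules and real messages to other actors.

```python
from enum import Enum, auto


class State(Enum):
    RECEIVED = auto()
    VALIDATED = auto()
    FETCHED = auto()
    PROCESSED = auto()
    RESPONDED = auto()
    FAILED = auto()


class RequestWorkflow:
    """One instance per request; `state` changes only inside `step()`."""

    def __init__(self, request: dict) -> None:
        self.request = request
        self.state = State.RECEIVED
        self.data = None
        self.response = None
        self.history = [State.RECEIVED]

    def step(self) -> None:
        if self.state is State.RECEIVED:
            ok = "user_id" in self.request
            self._goto(State.VALIDATED if ok else State.FAILED)
        elif self.state is State.VALIDATED:
            # Stand-in for a message to a data-owning actor.
            self.data = {"user_id": self.request["user_id"], "visits": 3}
            self._goto(State.FETCHED)
        elif self.state is State.FETCHED:
            self.response = f"user {self.data['user_id']}: {self.data['visits']} visits"
            self._goto(State.PROCESSED)
        elif self.state is State.PROCESSED:
            self._goto(State.RESPONDED)

    def _goto(self, new_state: State) -> None:
        self.state = new_state
        self.history.append(new_state)

    def run(self) -> State:
        while self.state not in (State.RESPONDED, State.FAILED):
            self.step()
        return self.state
```

Because every transition goes through `_goto`, the `history` list records the exact path a request took, which is what makes long-running workflows easier to debug.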

Pros

explicit workflow representation
predictable execution paths
easier debugging of long-running processes

Cons

many FSM instances may run concurrently
resource usage can grow quickly
FSMs still need coordination and lifecycle management

FSM actors work well for workflows but must be combined with other structural patterns.

Overall Strategy

In practice, the most effective architecture combines multiple approaches.

A hybrid strategy typically includes:

domain partitioning to separate responsibilities
hierarchical supervision to manage lifecycle and failures
workflow actors (FSMs) to represent business logic
resource actors to control access to shared infrastructure

This combination creates a predictable system structure while preserving actor flexibility.

The resulting architecture looks like this:

System Supervisors
      ↓
Domain Supervisors
      ↓
Actor Pools
      ↓
FSM Instances
      ↓
Actor Steps

Key properties of this architecture include:

Controlled Concurrency

Actor pools define how many workflows can run simultaneously.

This allows explicit control over system capacity.
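
One way to picture a pool with a fixed capacity is a semaphore that guards workflow execution. The sketch below is an assumption-laden simplification: `WorkflowPool`, `handle_request`, and the pool size of 3 are illustrative, and a real actor runtime would typically offer its own pool or router abstraction.

```python
import asyncio


class WorkflowPool:
    """Runs at most `size` workflows at once; the rest wait their turn."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)
        self.active = 0
        self.peak = 0    # highest concurrency observed, for inspection

    async def submit(self, workflow, *args):
        async with self._slots:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await workflow(*args)
            finally:
                self.active -= 1


async def handle_request(n: int) -> int:
    await asyncio.sleep(0.01)   # stand-in for real workflow steps
    return n * 2


async def demo():
    pool = WorkflowPool(size=3)
    results = await asyncio.gather(
        *(pool.submit(handle_request, n) for n in range(10))
    )
    return results, pool.peak
```

Ten requests are submitted at once, but `pool.peak` never exceeds the configured size, so capacity becomes a single tunable number rather than a side effect of load.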

Resource Consolidation

Shared infrastructure such as databases or caches should be represented by dedicated resource actors.

Instead of each workflow creating its own connection, FSM actors send messages to resource actors.

This enables:

connection pooling
request queuing
predictable load management
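
The request-reply interaction between a workflow and a resource actor could be sketched as follows. The `DatabaseActor` class, its `ask` helper, and the in-memory dictionary are hypothetical stand-ins; a real resource actor would wrap an actual connection or connection pool.

```python
import asyncio


class DatabaseActor:
    """Owns the only 'connection'; workflows reach it through messages."""

    def __init__(self) -> None:
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self._rows = {1: "alice", 2: "bob"}   # stand-in for a real database
        self.queries_served = 0

    async def run(self) -> None:
        while True:
            item = await self.mailbox.get()
            if item is None:                  # sentinel: stop the actor
                break
            key, reply = item
            self.queries_served += 1
            reply.set_result(self._rows.get(key))   # answer the sender

    async def ask(self, key: int):
        """Request-reply helper: enqueue a query and await the answer."""
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((key, reply))
        return await reply


async def demo():
    db = DatabaseActor()
    runner = asyncio.create_task(db.run())
    names = await asyncio.gather(*(db.ask(k) for k in (1, 2, 3, 1)))
    await db.mailbox.put(None)
    await runner
    return names, db.queries_served
```

All four concurrent queries are serialized through one mailbox, so the actor is the single place where queuing, pooling, and load limits can be applied.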

Elastic Workflow Scaling

FSM actors can be created dynamically when requests arrive and terminated when processing completes.

This allows the system to scale with workload while keeping persistent infrastructure actors always available.
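
A small sketch of this spawn-and-terminate lifecycle, assuming each workflow is an `asyncio` task tracked by its parent, is shown below. `WorkflowSupervisor`, the `req-N` identifiers, and the sleeping `workflow` coroutine are illustrative names, not part of any framework.

```python
import asyncio


class WorkflowSupervisor:
    """Spawns one short-lived workflow task per request and tracks it."""

    def __init__(self) -> None:
        self.running: dict = {}     # request_id -> asyncio.Task
        self.completed = 0

    def spawn(self, request_id: str, coro) -> asyncio.Task:
        task = asyncio.create_task(coro, name=request_id)
        self.running[request_id] = task
        task.add_done_callback(lambda t: self._on_done(request_id))
        return task

    def _on_done(self, request_id: str) -> None:
        # Drop the reference so finished workflows do not accumulate.
        self.running.pop(request_id, None)
        self.completed += 1


async def workflow(n: int) -> int:
    await asyncio.sleep(0.01 * n)   # stand-in for real workflow steps
    return n


async def demo():
    sup = WorkflowSupervisor()
    tasks = [sup.spawn(f"req-{n}", workflow(n)) for n in range(1, 4)]
    in_flight = len(sup.running)
    await asyncio.gather(*tasks)
    await asyncio.sleep(0)          # give any pending done-callbacks a turn
    return in_flight, len(sup.running), sup.completed
```

The registry grows while requests are in flight and shrinks back to empty afterwards, which is the property that guards against the orphan actors and uncontrolled growth described earlier.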

Local Resource Actors

Some operations should run through limited actor pools to prevent resource exhaustion.

Examples include:

database access
external API calls
CPU-intensive tasks

By routing these operations through small pools of actors, the system gains natural back-pressure and queueing behavior.
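
The back-pressure effect can be demonstrated with a bounded mailbox shared by a small worker pool. This is a sketch under assumptions: the pool size, queue capacity, and the sleeping `worker` are arbitrary values chosen for illustration.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    while True:
        job = await queue.get()
        if job is None:              # sentinel: stop this worker
            break
        await asyncio.sleep(0.005)   # stand-in for a slow API call
        results.append(job)


async def demo(pool_size: int = 2, capacity: int = 4, jobs: int = 20):
    queue: asyncio.Queue = asyncio.Queue(maxsize=capacity)
    results: list = []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(pool_size)
    ]

    max_depth = 0
    for job in range(jobs):
        # put() suspends the producer while the queue is full: back-pressure.
        await queue.put(job)
        max_depth = max(max_depth, queue.qsize())

    for _ in workers:
        await queue.put(None)        # one sentinel per worker
    await asyncio.gather(*workers)
    return sorted(results), max_depth
```

The producer never gets more than `capacity` jobs ahead of the workers, so a slow downstream resource slows callers down instead of letting an unbounded backlog build up.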

System Architecture Diagram

The following diagram illustrates a possible actor hierarchy.

  %%{init: {'theme':'dark','themeVariables':{
  'primaryColor':'#2563eb',
  'primaryTextColor':'#fff',
  'lineColor':'#6b7280'
}}}%%
graph TB
    subgraph "Supervisor Level"
        S1[System Supervisor]
        S2[Domain Supervisor]
        S3[Resource Supervisor]
    end

    subgraph "Actor Pools"
        P1[Users Pool]
        P2[Dashboards Pool]
        P3[Database Pool]
        P4[Mem Cache Pool]
    end

    subgraph "FSM Instances"
        F1[FSM Instance 1]
        F2[FSM Instance 2]
        F3[FSM Instance N]
    end

    subgraph "FSM Steps"
        FS1[FSM Step 1]
        FS2[FSM Step 2]
        FS3[FSM Step N]
    end

    subgraph "Resource Actors"
        RA1[DB CRUD Actor]
        RA2[Cache Actor]
    end

    S1 --> S2
    S1 --> S3
    S2 --> P1
    S2 --> P2
    S3 --> P3
    S3 --> P4

    P1 --> F1
    P2 --> F2
    P2 --> F3

    F1 --> FS1
    F1 --> FS2
    F2 --> FS2
    F2 --> FS3

    F1 -->|Request| RA1
    F2 -->|Request| RA2
    F3 -->|Request| RA1
    F3 -->|Request| RA2

This structure provides:

pool-based scaling for workflows
shared infrastructure actors
dynamic FSM lifecycle
centralized supervision

Observability and Debugging

Actor systems require strong observability from the beginning.

One useful technique is to attach correlation identifiers to messages.

Each incoming request receives a unique identifier (for example a UUID v4). This identifier is propagated through all actor messages.

Benefits include:

end-to-end request tracing
correlation of distributed logs
performance analysis across actor chains

This approach is similar to distributed tracing systems used in microservices.
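
A minimal sketch of correlation-id propagation follows, assuming messages travel in an envelope that carries the identifier. The `Envelope` class, the `validator` and `processor` steps, and the in-memory `LOG` are hypothetical; a production system would write to a structured logger or tracing backend instead.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """Message wrapper; `correlation_id` is set once at the system edge."""
    payload: str
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def child(self, payload: str) -> "Envelope":
        # Follow-up messages inherit the id instead of generating a new one.
        return Envelope(payload, self.correlation_id)


LOG: list = []   # entries are (correlation_id, actor, event)


def validator(msg: Envelope) -> Envelope:
    LOG.append((msg.correlation_id, "validator", "ok"))
    return msg.child(msg.payload.strip())


def processor(msg: Envelope) -> Envelope:
    LOG.append((msg.correlation_id, "processor", "done"))
    return msg.child(msg.payload.upper())


def trace(correlation_id: str) -> list:
    """Reassemble one request's path from the shared log."""
    return [actor for cid, actor, _ in LOG if cid == correlation_id]
```

For example, after `result = processor(validator(Envelope("  hello ")))`, calling `trace(result.correlation_id)` lists the actors that request passed through, even when log entries from other requests are interleaved.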

Conclusion

The Actor Model gives engineers a high degree of control.

Instead of relying on layers of infrastructure to manage concurrency, failure handling, and distributed coordination, the architecture itself becomes explicit in the code. Engineers control how actors communicate, how failures propagate, how resources are shared, and how the system scales. With the right structure, an actor system becomes predictable, observable, and highly resilient.

Some of the most reliable systems ever built rely on actor-based designs.

Erlang, one of the earliest actor-oriented platforms, was designed specifically for highly available telecom infrastructure. Systems built with Erlang and its OTP framework powered large-scale telecom switches such as the Ericsson AXD301. In production deployments, the platform achieved reported availability figures approaching “nine nines” (99.9999999%), which corresponds to roughly 30 milliseconds of downtime per year under measured conditions. (Stack Overflow)

While such numbers depend on the overall system design and operational environment, the architectural principles behind them are clear:

failure isolation
message-based communication
hierarchical supervision
automatic failure recovery

Another often-overlooked advantage is architectural independence from heavy platform infrastructure and cloud lock-in. Many capabilities commonly delegated to cloud platforms can be implemented directly inside an actor system:

distributed message routing
service discovery
job orchestration
failure recovery
load distribution

In other words, the Actor Model does not just simplify concurrency. It provides a foundation for building self-managing distributed systems.

The key takeaway is simple: the Actor Model is not just a concurrency pattern. It is an architectural toolkit that gives engineers direct control over reliability, scalability, and system behavior.

After party

I hope this post helps you navigate the complexity of the Actor Model and encourages you to try it! I’d love to hear your feedback and improve the post further.

Thank you!

P.S.: Good old human paranoia never fails.