
Deployment Architecture

Understanding Dagster’s architecture is essential for designing robust production deployments. This guide covers the core components, their interactions, and configuration strategies.

System Components

A Dagster deployment consists of several interconnected components:
┌─────────────────────────────────────────────────────────────┐
│                      Dagster Deployment                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐      ┌──────────────┐                   │
│  │   Webserver  │◄────►│  PostgreSQL  │                   │
│  │   (GraphQL)  │      │   Storage    │                   │
│  └──────┬───────┘      └──────▲───────┘                   │
│         │                     │                            │
│         │                     │                            │
│  ┌──────▼───────┐      ┌──────┴───────┐                   │
│  │    Daemon    │◄────►│  Run Queue   │                   │
│  │  (Schedules, │      │              │                   │
│  │   Sensors)   │      │              │                   │
│  └──────┬───────┘      └──────────────┘                   │
│         │                                                  │
│         │  Launches                                        │
│         ▼                                                  │
│  ┌──────────────┐                                         │
│  │  Run Worker  │                                         │
│  │  (Container/ │                                         │
│  │     Pod)     │                                         │
│  └──────┬───────┘                                         │
│         │                                                  │
│         │  Loads code via gRPC                            │
│         ▼                                                  │
│  ┌──────────────┐                                         │
│  │  Code Server │                                         │
│  │    (gRPC)    │                                         │
│  └──────────────┘                                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Long-Running Services

Three services must run continuously in a Dagster deployment:

1. Dagster Webserver

Purpose: Serves the UI and GraphQL API
Scaling: Can run multiple replicas for high availability
Configuration:
dagster-webserver \
  --host 0.0.0.0 \
  --port 3000 \
  --workspace workspace.yaml \
  --path-prefix /dagster
Health check endpoint: GET /server_info
The webserver is stateless and can be scaled horizontally. Place a load balancer in front of multiple replicas.
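Because the replicas share no state, the load balancer's job is simple. A minimal health-aware round-robin sketch (illustrative Python, not part of Dagster — in production this role is played by nginx, an ALB, or a Kubernetes Service):

```python
class RoundRobinBalancer:
    """Rotate across webserver replicas, skipping unhealthy ones."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._i = 0

    def pick(self, healthy):
        # `healthy` is a predicate; in production it would be a
        # GET against each replica's /server_info endpoint
        for _ in range(len(self.replicas)):
            candidate = self.replicas[self._i]
            self._i = (self._i + 1) % len(self.replicas)
            if healthy(candidate):
                return candidate
        return None
```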

2. Dagster Daemon

Purpose: Manages schedules, sensors, run queuing, and backfills
Scaling: Only one daemon instance per deployment
Configuration:
dagster-daemon run
Running multiple daemon instances will cause duplicate schedule executions and sensor evaluations. Always run exactly one daemon.
Daemon responsibilities:
  • Evaluating schedules on their cron schedule
  • Evaluating sensors at their polling interval
  • Dequeuing runs from the run coordinator
  • Launching runs via the run launcher
  • Monitoring run health
  • Processing backfills

3. Code Location Servers

Purpose: Expose Dagster definitions via gRPC
Scaling: Each code location runs one replica
Configuration:
dagster api grpc \
  --host 0.0.0.0 \
  --port 4000 \
  --module-name my_dagster_project
Multiple code locations:
workspace.yaml
load_from:
  - grpc_server:
      host: analytics-code-server
      port: 4000
      location_name: "analytics"
  
  - grpc_server:
      host: ml-code-server
      port: 4000
      location_name: "machine_learning"
  
  - grpc_server:
      host: etl-code-server
      port: 4000
      location_name: "etl_pipelines"
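As the number of code locations grows, it can help to generate workspace.yaml from a single list. A hypothetical helper (the tuple fields mirror the grpc_server entries above):

```python
def render_workspace(locations):
    """Render a workspace.yaml body from (host, port, name) tuples.

    An illustrative generator for keeping many grpc_server entries in
    sync; the field names mirror the workspace.yaml schema.
    """
    lines = ["load_from:"]
    for host, port, name in locations:
        lines.append("  - grpc_server:")
        lines.append(f"      host: {host}")
        lines.append(f"      port: {port}")
        lines.append(f'      location_name: "{name}"')
    return "\n".join(lines) + "\n"
```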

Instance Configuration

The dagster.yaml file configures deployment-wide behavior. All services (webserver, daemon, code servers) must have access to this file.

Storage Configuration

dagster.yaml
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      hostname: postgres.example.com
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      hostname: postgres.example.com
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432

event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      hostname: postgres.example.com
      username:
        env: DAGSTER_POSTGRES_USER
      password:
        env: DAGSTER_POSTGRES_PASSWORD
      db_name:
        env: DAGSTER_POSTGRES_DB
      port: 5432
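The env: references above are substituted from the process environment when the instance loads. A simplified sketch of that substitution (Dagster's real config parser does more validation; this only illustrates the mechanics):

```python
import os


def resolve_env_refs(node):
    """Recursively replace {"env": "NAME"} mappings with os.environ values.

    Models how `env:` references in dagster.yaml resolve at load time.
    """
    if isinstance(node, dict):
        if set(node) == {"env"}:
            return os.environ[node["env"]]
        return {k: resolve_env_refs(v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_env_refs(v) for v in node]
    return node
```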

Run Coordinator Configuration

The run coordinator manages run queuing and concurrency:
dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    # Maximum runs executing simultaneously
    max_concurrent_runs: 10
    
    # Tag-based concurrency limits
    tag_concurrency_limits:
      # Limit runs with specific tags
      - key: "database"
        value: "postgres"
        limit: 3
      
      - key: "priority"
        value: "high"
        limit: 5
      
      # Apply limit to any value of this key
      - key: "team"
        limit: 2
    
    # Dequeue runs sequentially in FIFO order (true enables a thread pool)
    dequeue_use_threads: false
    
    # Number of seconds between dequeue attempts
    dequeue_interval_seconds: 5
Tag concurrency limits are useful for preventing resource contention. For example, limit concurrent runs accessing the same database to avoid overwhelming it.
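A simplified sketch of how tag_concurrency_limits gate dequeuing (illustrative only, not the actual QueuedRunCoordinator code):

```python
def tag_limits_satisfied(run_tags, active_runs_tags, limits):
    """Check whether a run's tags fit within tag_concurrency_limits.

    `limits` entries mirror the config above: each has a `key`, an
    optional `value`, and a `limit`. A rule without `value` caps the
    total across all values of that key.
    """
    for rule in limits:
        key, value, limit = rule["key"], rule.get("value"), rule["limit"]
        if key not in run_tags:
            continue
        if value is not None and run_tags[key] != value:
            continue
        in_flight = sum(
            1 for tags in active_runs_tags
            if key in tags and (value is None or tags[key] == value)
        )
        if in_flight >= limit:
            return False
    return True
```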

Run Launcher Configuration

The run launcher determines how run workers are created:
dagster.yaml
run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
  config:
    # Docker network for containers
    network: dagster_network
    
    # Environment variables to pass
    env_vars:
      - DAGSTER_POSTGRES_USER
      - DAGSTER_POSTGRES_PASSWORD
      - DAGSTER_POSTGRES_DB
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
    
    # Container configuration
    container_kwargs:
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
        - /tmp/dagster:/tmp/dagster
      
      # Resource limits
      mem_limit: "4g"
      cpu_quota: 200000  # 2 CPUs
      cpu_period: 100000
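Docker's CFS settings express CPU limits as a quota over a scheduling period, so cpu_quota / cpu_period gives the effective CPU count:

```python
def cpus_from_quota(cpu_quota, cpu_period):
    """CPU count implied by Docker's CFS cpu_quota/cpu_period settings."""
    return cpu_quota / cpu_period


def quota_for_cpus(cpus, cpu_period=100000):
    """Inverse: the cpu_quota needed for a target CPU count."""
    return int(cpus * cpu_period)
```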

Scheduler Configuration

dagster.yaml
scheduler:
  module: dagster.core.scheduler
  class: DagsterDaemonScheduler
The DagsterDaemonScheduler is the only supported scheduler. Legacy schedulers have been deprecated.

Compute Log Storage

Compute logs (stdout/stderr from runs) can be stored in cloud storage:
dagster.yaml
compute_logs:
  module: dagster_aws.s3
  class: S3ComputeLogManager
  config:
    bucket: "my-dagster-logs"
    prefix: "dagster/logs"
    region: "us-west-2"
    
    # Use SSL for connections to S3
    use_ssl: true
    
    # Local directory for buffering
    local_dir: "/tmp/dagster/logs"
    
    # Upload logs periodically during execution
    upload_interval: 30
    
    # Show URL only (don't download in UI)
    show_url_only: false
Without a compute log manager, run logs will only be available in the container/pod and will be lost when it terminates.
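The upload_interval setting trades log durability against S3 request volume. A toy model of the buffer-and-flush behavior (`upload` and `clock` are injected stand-ins for an S3 put and a monotonic clock):

```python
class BufferedLogUploader:
    """Buffer log lines locally and flush every `upload_interval` seconds."""

    def __init__(self, upload, clock, upload_interval=30):
        self.upload = upload
        self.clock = clock
        self.upload_interval = upload_interval
        self.buffer = []
        self.last_flush = clock()

    def write(self, line):
        self.buffer.append(line)
        if self.clock() - self.last_flush >= self.upload_interval:
            self.flush()

    def flush(self):
        # Ship the buffered lines (in production, an S3 put) and reset
        if self.buffer:
            self.upload(list(self.buffer))
            self.buffer.clear()
        self.last_flush = self.clock()
```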

Job Execution Flow

Understanding the execution flow helps troubleshoot issues:
1. User initiates run

Via UI, CLI, schedule, or sensor. The request goes to the webserver.
2. Run coordinator queues run

The webserver calls the run coordinator, which:
  • Checks concurrency limits
  • Assigns run to queue
  • Returns immediately (non-blocking)
# Run coordinator logic (simplified)
if current_runs < max_concurrent_runs:
    if tag_limits_satisfied(run):
        queue.enqueue(run)
3. Daemon dequeues run

The daemon polls the queue every few seconds:
  • Dequeues runs respecting concurrency limits
  • Calls the run launcher for each dequeued run
# Daemon loop (simplified)
while True:
    runs = queue.dequeue_respecting_limits()
    for run in runs:
        run_launcher.launch_run(run)
    sleep(dequeue_interval_seconds)
4. Run launcher creates worker

The run launcher:
  • Creates a new Docker container or Kubernetes pod
  • Injects run ID and configuration
  • Starts the dagster process
# Example run worker command
dagster api execute_run \
  --run-id abc123 \
  --instance-ref {...}
5. Run worker executes job

The worker:
  • Loads the code location
  • Traverses the job graph
  • Executes each op/asset
  • Writes events to storage
  • Streams logs to compute log manager
6. Run completes

The worker:
  • Marks run as success or failure
  • Uploads final logs
  • Terminates
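Steps 2 through 4 can be condensed into one dequeue pass (a simplified model of the daemon's behavior, not Dagster internals):

```python
from collections import deque


def run_dispatch_cycle(queue, active, max_concurrent_runs, launch):
    """One daemon dequeue pass: launch queued runs up to the limit.

    `queue` holds run IDs in FIFO order, `active` is the set of
    in-flight runs, and `launch` stands in for the run launcher.
    """
    launched = []
    while queue and len(active) < max_concurrent_runs:
        run_id = queue.popleft()
        active.add(run_id)
        launch(run_id)
        launched.append(run_id)
    return launched
```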

Production Architecture Patterns

Pattern 1: Single Cluster

All components in one Kubernetes cluster:
Components:
  - Webserver (3 replicas)
  - Daemon (1 replica)
  - Code Servers (N deployments)
  - PostgreSQL (RDS/CloudSQL)
  - Run Workers (ephemeral pods)

Pros:
  - Simple to manage
  - Low latency between components
  - Cost effective

Cons:
  - Runs compete with services for resources
  - Single point of failure

Pattern 2: Separate Run Cluster

Services and runs in different clusters:
Services Cluster:
  - Webserver (3 replicas)
  - Daemon (1 replica)
  - Code Servers (N deployments)

Run Cluster:
  - Run Workers (ephemeral pods)
  - Isolated resources
  - Can scale independently

Shared:
  - PostgreSQL (RDS/CloudSQL)
  - S3/GCS for logs and artifacts

Pros:
  - Run workloads don't affect services
  - Better resource isolation
  - Independent scaling

Cons:
  - More complex networking
  - Higher cost
  - Cross-cluster IAM/RBAC

Pattern 3: Multi-Region

Services in multiple regions:
Region 1 (Primary):
  - Webserver (3 replicas)
  - Daemon (1 replica)
  - Code Servers
  - PostgreSQL (primary)

Region 2 (DR):
  - Webserver (3 replicas, standby)
  - Code Servers
  - PostgreSQL (replica)

Pros:
  - High availability
  - Disaster recovery
  - Geographic distribution

Cons:
  - Complex setup
  - Database replication lag
  - Highest cost

Monitoring and Observability

Health Checks

# Kubernetes probes
livenessProbe:
  httpGet:
    path: /server_info
    port: 3000
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /server_info
    port: 3000
  periodSeconds: 10
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /server_info
    port: 3000
  periodSeconds: 10
  failureThreshold: 30
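Under these probe settings, Kubernetes acts only after failureThreshold consecutive failed checks (e.g. three liveness failures 30 seconds apart, roughly 90 seconds). A small model of that counting:

```python
class ProbeTracker:
    """Track consecutive probe failures against a failureThreshold."""

    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, ok):
        # Any success resets the streak; returns True while still healthy
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures < self.failure_threshold
```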

Metrics

Dagster exposes metrics via the event log:
import dagster as dg

@dg.asset
def monitored_asset(context: dg.AssetExecutionContext) -> dg.MaterializeResult:
    # Attach metrics as structured metadata on the materialization event
    return dg.MaterializeResult(
        metadata={
            "rows_processed": dg.MetadataValue.int(1000),
            "execution_time_seconds": dg.MetadataValue.float(45.2),
            "data_quality_score": dg.MetadataValue.float(0.95),
        }
    )

Alerting

Configure sensors to alert on failures:
import os

import dagster as dg
from slack_sdk import WebClient  # assumes the slack_sdk package is installed

slack_client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

@dg.run_failure_sensor
def alert_on_failure(context: dg.RunFailureSensorContext):
    # Send an alert to Slack; swap in PagerDuty, email, etc. as needed
    slack_client.chat_postMessage(
        channel="#data-alerts",
        text=f"Run failed: {context.dagster_run.job_name}",
    )

Security Considerations

Network Security

Kubernetes NetworkPolicy (pod labels here are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: code-server-ingress
  namespace: dagster
spec:
  # Selecting the code-server pods denies all other ingress by default
  podSelector:
    matchLabels:
      app: code-server
  policyTypes:
    - Ingress
  ingress:
    # Allow only the webserver and daemon to reach code servers on 4000
    - from:
        - podSelector:
            matchLabels:
              app: webserver
        - podSelector:
            matchLabels:
              app: daemon
      ports:
        - protocol: TCP
          port: 4000

Secrets Management

import dagster as dg
from dagster_aws.secretsmanager import secretsmanager_resource

@dg.asset(required_resource_keys={"secretsmanager"})
def asset_with_secrets(context: dg.AssetExecutionContext):
    # The resource is a boto3 Secrets Manager client; bind
    # secretsmanager_resource under the "secretsmanager" key in Definitions
    api_key = context.resources.secretsmanager.get_secret_value(
        SecretId="my-api-key"
    )["SecretString"]

RBAC

# Kubernetes service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dagster-run
  namespace: dagster
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dagster-run-role
  namespace: dagster
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "get", "list", "delete"]

Next Steps