Appendix A: Installing and Configuring Skylar AI

Installing and configuring Skylar AI and Skylar Advisor uses Harbor for the registry and Helm for deployment. The Skylar AI platform is deployed as a single Helm umbrella chart containing more than 20 microservices, databases, and supporting infrastructure components.

This chapter covers how to deploy Skylar AI on a Kubernetes infrastructure using Harbor, Helm, and the user interfaces for Skylar AI and Skylar Advisor.

Prerequisites for Installing Skylar AI

This section describes the required GPU layout and Kubernetes configuration for deploying Skylar Advisor on an NVIDIA GPU–enabled Kubernetes cluster. It assumes basic familiarity with Kubernetes, Helm, and ConfigMaps.

GPU Cluster Requirements and Configuration

GPU Sizing Guide

The following GPU requirements are based on the size of your Skylar AI deployment:

  • Small deployments: 4 NVIDIA RTX 6000 GPUs
  • Medium deployments: 4 NVIDIA H200 GPUs
  • Large deployments: 8 NVIDIA H200 GPUs

NVIDIA GPU Operator

The NVIDIA GPU Operator manages the complete life cycle of GPU enablement on Kubernetes nodes, including NVIDIA drivers, container runtime integration, device plug-ins, GPU feature discovery, and monitoring components. Using the Operator eliminates the need to manually configure GPU nodes and ensures that GPUs are consistently exposed to Kubernetes workloads.

For more information, see the NVIDIA GPU Operator documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/.

GPU Operator Deployment Summary

The following table lists the deployment details for the GPU Operator:

Item                        Value
----                        -----
Helm chart                  gpu-operator
Helm repository             https://nvidia.github.io/gpu-operator
Namespace                   gpu-operator
Container runtime           containerd
NVIDIA driver version       570.195.03
MIG strategy                mixed
MIG configuration           External ConfigMap
Time-slicing configuration  External ConfigMap
DCGM exporter               Enabled
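The table can be translated into an install command. The following is a sketch only; confirm the chart values (`driver.version`, `mig.strategy`) against the GPU Operator documentation for your release before running it:

```shell
# Sketch: install the GPU Operator with the values from the table above.
# The repository alias "nvidia" is an arbitrary local name.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.version=570.195.03 \
  --set mig.strategy=mixed
```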

GPU Sharing Model: MIG + Time-slicing

Skylar Advisor uses Multi-Instance GPU (MIG) together with time-slicing to balance isolation, utilization, and performance:

  • MIG (Multi-Instance GPU) partitions a physical GPU into multiple hardware-isolated GPU instances, each with dedicated memory and compute resources.
  • Time-slicing oversubscribes a GPU or MIG instance by creating multiple schedulable replicas that share the same underlying GPU resources via time multiplexing.

This design allows Skylar Advisor to:

  • Reserve most GPUs for high-performance inference
  • Provide shared GPU capacity for OCR and document processing pipelines

Documentation

MIG: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

Time-slicing: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html

MIG configuration

Apply one of the following ConfigMaps depending on whether the node uses H100 or H200 GPUs:

H100 MIG ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config-h100
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-mig:
        # GPU 0: one large MIG slice
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "7g.80gb": 1

        # GPU 1: two medium MIG slices
        - devices: [1]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2

        # GPUs 2–7: full GPUs (no MIG)
        - devices: [2,3,4,5,6,7]
          mig-enabled: false

H200 MIG ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config-h200
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      custom-mig:
        # GPU 0: one large MIG slice
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "7g.141gb": 1

        # GPU 1: two medium MIG slices
        - devices: [1]
          mig-enabled: true
          mig-devices:
            "3g.71gb": 2

        # GPUs 2–7: full GPUs (no MIG)
        - devices: [2,3,4,5,6,7]
          mig-enabled: false
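Once created, the ConfigMap is consumed by the GPU Operator's MIG manager (typically referenced through the operator's `migManager.config.name` value), and the profile is selected per node by label. The manifest filename below is an assumption, and the node name is a placeholder:

```shell
# Apply the MIG ConfigMap (assumes the manifest above was saved as shown).
kubectl apply -f mig-config-h200.yaml

# The MIG manager watches this node label and applies the named mig-config.
kubectl label node <node-name> nvidia.com/mig.config=custom-mig --overwrite
```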

Time-slicing Configuration

Time-slicing is applied on top of the MIG devices to create additional schedulable replicas.

H100 Time-slicing ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-h100
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/mig-7g.80gb
            replicas: 2
          - name: nvidia.com/mig-3g.40gb
            replicas: 4

H200 Time-slicing ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-h200
  namespace: gpu-operator
data:
  any: |
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/mig-7g.141gb
            replicas: 2
          - name: nvidia.com/mig-3g.71gb
            replicas: 4
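The device plugin picks up the time-slicing ConfigMap through the GPU Operator's `devicePlugin.config` values. A sketch, assuming the operator was installed from a Helm repository aliased `nvidia`:

```shell
# Point the device plugin at the time-slicing ConfigMap.
# The config name "any" matches the data key in the ConfigMap above.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --reuse-values \
  --set devicePlugin.config.name=time-slicing-config-h200 \
  --set devicePlugin.config.default=any
```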

GPU Allocation Summary (8-GPU node)

┌──────────────────────────────────────────────────────────┐
│                    Kubernetes GPU Node                   │
│                                                          │
│  GPU 0  ── MIG enabled                                   │
│           └─ mig-7g.80gb  (H100)                         │
│              mig-7g.141gb (H200)                         │
│              ├─ time-slice replica 1 ── OCR              │
│              └─ time-slice replica 2 ── OCR              │
│                                                          │
│  GPU 1  ── MIG enabled                                   │
│           ├─ mig-3g.40gb  (H100)                         │
│           │   mig-3g.71gb (H200)                         │
│           │   ├─ 4× time-slice replicas ── document pods │
│           │                                              │
│           └─ mig-3g.40gb  (H100)                         │
│               mig-3g.71gb (H200)                         │
│               ├─ 4× time-slice replicas ── document pods │
│                                                          │
│  GPU 2 → 7  ── Full GPU ── Inference                     │
│                                                          │
└──────────────────────────────────────────────────────────┘
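The per-node resource counts implied by the diagram can be tallied with simple arithmetic (a sketch of the scheduling math, not a command to run on the cluster):

```shell
# Schedulable resources on one 8-GPU node under the layout above.
seven_g_slices=1      # GPU 0: one 7g MIG slice
three_g_slices=2      # GPU 1: two 3g MIG slices
full_gpus=6           # GPUs 2-7: no MIG

ocr_replicas=$((seven_g_slices * 2))   # 7g slice time-sliced x2 for OCR
doc_replicas=$((three_g_slices * 4))   # each 3g slice time-sliced x4 for document pods

echo "OCR replicas: $ocr_replicas"
echo "Document replicas: $doc_replicas"
echo "Full inference GPUs: $full_gpus"
```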

Troubleshooting

Verify NVIDIA Driver Version

To verify the NVIDIA driver installed by the GPU Operator:

nvidia-smi

Expected output includes:

Driver Version: 570.195.03

You can run this command directly on the node or from a GPU-enabled pod.
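If you need the version programmatically (for example, in a health-check script), it can be parsed out of the `nvidia-smi` banner. The sample line below is illustrative only:

```shell
# Parse the driver version from an nvidia-smi banner line.
# On a real node, replace the echo/printf with: nvidia-smi | head -n 4
sample='| NVIDIA-SMI 570.195.03    Driver Version: 570.195.03    CUDA Version: 12.8 |'
driver=$(printf '%s\n' "$sample" | grep -o 'Driver Version: [0-9.]*' | awk '{print $3}')
echo "$driver"
```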

Seeding Model Weights

Skylar Advisor model weights must be copied into the shared skylar-ai-models RWX volume. This is done using a single Kubernetes pod with the AWS CLI.

Model seeding pod (single YAML):

apiVersion: v1
kind: Pod
metadata:
  name: seed-skylar-models
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: aws-cli
      image: amazon/aws-cli:2
      command: ["/bin/sh", "-lc"]
      args:
        - |
          echo "Starting model sync..."
          aws s3 cp --recursive s3://<bucket>/<prefix>/ /data/skylar-ai-models/
          echo "Model sync complete."
      env:
        - name: AWS_ACCESS_KEY_ID
          value: "<ACCESS_KEY_PROVIDED_BY_SCIENCELOGIC>"
        - name: AWS_SECRET_ACCESS_KEY
          value: "<SECRET_KEY_PROVIDED_BY_SCIENCELOGIC>"
        # Optional if required
        # - name: AWS_DEFAULT_REGION
        #   value: "us-east-1"
      volumeMounts:
        - name: skylar-ai-models
          mountPath: /data/skylar-ai-models
  volumes:
    - name: skylar-ai-models
      persistentVolumeClaim:
        claimName: skylar-ai-models
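Assuming the manifest above is saved as `seed-skylar-models.yaml`, apply it and wait for the one-shot pod to finish:

```shell
kubectl apply -f seed-skylar-models.yaml

# Wait for the seeding pod to complete (adjust the timeout for large models).
kubectl wait pod/seed-skylar-models \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=60m
```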

Verify seeded data

After the pod completes:

kubectl logs seed-skylar-models

From any pod mounting the same volume:

ls -lah /data/skylar-ai-models

Kubernetes Cluster Requirements

  • Kubernetes: Version 1.32 or later
  • Storage: A configured StorageClass capable of dynamic PV/PVC provisioning. Database storage cannot be solely NFS-backed, because NFS is not supported by the ScienceLogic databases. However, ScienceLogic recommends NFS or another RWX storage configuration for the Skylar Advisor services.
  • Storage Capacity: 1 TB or more of total storage capacity is recommended; the actual requirement varies with tenant datapoints per minute (DPM).
  • Networking: Only IPv4 is currently supported; IPv6 is not supported.
  • Deployment Options:
      • Self-hosted Kubernetes clusters, such as bare metal or VMware
      • Cloud-managed Kubernetes services (EKS, GKE, AKS)

Third-Party Chart Dependencies

Skylar AI includes several third-party Helm charts from Bitnami and other providers:

  • Bitnami Charts: ClickHouse, PostgreSQL HA, and associated components
  • Maintenance Notice: ScienceLogic validates and updates third-party chart versions with each Skylar AI release. However, customers are responsible for:
      • Security patching of third-party components between Skylar AI releases
      • Vulnerability management for non-Skylar AI components
      • Understanding the security posture of included third-party dependencies

For a complete list of external dependencies and their versions, review the Chart.yaml file in the Skylar AI chart, which you download in Step 5, below.

Required Infrastructure Dependencies

Ingress Controller

HTTP/HTTPS traffic routing and SSL termination:

  • Required: An ingress controller must be installed and configured
  • Recommended: ingress-nginx controller
  • Alternatives: AWS Load Balancer Controller, GKE Ingress, Azure Application Gateway, Traefik, or HAProxy Ingress

OpenTelemetry Operator

Observability data collection and processing:

  • Required: OpenTelemetry Operator with custom image
  • Image: registry.scilo.tools/skylar/sl-otelcol:0.16
# Install OpenTelemetry Operator with custom image
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --set "manager.collectorImage.repository=registry.scilo.tools/skylar/sl-otelcol" \
  --set "manager.collectorImage.tag=0.16" \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true

If you are using Cert-Manager for your certificate management, you can remove the following lines:

--set admissionWebhooks.certManager.enabled=false \
--set admissionWebhooks.autoGenerateCert.enabled=true

Recommended Infrastructure Components

Load Balancing

A load balancer solution is needed to distribute traffic across worker nodes:

  • Self-hosted: F5 BIG-IP, HAProxy with floating IP, or MetalLB
  • AWS: Application Load Balancer (ALB) or Network Load Balancer (NLB)
  • GCP: Google Cloud Load Balancer
  • Azure: Azure Load Balancer

DNS and TLS Requirements

Required:

  • FQDN: A fully qualified domain name pointing to the load balancer
  • TLS Certificate: Valid TLS certificate for the FQDN, provided as a Kubernetes secret

Optional Monitoring Integration

Prometheus-based Monitoring

  • Supported: Integration with existing Prometheus deployments
  • Benefits: Custom metrics export from Skylar AI services
  • Configuration: Skylar AI services expose Prometheus metrics endpoints that can be scraped for enhanced monitoring

Registry Access Setup

Step 1: Obtain Registry Credentials

  1. Navigate to the registry at https://registry.scilo.tools/. The Harbor landing page appears.
  2. Click Login via OIDC Provider.
  3. Click Customer Login.
  4. Enter your Salesforce credentials provided by ScienceLogic.
  5. Click your name in the top right corner and select User Profile.
  6. Click the copy icon in the CLI Secret field for use in the next steps.

Step 2: Sign into Helm Registry (Required)

You must authenticate with the Helm registry before performing any chart operations. Authenticate using your Harbor CLI credentials from the previous step:

helm registry login registry.scilo.tools \
  --username <your-harbor-username> \
  --password <your-cli-secret-from-step-1>

Step 3: Configure Kubernetes Registry Access

Create a Docker registry secret in your target namespace to enable image pulling:

# Create the deployment namespace
kubectl create namespace skylar-production

# Create Docker registry secret with Harbor CLI credentials
kubectl create secret docker-registry harbor-creds \
  --docker-server=registry.scilo.tools \
  --docker-username=<your-harbor-username> \
  --docker-password=<your-cli-secret-from-step-1> \
  --docker-email=<your-email> \
  --namespace=skylar-production

Step 4: Configure TLS Certificate (Optional)

You can skip this step if you have automated certificate management (such as cert-manager) or deployment processes that handle TLS configuration.

Create a TLS secret with your provided certificate and private key:

kubectl create secret tls skylar-tls-secret \
  --cert=path/to/your/certificate.crt \
  --key=path/to/your/private.key \
  --namespace=skylar-production

Step 5: Download the Skylar AI Chart

Use Helm to pull the Skylar AI chart from the registry (the registry login from Step 2 is required), and then decompress the chart:

helm pull oci://registry.scilo.tools/skylar/skylar-charts
tar -xvf skylar-charts-x.x.x.tgz

Installation Process

Step 1: Environment Configuration

Create an environment-specific override file to customize the deployment. This file should be developed in collaboration with ScienceLogic Engineering to ensure proper configuration for your environment.

Example override.yaml structure:

# File: /path/to/override.yaml
global:
  # Registry configuration
  registry:
    hostname: "registry.scilo.tools"
    username: "<your-harbor-username>"
    password: "<your-cli-secret-from-step-1>"
  
  # Image pull secrets reference
  imagePullSecrets:
    - name: harbor-creds
  
  # Kubernetes cluster IP range. Used for network controls on database access in dataviz clickhouse
  kubernetes_cidr_ranges: 100.64.0.0/16
  
  # Ingress configuration
  ingress:
    hostname: "skylar.yourdomain.com"  # Your provided FQDN
    protocol: "https"
    className: "nginx"  # Adjust based on your ingress controller
    tls:
      - secretName: skylar-tls-secret  # Your TLS certificate secret
        hosts:
          - "skylar.yourdomain.com"
  
  # Enable/disable platform components
  enablePlatform: true
  enableAnalytics: true
  enableMonitoring: false  # Set to true if integrating with existing Prometheus


# Component-specific configurations
clickhouse:
  persistence:
    size: "500Gi"

postgresql-ha:
  postgresql:
    persistence:
      size: "100Gi"


Step 2: Skylar Advisor Configuration

If you are enabling Skylar Advisor, be aware of the following requirements:

  • A GPU sizing file is necessary. Work with ScienceLogic to determine the best configuration.
  • Skylar Advisor requires a storageClass that allows RWX, because Skylar Advisor requires a shared filesystem across services.
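As an illustration of the RWX requirement, a PersistentVolumeClaim for the shared models volume might look like the following sketch. The storage class name and size are placeholders to confirm with ScienceLogic:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: skylar-ai-models
  namespace: skylar-production
spec:
  accessModes:
    - ReadWriteMany            # RWX: shared across Skylar Advisor services
  storageClassName: <your-rwx-storageclass>
  resources:
    requests:
      storage: 200Gi           # placeholder; size with ScienceLogic guidance
```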

For Skylar Advisor, you will need to add the following settings to the override.yaml file:

global:
  enableAdvisor: true  
  enableAdvisorOCR: true  


# If you are enabling Skylar Advisor, set a StorageClass that allows RWX.
skylar-advisor-api:
  persistence:
    storageClass:
    

Ingress Controller-Specific Annotations

ScienceLogic recommends the following ingress annotations, optimized for the NGINX Ingress Controller. These annotations handle proxy buffering, timeouts, and request size limits that are essential for Skylar AI operation:

nginx.org/client-max-body-size: 256m
nginx.org/proxy-buffer-size: 512k
nginx.org/proxy-buffering: "on"
nginx.org/proxy-buffers: "4 512k"
nginx.org/proxy-max-temp-file-size: 1024m
nginx.org/proxy-read-timeout: "300s"

Skylar Advisor Only:

nginx.org/proxy-hide-headers: "Content-Security-Policy" 
nginx.org/location-snippets: |
  add_header Content-Security-Policy "frame-ancestors *;" always;

You will need to enable snippets to allow the Skylar Advisor config: https://docs.nginx.com/nginx-ingress-controller/configuration/ingress-resources/advanced-configuration-with-snippets/.

For Non-NGINX Ingress Controllers, you will need to override these annotations with equivalent configurations for your ingress controller.

Make sure that your ingress controller configuration supports:

  • Large request body sizes (256 MB minimum)
  • Extended read timeouts (300 seconds minimum)
  • Proper proxy buffering for large responses

ScienceLogic is actively working on reducing these requirements with upcoming releases.

Step 3: Deploy the Skylar AI Platform

# Deploy Skylar AI with a scaling profile and custom overrides
helm upgrade --install skylar-prod \
  oci://registry.scilo.tools/skylar/skylar-charts \
  --namespace skylar-production \
  --values envs/scaling/small.yaml \
  --values /path/to/override.yaml

# If you are enabling Skylar Advisor, also add:
#   --values /path/to/GPU_Scaling_File.yaml

Scaling Profiles

Choose the appropriate scaling profile based on your environment and datapoints per minute (DPM). The scaling profiles are located in the downloaded Helm chart:

  • envs/scaling/small.yaml: Development/testing environments (0-30,000 DPM)
  • envs/scaling/medium.yaml: Small production deployments (30,000-215,000 DPM)
  • envs/scaling/large.yaml: Medium production deployments (215,000-300,000 DPM)
  • envs/scaling/xlarge.yaml: Large production deployments (300,000-900,000 DPM)
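Profile selection from a DPM estimate can be expressed as a simple threshold check (thresholds from the list above; the DPM value is a placeholder for your own estimate):

```shell
# Map an estimated DPM to a scaling profile file.
dpm=120000   # example estimate
if   [ "$dpm" -le 30000 ];  then profile=small
elif [ "$dpm" -le 215000 ]; then profile=medium
elif [ "$dpm" -le 300000 ]; then profile=large
else                             profile=xlarge
fi
echo "envs/scaling/$profile.yaml"
```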

Cloud-Specific Considerations

AWS Deployments

# Additional AWS-specific configuration
global:
  ingress:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"

# Storage configuration
clickhouse:
  persistence:
    storageClass: "gp3"

postgresql-ha:
  postgresql:
    persistence:
      storageClass: "gp3"

GCP Deployments

# Additional GCP-specific configuration
global:
  ingress:
    annotations:
      kubernetes.io/ingress.class: "gce"
      kubernetes.io/ingress.global-static-ip-name: "skylar-ip"

# Storage configuration
clickhouse:
  persistence:
    storageClass: "pd-ssd"

postgresql-ha:
  postgresql:
    persistence:
      storageClass: "pd-ssd"

Azure Deployments

# Additional Azure-specific configuration
global:
  ingress:
    annotations:
      kubernetes.io/ingress.class: "azure/application-gateway"

# Storage configuration
clickhouse:
  persistence:
    storageClass: "managed-premium"

postgresql-ha:
  postgresql:
    persistence:
      storageClass: "managed-premium"

Validation and Access

Verify Deployment Status

# Check all pods are running
kubectl get pods -n skylar-production

# Verify ingress configuration
kubectl get ingress -n skylar-production

# Check TLS certificate is properly configured
kubectl describe ingress -n skylar-production

# Check image pull secrets are working
kubectl describe pod <any-skylar-pod> -n skylar-production | grep -A5 "Events:"

Access the Platform

  1. Verify DNS Resolution:

    nslookup skylar.yourdomain.com

  2. Navigate to the Skylar AI user interface using your provided FQDN, such as https://skylar.<yourdomain>.com.

  3. Log in for the first time with the default email of skylar@sciencelogic.com.

  4. Set a password for your first login.

Monitoring Integration (Optional)

Prometheus Integration

If you have an existing Prometheus setup, you can configure it to scrape metrics from Skylar AI services:

# In your override.yaml - enables metrics endpoints
global:
  enableMonitoring: true

This will expose Prometheus metrics endpoints on Skylar services that can be scraped by your existing Prometheus deployment. Configure your Prometheus to discover and scrape these endpoints based on your service discovery method, such as Kubernetes service discovery or static configurations.
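As an illustration only (the actual ports, paths, and discovery annotations depend on your Skylar AI release and Prometheus setup), a scrape job using Kubernetes service discovery might look like:

```yaml
scrape_configs:
  - job_name: skylar-ai
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [skylar-production]
    relabel_configs:
      # Keep only endpoints annotated for scraping (annotation is illustrative).
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```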

You can ingest your scraped metrics from Prometheus into Skylar One leveraging the "SL1 Prometheus" PowerPack.

Troubleshooting

Common Issues

Image Pull Failures

  • Verify registry secret is correctly configured.
  • Check CLI secret is still valid in Harbor.
  • Ensure namespace has access to the registry secret.
  • Verify Helm registry login was successful.

TLS Certificate Issues

  • Verify the TLS secret contains valid certificate and key.
  • Check certificate matches the FQDN.
  • Ensure certificate is not expired.
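These checks can be scripted with `openssl`. The block below generates a throwaway self-signed certificate purely to demonstrate the commands; in practice, point them at your real certificate file:

```shell
# Demo: create a throwaway self-signed cert, then run the checks you would
# run against your real certificate.crt.
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem \
  -days 30 -subj "/CN=skylar.example.com" 2>/dev/null

# 1. Expiry: exit code 0 means the certificate is still valid right now.
openssl x509 -in cert.pem -noout -checkend 0 && echo "certificate not expired"

# 2. FQDN: confirm the subject matches your hostname.
openssl x509 -in cert.pem -noout -subject
```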

DNS Resolution Issues

  • Verify FQDN points to the correct load balancer IP.
  • Check DNS propagation if recently updated.

Ingress Controller Issues

  • Verify ingress controller is running and healthy.
  • Check ingress controller logs for errors.
  • Ensure ingress class name matches your controller.
  • Verify ingress annotations are compatible with your controller.

Pod Startup Failures

  • Check resource constraints and storage availability.

Database Connection Issues

  • Ensure ClickHouse and PostgreSQL pods are healthy.

Large Upload/Download Issues

  • Verify ingress controller supports large request bodies (256 MB and up).
  • Check proxy timeout configurations.
  • Ensure proper buffering settings are applied.

Cloud-Specific Issues

AWS

  • Check IAM permissions for EBS/EFS access

GCP

  • Verify service account permissions for persistent disks

Azure

  • Ensure proper RBAC for storage resources

Useful Commands

Deployment Status Commands

# Check deployment status
helm status skylar-prod -n skylar-production

# View pod logs for troubleshooting
kubectl logs <pod-name> -n skylar-production

# Describe pod for detailed information
kubectl describe pod <pod-name> -n skylar-production

Secret Verification Commands

# Verify registry secret
kubectl get secret harbor-creds -n skylar-production -o yaml

# Verify TLS secret
kubectl get secret skylar-tls-secret -n skylar-production -o yaml

# Check certificate details
kubectl get secret skylar-tls-secret -n skylar-production -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -text -noout

Infrastructure Verification Commands

# Check ingress controller status
kubectl get pods -n ingress-nginx  # Adjust namespace based on your setup

# Check persistent volume claims
kubectl get pvc -n skylar-production

# Test registry connectivity
kubectl run test-registry --image=registry.scilo.tools/skylar/skylar-charts:latest \
  --dry-run=client -o yaml | kubectl apply -f -

Registry Authentication Issues

If you encounter image pull errors:

  • Verify CLI Secret: Ensure the CLI secret has not expired in Harbor.
  • Test Registry Login: Re-authenticate with helm registry login.
  • Check Secret Format: Verify the Kubernetes secret was created correctly.
  • Namespace Access: Ensure the secret exists in the correct namespace.

TLS Certificate Issues

If you encounter TLS-related problems:

  • Certificate Validation: Verify the certificate is valid for your FQDN.
  • Certificate Format: Ensure the certificate is in PEM format.
  • Certificate Chain: Include intermediate certificates if required.
  • Private Key Match: Verify the private key matches the certificate.

Ingress Controller Issues

If ingress resources are not working:

  • Controller Status: Verify the ingress controller is running.
  • Class Name: Ensure the ingress class name matches your controller.
  • Controller Logs: Check ingress controller logs for configuration errors.
  • Service Endpoints: Verify backend services are healthy and have endpoints.
  • Annotation Compatibility: Ensure ingress annotations are supported by your controller.

Security Considerations

Security Best Practices:

  • Registry Credentials: Store CLI secrets securely and rotate them regularly.
  • TLS Certificates: Ensure certificates are from trusted CAs and renewed before expiration.
  • Network Policies: Consider implementing Kubernetes network policies to restrict inter-pod communication.
  • RBAC: Configure appropriate role-based access controls for the Skylar AI namespace.
  • Ingress Security: Configure appropriate security headers and rate limiting on your ingress controller.
  • Secrets Management: Use external secret management solutions, such as HashiCorp Vault, or AWS Secrets Manager for production.
  • Cloud Security: Follow cloud provider security best practices.
  • Third-Party Components: Monitor security advisories for included third-party components.

Third-Party Security Responsibilities

While ScienceLogic validates and updates third-party chart versions with each Skylar AI release, customers should:

  • Monitor Security Advisories: Stay informed about security issues in included third-party components.
  • Plan for Updates: Be prepared to upgrade Skylar AI when security patches are available.
  • Vulnerability Assessment: Include third-party components in security scanning and assessment processes.
  • Risk Management: Understand the security posture of all included dependencies.

Backup and Recovery

Critical Data Components

Skylar AI stores critical data in the following persistent volumes, which must be backed up regularly.

ClickHouse Data Volumes

  • Contains analytics data, metrics, and time-series information.
  • Recommended backup frequency: Daily with 30-day retention.

PostgreSQL Data Volumes

  • Contains application metadata, user data, and configuration.
  • Recommended backup frequency: Daily with 30-day retention.

Backup Strategy Options

Volume-Level Backups

  • Use cloud provider snapshot capabilities, such as AWS EBS, GCP Persistent Disks, or Azure Disks.
  • Implement Kubernetes volume snapshots using CSI drivers.
  • Consider third-party backup solutions like Velero for comprehensive cluster backup.

Application-Level Backups

  • Database-native backup tools for ClickHouse and PostgreSQL.
  • Export application configuration and secrets.
  • Backup Helm chart values and deployment configurations.

Hybrid Approach

  • Combine volume snapshots for fast recovery with application-level backups for granular restore options.
  • Implement both local and off-site backup storage for disaster recovery.

Backup Verification

Regular Testing Requirements

  • Monthly: Test backup restoration procedures in non-production environment.
  • Quarterly: Full disaster recovery simulation.
  • Document and validate recovery time objectives (RTO) and recovery point objectives (RPO).

Monitoring and Alerting

  • Monitor backup job completion and success rates.
  • Alert on backup failures or missing backups.
  • Verify backup accessibility and integrity.

Configuration Backup

Helm and Kubernetes Configuration

  • Export and store Helm values files.
  • Backup Kubernetes secrets and ConfigMaps.
  • Maintain version-controlled infrastructure as code.

Store configuration backups containing secrets in encrypted storage with restricted access. Never commit secrets to version control systems.

For detailed backup implementation guidance, consult your cloud provider's backup documentation and the Kubernetes Backup Best Practices.

Support

For deployment assistance, configuration guidance, or troubleshooting support, contact Skylar AI Enablement. They can provide:

  • Environment-specific override file templates.
  • Scaling recommendations based on your requirements.
  • Custom configuration for enterprise integrations.
  • Cloud-specific deployment guidance.
  • Registry access troubleshooting.
  • TLS certificate configuration assistance.
  • Ingress controller configuration guidance.
  • Post-deployment optimization.
  • Third-party component guidance (limited to integration aspects only).

Support for third-party components, such as Bitnami charts, is limited to integration and configuration guidance. For issues specific to these components, consult their respective documentation and support channels.