
Spot Instances and Spark: How to Run Reliably Without Paying On-Demand Prices

June 25, 2026
10 minute

In January 2026, AWS added Spot interruption metrics to EC2 Capacity Manager. These include interruption counts, rates, and patterns across regions and AZs. The message is clear: running Spot at scale requires measurement along with configuration.

For Spark on EKS, that logic goes deeper. Instance-level metrics tell you that capacity was reclaimed, but not which executor pods went down, which stages were mid-flight, or whether the retry completed.

AWS advertises Spot discounts of up to 90% off On-Demand pricing and reports that, historically, fewer than 5% of Spot instances are interrupted before the customer terminates them. This guide covers the configuration settings, architectural patterns, and observability layer that help teams capture those savings reliably.

What EC2 Spot Instances Are and Why They Matter for Spark

EC2 Spot instances give you access to spare AWS compute capacity at steep discounts. However, AWS can reclaim that capacity at any time, with a two-minute interruption notice before terminating your instance.

Spark suits Spot-based infrastructure because of how it handles failure: it can recompute lost partitions from lineage when a node disappears. This is why Spark workloads tolerate interruption better than stateful systems.

What makes Spark perfect for Spot deployments:

  • Batch jobs are stateless at the task level and can be retried.
  • Executor loss doesn't cascade unless shuffle data is involved.
  • Most pipeline SLAs can absorb a few retries if the platform is configured correctly.

Actual savings numbers will vary by instance type, availability zone, and region. AWS Spot capacity shifts constantly, so what works reliably in one availability zone (AZ) may not in another.

Spot executor pools are the highest-leverage cost cut available to data engineering teams on EKS, as long as you handle interruptions correctly.

What Spot Interruptions Actually Do to a Spark Job

Spark's lineage-based recovery is automatic, but a graceful response to an interruption is not. You have to know the sequence in which an interruption unfolds in order to configure guardrails before something breaks in production.

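AWS issues an interruption notice through EventBridge and instance metadata. The notice nominally arrives two minutes before the instance is reclaimed, but delivery is best effort, not guaranteed. Once the instance goes away, Kubernetes marks the node NotReady and its pods are evicted or terminated.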

If the node disappears abruptly, the graceful window goes with it. By the time Spark registers the executor failures, you're already behind.
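
To make the window concrete, here is a minimal sketch of reading an interruption notice and working out how much time is left. It assumes the JSON document that IMDS serves at /latest/meta-data/spot/instance-action (an action plus a UTC time). The helper name and the sample payload are illustrative, not part of any AWS SDK.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(body, now):
    """Return the seconds left before the action time in an instance-action document."""
    doc = json.loads(body)
    # The 'time' field is a UTC timestamp such as 2026-06-25T08:22:00Z.
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    # Clamp at zero: a notice that arrives late leaves no graceful window.
    return max(0.0, (deadline - now).total_seconds())


# Illustrative payload; a real one would come from the instance metadata service.
sample = '{"action": "terminate", "time": "2026-06-25T08:22:00Z"}'
now = datetime(2026, 6, 25, 8, 20, 30, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
print(remaining)  # 90.0
```

Whatever number comes back is the budget that decommissioning and shuffle migration have to fit into.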

What makes the shuffle expensive

Most executor losses are recoverable; Spark recomputes from lineage and moves on. It gets expensive when the interrupted executor was carrying shuffle output.

Downstream tasks can't fetch shuffle blocks that lived on the lost executor. They hit fetch failures, which force Spark to rerun the upstream map tasks and can snowball into stage retries. If your retry limits are tight or the stage has already run for a while, you're dealing with a job failure.
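
A toy model shows where the cost comes from. It is not Spark's internal bookkeeping, just a sketch under the assumption that each map partition's shuffle output lives on exactly one executor. The function name and the placement data are hypothetical.

```python
def map_tasks_to_rerun(shuffle_outputs, lost_executor):
    """Return the map partition ids whose shuffle output lived on the lost executor."""
    return sorted(
        pid for pid, executor in shuffle_outputs.items() if executor == lost_executor
    )


# Hypothetical placement: map partition id -> executor holding its shuffle output.
outputs = {0: "exec-1", 1: "exec-2", 2: "exec-1", 3: "exec-3"}
rerun = map_tasks_to_rerun(outputs, "exec-1")
print(rerun)  # [0, 2]
```

Losing one executor here means two already-finished map tasks have to run again before any downstream task that needs their output can proceed.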

Buying time: Spark's decommissioning settings

If AWS gives you the two-minute window, Spark can use it, but only if you've set it up to.

Three settings you should pay attention to:

  • spark.decommission.enabled=true : tells Spark to begin a graceful executor exit
  • spark.storage.decommission.enabled=true with spark.storage.decommission.shuffleBlocks.enabled=true and spark.storage.decommission.rddBlocks.enabled=true : migrates shuffle and cached RDD blocks off the executor before it shuts down
  • spark.kubernetes.dynamicAllocation.deleteGracePeriod : gives the pod time to complete decommissioning before Kubernetes terminates it
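
As a sketch, the settings above can be collected in one place and rendered as spark-submit arguments. The 120s grace period is an assumed value chosen to match the two-minute notice, not a recommendation from Spark. The rddBlocks key is an extra Spark setting added here for cached RDD blocks, and both helper names are hypothetical.

```python
def decommission_conf(delete_grace_period="120s"):
    """Spark properties for graceful executor decommissioning (values are assumptions)."""
    return {
        "spark.decommission.enabled": "true",
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        # Not in the list above: also migrate cached RDD blocks.
        "spark.storage.decommission.rddBlocks.enabled": "true",
        # Assumed value; tune it against real interruption tests.
        "spark.kubernetes.dynamicAllocation.deleteGracePeriod": delete_grace_period,
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


args = to_submit_args(decommission_conf())
print(" ".join(args))
```

Keeping the properties in a single function makes it easier to apply the same set to every job that runs on the Spot pool.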

None of this helps if the node vanishes before Spark can act. When that happens, Spark falls back to recomputation. The fallback is expensive on longer jobs and a real SLA risk on late-stage shuffles.

For platform engineering teams, the difference between a clean retry and a blown SLA usually comes down to whether the interruption signal arrived in time to act on.

What Visibility Into Spot Interruptions Actually Requires

The correct Spark configuration is necessary but not sufficient. You get savings from running Spark on Kubernetes only if you connect what happened at the EC2 layer to what broke at the Spark layer. Otherwise, your reliability posture and FinOps decisions are both down to guesswork.

A functional visibility stack needs these three things working together:

  1. Event capture: EC2 EventBridge Spot interruption warnings or IMDS polling at /latest/meta-data/spot/instance-action
  2. Node and pod lifecycle: Kubernetes node and pod conditions; note that Kubernetes Events expire after one hour by default and won't be there post-incident.
  3. Spark runtime context: Executor failure logs, shuffle fetch failures, and stage retry events tied to a specific job and run.

Kubernetes events alone don't cut it

Even if you catch an event in time, you still need to join it to the executor pod that died, the node it ran on, and the Spark job that was affected. That doesn't happen automatically from raw Kubernetes events. AWS Node Termination Handler helps with cordon and drain, but gives nothing at the Spark layer.
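
The join itself is simple once the three sources are captured; the hard part is capturing them. Here is a minimal sketch, assuming you already persist interruption events keyed by EC2 instance ID and executor pod records that carry the Spark application and executor IDs (for example, from the labels Spark puts on executor pods). All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interruption:
    instance_id: str   # EC2 instance that received the notice
    at: str            # timestamp of the notice


@dataclass(frozen=True)
class ExecutorPod:
    pod: str
    node_instance_id: str  # EC2 instance backing the node the pod ran on
    spark_app_id: str
    executor_id: str


def impacted_executors(interruptions, pods):
    """Map each Spark application id to the executor ids lost to an interruption."""
    by_instance = {}
    for p in pods:
        by_instance.setdefault(p.node_instance_id, []).append(p)
    impact = {}
    for event in interruptions:
        for p in by_instance.get(event.instance_id, []):
            impact.setdefault(p.spark_app_id, []).append(p.executor_id)
    return impact


events = [Interruption("i-0abc", "2026-06-25T08:20:00Z")]
pods = [
    ExecutorPod("etl-exec-1", "i-0abc", "spark-111", "1"),
    ExecutorPod("ml-exec-4", "i-0abc", "spark-222", "4"),
    ExecutorPod("etl-exec-2", "i-0def", "spark-111", "2"),
]
impact = impacted_executors(events, pods)
print(impact)  # {'spark-111': ['1'], 'spark-222': ['4']}
```

One interrupted instance touched two different Spark applications here, which is exactly the detail that instance-level metrics alone do not show.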

Meaningful data observability across your Spark infrastructure requires a control plane that brings together EC2 events, Kubernetes pod lifecycles, and Spark executor contexts into a single view. Acceldata xLake delivers this via its pod-level observability layer. It correlates driver and executor logs, OOMKills, Spot evictions, and scheduling failures, so you can trace an eviction to the jobs it hit.

Architectural Patterns That Improve Spot Reliability

The patterns below form an operational baseline for running Spark reliably on Spot at production scale. None is complex in isolation; the value comes from applying them together.

Keep drivers and executors on separate node pools

Run dedicated Spot and On-Demand node pools in EKS. Direct stateless, retryable Spark executors to the Spot pool. Keep your Spark driver on On-Demand. EKS-managed node groups support both purchasing options. Enforce placement via node selectors or tolerations.
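
One way to express that placement is through Spark's per-role node selector properties. The sketch below assumes EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType (values ON_DEMAND and SPOT). With Karpenter the label would typically be karpenter.sh/capacity-type instead. The helper name is hypothetical.

```python
def placement_conf(label_key, driver_value, executor_value):
    """Build node selector properties that pin drivers and executors to different pools."""
    return {
        f"spark.kubernetes.driver.node.selector.{label_key}": driver_value,
        f"spark.kubernetes.executor.node.selector.{label_key}": executor_value,
    }


# Assumed label: the capacity-type label on EKS managed node groups.
conf = placement_conf("eks.amazonaws.com/capacityType", "ON_DEMAND", "SPOT")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

If the Spot pool is tainted, executors also need a matching toleration, which is usually supplied through a pod template, not through a Spark property.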

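Make sure Capacity Rebalancing is active on your Spot node groups. EKS managed node groups enable it automatically for Spot capacity; on self-managed node groups, set CapacityRebalance to true on the Auto Scaling group and run a termination handler. Without it, the drain process after a Spot interruption or rebalance notification doesn't execute correctly, and your pods can't terminate gracefully.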

Diversify your instance types

Don't anchor on a single instance family. Spot capacity varies by instance type and AZ. So, if one family loses capacity, your entire executor pool can be interrupted at once. Use Karpenter with diverse instance type selection to minimize the chances of that.

Karpenter also cordons, taints, and drains affected nodes ahead of the interruption, giving pods the best chance at a clean exit.

Note: AWS recommends against running Karpenter's interruption handling alongside AWS Node Termination Handler in the same cluster. Pick one. Configure it consistently.

Checkpoint long-running jobs

Are you managing Spark jobs that run longer than typical interruption windows? In that case, checkpointing the intermediate state to S3 limits how much work gets lost per interruption. This applies especially to streaming jobs and multi-hour batch pipelines where recomputing from scratch is expensive.
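
A rough piece of arithmetic shows what checkpointing buys. The sketch assumes checkpoints land at fixed intervals and cost nothing to write, which is optimistic. The function name and the numbers are illustrative.

```python
def rework_minutes(elapsed_min, checkpoint_interval_min=None):
    """Minutes of work lost if the job is interrupted after elapsed_min minutes."""
    if checkpoint_interval_min is None:
        # No checkpoint: everything done so far has to be recomputed.
        return elapsed_min
    # Otherwise only the work since the most recent checkpoint is lost.
    return elapsed_min % checkpoint_interval_min


print(rework_minutes(200))      # 200
print(rework_minutes(200, 30))  # 20
```

An interruption 200 minutes into a job costs the full 200 minutes without checkpoints, but only the 20 minutes since the last checkpoint with a 30-minute interval.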

Test before trust

AWS provides an EC2 Spot interruption testing capability built on AWS Fault Injection Service. It issues a real interruption notice before terminating an instance.

Use this to validate your decommissioning configuration against actual Spark jobs in staging. A configuration that looks right on paper often behaves differently under real interruption conditions.

How to Think About Spot Cost Optimization Without Sacrificing SLAs

Is your architecture sound and your visibility layer in place? Now ask: how much Spot can you actually run?

The right Spot-to-On-Demand ratio for any workload depends on observed interruption rates, job duration, and how close your retry behavior gets to the spark.executor.maxNumFailures ceiling under real conditions. You can only answer this with eviction history.

What to track | Why it matters
Interruption frequency per node group and instance type | Identifies which instance families are highest risk in your AZs
Retry rates and stage resubmissions per Spark application | Shows which jobs absorb interruptions cleanly vs. hit retry limits
Job completion time on Spot vs. On-Demand baseline | Surfaces the actual SLA cost of Spot-induced retries
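
Two of these measurements reduce to simple ratios once the raw counts exist. Here is a minimal sketch with made-up numbers. The function names and the input shapes are assumptions, not the output of any particular tool.

```python
def interruption_rate(launched, interrupted):
    """Fraction of launched nodes that were interrupted, per instance type."""
    return {
        itype: interrupted.get(itype, 0) / count
        for itype, count in launched.items()
        if count > 0
    }


def spot_slowdown(spot_minutes, on_demand_minutes):
    """Completion time on Spot relative to the On-Demand baseline (1.0 = no penalty)."""
    return spot_minutes / on_demand_minutes


# Made-up counts for one week of node launches.
launched = {"m5.xlarge": 40, "r5.2xlarge": 20}
interrupted = {"m5.xlarge": 2, "r5.2xlarge": 5}
rates = interruption_rate(launched, interrupted)
print(rates)                  # {'m5.xlarge': 0.05, 'r5.2xlarge': 0.25}
print(spot_slowdown(66, 60))  # 1.1
```

In this example the r5.2xlarge pool is interrupted five times as often as the m5.xlarge pool, and the job pays a 10% completion-time penalty on Spot.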

The eviction history problem

Kubernetes Events expire after one hour by default. If your observability layer isn't capturing and persisting interruption events outside the API server, you have no historical basis for tuning your Spot mix, and you end up making infrastructure budget decisions without data to support them.
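
Persisting events does not require much machinery. The sketch below writes eviction records to SQLite so they outlive the event TTL and stay queryable. The table layout, field names, and sample rows are assumptions, and the in-memory database is only there to keep the example self-contained; a real setup would use durable storage.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the eviction store. Use a file path for durable storage."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS evictions ("
        "observed_at TEXT, node TEXT, instance_type TEXT, pod TEXT, reason TEXT)"
    )
    return conn


def record(conn, event):
    """Persist one eviction record given as a dict with the table's column names."""
    conn.execute(
        "INSERT INTO evictions VALUES "
        "(:observed_at, :node, :instance_type, :pod, :reason)",
        event,
    )
    conn.commit()


def evictions_by_instance_type(conn):
    """Count stored evictions per instance type."""
    rows = conn.execute(
        "SELECT instance_type, COUNT(*) FROM evictions "
        "GROUP BY instance_type ORDER BY instance_type"
    )
    return dict(rows.fetchall())


conn = open_store()
for itype, pod in [("m5.xlarge", "etl-exec-1"), ("m5.xlarge", "etl-exec-2"),
                   ("r5.2xlarge", "ml-exec-4")]:
    record(conn, {"observed_at": "2026-06-25T08:20:00Z", "node": "node-a",
                  "instance_type": itype, "pod": pod, "reason": "SpotInterruption"})
print(evictions_by_instance_type(conn))  # {'m5.xlarge': 2, 'r5.2xlarge': 1}
```

Once the history sits in a queryable store, the Spot mix question becomes a query over that table.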

FinOps teams focused on controlling data infrastructure spend need a durable eviction history to make confident Spot mix decisions. The goal is to run as much capacity on Spot as your SLAs can absorb, and you have to measure it to know what that capacity is.

Acceldata's cost optimization capabilities flag infrastructure spend patterns and reliability signals in one view. It also offers anomaly detection across cost and reliability patterns to catch unexpected shifts in interruption frequency before they magnify into SLA incidents.

Spot Savings Are Real, If Your Platform Can See the Interruptions

Running Spark on EC2 Spot instances can cut infrastructure costs on EKS substantially. Spark's lineage-based recovery, executor decommissioning, and instance diversification give you a decent fault-tolerance baseline.

To get those savings, you need to know when a Spot node was interrupted, which executor pods went down, and which Spark runs were affected. You also need to do this faster than Kubernetes Events expire, and in a format your team can query later.

Acceldata xLake highlights Spot eviction events alongside driver and executor context in a single control plane. Platform teams can trace each EC2 interruption to its job-level impact, without hopping between CloudWatch, Kubernetes logs, and Spark History Server.

Pair that with the Data Pipeline Health Agent for pipeline monitoring. You’ve now closed the loop between infrastructure events and data reliability outcomes.

Acceldata xLake makes spot-based Spark infrastructure manageable. Let us show you how. Book a demo.

Spark on Spot Instances: Frequently Asked Questions

Can Spark run reliably on EC2 Spot instances?

Yes, Spark can do so with the right setup. Spark recomputes lost RDD partitions from lineage after executor loss, so it is more interruption-tolerant than most distributed systems. But shuffle-heavy jobs need to keep decommissioning enabled to manage the retry blast radius.

What happens to a Spark job when a Spot instance is interrupted?

Affected executors are lost. Spark retries tasks using lineage. If shuffle data is lost, fetch failures can trigger stage retries or full stage failures, depending on your retry configuration and how far the stage had progressed when the interruption hit.

How do I detect Spot interruptions in a Spark-on-Kubernetes environment?

Consume EventBridge Spot interruption warnings or poll IMDS at /latest/meta-data/spot/instance-action. AWS Node Termination Handler will route those signals into Kubernetes cordon and drain operations automatically. But it will not give you Spark-layer impact visibility.

What is the best Spark configuration for Spot instance reliability?

Enable spark.decommission.enabled, spark.storage.decommission.enabled, and spark.storage.decommission.shuffleBlocks.enabled.

Add spark.kubernetes.dynamicAllocation.deleteGracePeriod for Kubernetes-native pod lifecycle handling. Test these settings against real interruptions using AWS Fault Injection Service before deploying to production.

How does instance diversification reduce Spot interruption risk on EKS?

Spot capacity varies by instance type and AZ. Using multiple instance families via Karpenter makes it less likely that your entire executor pool loses capacity simultaneously, even if one instance family is reclaimed.

About Author

Shreya Bose
