Why OOMKilled Spark Executors on Kubernetes Are Harder to Diagnose Than They Should Be

May 28, 2026

10 minute read

OOMKilled events terminate Spark executors at the Kubernetes layer before Spark can log them. The result is a visibility gap that standard Spark tooling cannot close on its own. This article explains why diagnosis becomes fragmented across tools and what a complete observability stack requires.

Your Spark job fails at 2 a.m. Executor counts drop. The Spark UI shows ExecutorLostFailure with no stack trace, no OutOfMemoryError, and no obvious root cause. The History Server only confirms that an executor disappeared.

The problem is that Spark on Kubernetes OOMKilled events happen at the Kubernetes layer, not inside Spark itself. The actual evidence lives in pod Events, container state, and node-level logs that Spark tooling cannot surface on its own.

This article explains why Spark OOMKilled diagnosis becomes fragmented across tools and what a complete visibility stack requires.

What OOMKilled Actually Means in a Kubernetes Context

When Spark runs on Kubernetes, every executor runs inside a pod whose container has a defined memory limit. If the container exceeds that limit, the Linux kernel's OOM killer terminates it, and Kubernetes records the container's termination reason as OOMKilled.

The failure happens at the infrastructure layer, not inside Spark itself.

The container process receives a SIGKILL, which cannot be caught or handled, so the JVM has no chance to flush logs, dump stack traces, or shut down cleanly. The executor disappears almost instantly. From Spark's perspective, it simply stopped responding.

The driver then registers an ExecutorLost event, but the actual cause never reaches Spark's event logs. Instead of a Java OutOfMemoryError or a stack trace pointing to memory exhaustion, the Spark UI only shows a lost executor with little context around why it vanished.

The real evidence sits in Kubernetes:

kubectl describe pod
pod Events
kubelet logs
node-level kernel logs

That is where the Linux OOM killer leaves its trail, typically as Reason: OOMKilled and exit code 137, outside Spark's native visibility surface.
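Exit code 137 is not arbitrary. By the usual POSIX shell convention, a process killed by a signal reports 128 plus the signal number, and SIGKILL is signal 9. The sketch below decodes that convention; the function name is illustrative, and it assumes a POSIX system where Python's signal module defines SIGKILL.

```python
import signal


def decode_exit_code(exit_code: int) -> str:
    """Describe a container exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {exit_code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(decode_exit_code(137))
```

An exit code of 137 alone only proves a SIGKILL was delivered. It is the combination with Reason: OOMKilled that points to the memory limit.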

Why the Spark History Server Doesn't Tell You What Actually Happened

The Spark History Server only knows what Spark itself observed. It rebuilds its view from Spark event logs: stage completions, task failures, executor changes, and retries. Kubernetes container termination details never enter that log stream, so an OOMKilled event leaves no direct trace there.

So when an executor crosses its memory limit, Spark only records an ExecutorLostFailure. The UI can show that an executor vanished during a stage, but it cannot explain why.

The missing evidence lives outside Spark:

kubectl describe pod
pod Events
kubelet logs
Linux kernel logs

That is where you see:

Reason: OOMKilled
exit code 137
node-level memory pressure signals

None of this appears in the History Server. Spark exposes the executor failure, while Kubernetes retains the infrastructure-level termination details.

The table below maps where OOMKilled evidence appears and where it remains invisible across the stack:

Layer / Tool	What You Typically See	OOMKilled Root Cause Evidence Present?
Spark UI (driver Web UI)	Executors list, stages, tasks, memory summaries	Not inherently; Spark records Spark events, not Kubernetes pod events
Spark History Server	Reconstructed UI from Spark event logs	Not inherently; bounded by what Spark logged
kubectl describe pod	Container state (Terminated), reason, exit code	Yes (Reason: OOMKilled, exit code 137)
kubectl get events --sort-by=.lastTimestamp	Chronological events, including OOM-related activity	Yes (event trail around the kill)
Node logs (kubelet + kernel)	Kernel OOM killer messages, kubelet kill notices	Yes (lowest-level confirmation)
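The same evidence that kubectl describe pod prints is available in structured form from kubectl get pod <pod-name> -n <namespace> -o json. A minimal sketch of reading it follows. The function name and the sample pod (names, timestamp) are hypothetical; the field paths follow the Kubernetes pod status schema.

```python
def find_oom_kills(pod: dict) -> list[dict]:
    """Return one record per container whose current or previous state is OOMKilled."""
    findings = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        # "state" is the current state; "lastState" holds the previous run, if any.
        for field in ("state", "lastState"):
            terminated = (cs.get(field) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                findings.append({
                    "container": cs.get("name"),
                    "field": field,
                    "exit_code": terminated.get("exitCode"),
                    "finished_at": terminated.get("finishedAt"),
                })
    return findings


# Hypothetical sample, shaped like the output of `kubectl get pod -o json`.
SAMPLE_POD = {
    "metadata": {"name": "demo-exec-1"},
    "status": {
        "containerStatuses": [
            {
                "name": "spark-kubernetes-executor",
                "state": {
                    "terminated": {
                        "reason": "OOMKilled",
                        "exitCode": 137,
                        "finishedAt": "2026-05-28T02:03:11Z",
                    }
                },
                "lastState": {},
            }
        ]
    },
}

print(find_oom_kills(SAMPLE_POD))
```

Checking both state and lastState matters because a restarted container reports the kill under lastState, while a container that is not restarted reports it under state.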

The Multi-Tool Gap That Slows Diagnosis

Finding the OOMKilled evidence is only part of the problem. Diagnosing the full incident means correlating signals across multiple tools, each exposing a different layer of the failure.

A typical investigation starts in the Spark UI after an executor disappears. Engineers then pivot into kubectl describe pod to confirm whether Kubernetes killed the container, check Events to align timestamps, retrieve pre-crash logs with kubectl logs --previous, and inspect memory pressure through kubectl top or direct cgroup reads.

The investigation continues into Spark memory overhead settings, container limits, and tmpfs configurations to determine whether sizing contributed to the kill. If multiple Spark executors fail together, teams end up reviewing kubelet and kernel logs for node-level memory-pressure signals.

The troubleshooting sequence itself is fragmented. Every context switch adds delay and correlation overhead, forcing teams to reconstruct a single failure timeline across disconnected tools.

The table below maps each diagnosis step to the primary tool it requires.

Step	Question Answered	Tool Required
1. Notice failure	Which executor died? Which stages were affected?	Spark UI / History Server
2. Confirm Kubernetes kill vs. app exception	Did Kubernetes terminate the container?	kubectl describe pod
3. Find timeline and cluster-wide context	Did multiple pods OOMKill around the same time?	kubectl get events --sort-by=.lastTimestamp
4. Check pre-death logs	Are there relevant logs before the container was killed?	kubectl logs --previous
5. Validate memory pressure	Was usage approaching the limit before the kill?	kubectl top pod or /sys/fs/cgroup/memory.current
6. Review Spark sizing inputs	Are memory overhead and tmpfs settings consistent with container limits?	Spark-on-Kubernetes configs
7. Node-level forensics	Was there node-wide memory pressure or kernel OOM activity?	kubelet logs + kernel logs
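Steps 3 and 7 both come down to one question: did several kills land on the same node at roughly the same time? A rough sketch of that check follows, assuming the termination records have already been collected from pod status. The record fields, node names, and the two-minute window are all hypothetical choices, not Kubernetes or Spark defaults.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2026-05-28T02:03:11Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def nodes_with_clustered_kills(kills, window=timedelta(minutes=2), min_kills=2):
    """Map each node to its largest count of kills falling inside one time window."""
    by_node = defaultdict(list)
    for kill in kills:
        by_node[kill["node"]].append(parse_ts(kill["finished_at"]))

    suspects = {}
    for node, times in by_node.items():
        times.sort()
        start, best = 0, 1
        # Sliding window over the sorted timestamps for this node.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_kills:
            suspects[node] = best
    return suspects


# Hypothetical records gathered from OOMKilled container statuses.
KILLS = [
    {"pod": "demo-exec-1", "node": "node-a", "finished_at": "2026-05-28T02:03:11Z"},
    {"pod": "demo-exec-4", "node": "node-a", "finished_at": "2026-05-28T02:03:40Z"},
    {"pod": "demo-exec-7", "node": "node-b", "finished_at": "2026-05-28T02:30:00Z"},
]

print(nodes_with_clustered_kills(KILLS))
```

A node that shows up here is a candidate for the kubelet and kernel log review in step 7. An isolated kill is more likely a sizing problem with that one executor.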

What Your Visibility Stack Is Missing

Observability platforms are systems that ingest logs, metrics, events, and traces across infrastructure layers. In OOMKilled investigations, those signals already exist across the stack. The challenge is correlating them across disconnected tools while preserving executor, infrastructure, and timeline context.

Closing the gap requires four capabilities working together in a single operational surface.

Pod event correlation: Container termination reason, exit code, and timestamps need to map directly to the affected Spark executor. Without that linkage, it becomes difficult to separate Kubernetes kills from application-level failures.
Executor lifecycle mapping: Teams need visibility into which stage was running, which tasks were in flight, and how executor loss impacted downstream job execution.
Memory trend history: A post-mortem kubectl describe pod only shows the final state. Teams need historical and real-time visibility into how container memory tracks against its limit over time, alongside termination signals such as the kube-state-metrics metric kube_pod_container_status_last_terminated_reason.
Proactive alerting: Effective alerting watches memory pressure build before Kubernetes terminates the pod.
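The proactive alerting idea can be illustrated with the cgroup v2 files that expose a container's usage and limit, memory.current and memory.max. The sketch below works on the raw file contents so it can run anywhere. The 0.9 threshold and the function names are hypothetical, and a real alert would normally evaluate a trend over time rather than a single sample.

```python
from typing import Optional


def parse_cgroup_value(raw: str) -> Optional[int]:
    """Parse a cgroup v2 memory value; the literal 'max' means no limit is set."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def memory_utilization(current_raw: str, max_raw: str) -> Optional[float]:
    """Return usage as a fraction of the limit, or None when there is no limit."""
    current = parse_cgroup_value(current_raw)
    limit = parse_cgroup_value(max_raw)
    if current is None or not limit:
        return None
    return current / limit


def should_alert(current_raw: str, max_raw: str, threshold: float = 0.9) -> bool:
    """True when usage has reached the (hypothetical) threshold fraction of the limit."""
    utilization = memory_utilization(current_raw, max_raw)
    return utilization is not None and utilization >= threshold


# Inside a container these strings would come from
# /sys/fs/cgroup/memory.current and /sys/fs/cgroup/memory.max.
print(should_alert("4080218931\n", "4294967296\n"))  # roughly 3.8 GiB used of 4 GiB
```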

xLake's pod-level observability surfaces Kubernetes infrastructure health, pod signals, logs, resource usage, and Spark job activity in a unified operational view, reducing the context switching required during OOMKilled investigations.

Preventing OOMKilled Recurrence Beyond the Immediate Fix

Once teams can reliably trace OOMKilled events back to Kubernetes, the next step is preventing the same failure pattern from repeating.

One overlooked factor in Spark on Kubernetes environments is memory overhead sizing. Executor containers need memory beyond the JVM heap for off-heap allocations, system processes, non-JVM tasks, and tmpfs-backed local storage.

When spark.kubernetes.local.dirs.tmpfs=true is set, tmpfs mounts consume pod memory directly, even though Spark-side memory metrics do not fully reflect it. The result is executor and PySpark driver OOMKilled conditions that appear unexplained inside the Spark UI.

The signals worth monitoring continuously include:

container memory usage versus limits
Kubernetes pod termination reasons
spark.kubernetes.memoryOverheadFactor sizing
alignment between resource requests and limits

Once a container crosses its memory limit, Kubernetes terminates it. Preventing recurrence means closing the gap between Spark's expected memory usage and the memory the container can actually consume.
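A back-of-the-envelope calculation makes the sizing gap concrete. The sketch below assumes the container memory is the executor heap plus an overhead of max(factor × heap, 384 MiB), with a default factor of 0.1 for JVM jobs, as described in the Spark configuration documentation; verify both values against your Spark version. Newer Spark releases also expose spark.executor.memoryOverheadFactor. The tmpfs peak figure is a hypothetical measurement, and the calculation ignores other additions such as off-heap or PySpark memory settings.

```python
MIN_OVERHEAD_MIB = 384  # assumed Spark minimum overhead; check your version


def overhead_mib(heap_mib: int, overhead_factor: float = 0.1) -> int:
    """Non-heap memory budgeted for the executor container."""
    return max(int(heap_mib * overhead_factor), MIN_OVERHEAD_MIB)


def executor_container_memory_mib(heap_mib: int, overhead_factor: float = 0.1) -> int:
    """Approximate container memory: JVM heap plus overhead."""
    return heap_mib + overhead_mib(heap_mib, overhead_factor)


def headroom_after_tmpfs_mib(heap_mib: int, overhead_factor: float, tmpfs_peak_mib: int) -> int:
    """Overhead left once tmpfs-backed local dirs are counted; negative means at risk."""
    return overhead_mib(heap_mib, overhead_factor) - tmpfs_peak_mib


# An 8 GiB heap with the default factor yields 819 MiB of overhead.
print(executor_container_memory_mib(8192))
# A hypothetical 2 GiB of shuffle spill on tmpfs exceeds that overhead.
print(headroom_after_tmpfs_mib(8192, 0.1, 2048))
# Raising the factor to 0.4 restores positive headroom in this example.
print(headroom_after_tmpfs_mib(8192, 0.4, 2048))
```

In this example the default overhead leaves the container short by more than a gibibyte once tmpfs usage is counted, which is exactly the kind of kill that looks unexplained from the Spark side.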

OOMKilled Is a Symptom. Visibility Is the Cure.

Spark OOMKilled events originate at the Kubernetes layer, below Spark's native visibility surface. Spark can show that an executor disappeared, but the actual termination evidence lives in pod state, Events, and node-level infrastructure logs.

That visibility gap forces teams to manually correlate Spark and Kubernetes signals during every investigation.

xLake's pod-level observability brings infrastructure signals, pod health, logs, resource usage, and Spark job activity into a unified operational view — reducing the context switching that slows OOMKilled diagnosis and root cause analysis.

Stop losing hours to OOMKilled mysteries — see how xLake unifies Spark-on-Kubernetes visibility. Book a demo at acceldata.io.

OOMKilled Spark Executors: Frequently Asked Questions

What causes OOMKilled errors in Spark executors on Kubernetes?

OOMKilled occurs when a container exceeds its memory limit and the Linux kernel OOM killer terminates it. Common indicators include Reason: OOMKilled and exit code 137 in kubectl describe pod. Non-JVM overhead, including tmpfs local directories, can contribute significantly.

Why doesn't the Spark UI show the reason for an OOMKilled executor?

Spark's UI and History Server reconstruct views from Spark event logs. Kubernetes container termination reasons are outside that log scope, so the UI records executor loss without surfacing the underlying kill reason.

How do I find OOMKilled events for a Spark job on Kubernetes?

Run kubectl describe pod <pod-name> -n <namespace> and check the container's State or Last State for Reason: OOMKilled and exit code 137. Then run kubectl get events -n <namespace> --sort-by=.lastTimestamp to trace the chronological event trail around the kill.

How is a Spark executor OOMKill different from a Java OutOfMemoryError?

OOMKilled is a container-level termination by the Linux kernel; no Java exception is generated. A Java OutOfMemoryError is a JVM-level event with a stack trace. Spark records the OOMKill only as executor loss, not as a Java error.

What is the best way to prevent OOMKilled events in Spark on Kubernetes?

Continuously monitor pod memory usage against limits through kubectl top pod or cgroup reads. Review spark.kubernetes.memoryOverheadFactor and tmpfs local directory settings regularly, and treat resource limit alignment as an ongoing configuration practice rather than a post-incident fix.

About Author

Srijan Sharma