Commit 62e0066

fix(cli): fail attempt on uncaught exception instead of hanging to maxDuration (TRI-9117) (#3529)
When a Node EventEmitter (e.g. node-redis) emits an "error" event with no listener attached, Node escalates it to process.on("uncaughtException") in the task worker. The worker reported the error via the UNCAUGHT_EXCEPTION IPC event but did not exit, and the supervisor-side handler in taskRunProcess only logged the message at debug level — leaving the run() promise orphaned until maxDuration fired and producing empty attempts (durationMs=0, costInCents=0).

The supervisor now rejects the in-flight attempt with an UncaughtExceptionError and gracefully terminates the worker (preserving the OTEL flush window) on UNCAUGHT_EXCEPTION. The attempt fails fast with TASK_RUN_UNCAUGHT_EXCEPTION, surfacing the original error name, message, and stack trace, and falls under the normal retry policy. This mirrors the existing indexing-side behavior in indexWorkerManifest.

The same handling applies to unhandled promise rejections, which Node already routes through uncaughtException by default.
1 parent 6e8b039 commit 62e0066
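The escalation path the commit message describes can be reproduced in a few lines. This is a standalone illustration, not code from the commit: Node's `EventEmitter` throws the `"error"` payload when no listener is attached, and outside any try/catch that throw surfaces as an `uncaughtException` in the worker.

```typescript
import { EventEmitter } from "node:events";

const emitter = new EventEmitter();

// With no "error" listener, emit("error", err) throws err synchronously.
// Inside a task worker this throw escapes the task's await chain and lands
// in process.on("uncaughtException") instead of this try/catch.
let thrown: Error | undefined;
try {
  emitter.emit("error", new Error("read ECONNRESET"));
} catch (err) {
  thrown = err as Error;
}
console.log(thrown?.message); // "read ECONNRESET"

// With a listener attached, the same emit is delivered and nothing throws.
const safe = new EventEmitter();
const seen: Error[] = [];
safe.on("error", (err: Error) => seen.push(err));
safe.emit("error", new Error("oops"));
console.log(seen[0]?.message); // "oops"
```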

9 files changed

Lines changed: 164 additions & 1 deletion

Lines changed: 6 additions & 0 deletions

```md
---
"trigger.dev": patch
"@trigger.dev/core": patch
---

Fail attempts on uncaught exceptions instead of hanging to `MAX_DURATION_EXCEEDED`. A Node `EventEmitter` (e.g. `node-redis`) emitting `"error"` with no `.on("error", ...)` listener escalates to `uncaughtException`, which the worker previously reported but did not act on — runs drifted to maxDuration with empty attempts. They now fail fast with the original error and status `FAILED`, and respect the task's normal retry policy. You should still attach `.on("error", ...)` listeners to long-lived clients to handle errors gracefully.
```
Lines changed: 12 additions & 0 deletions

```md
---
area: run-engine
type: fix
---

Map the new `TASK_RUN_UNCAUGHT_EXCEPTION` internal-error code to `COMPLETED_WITH_ERRORS` (Failed) status in `runStatusFromError`. cli-v3 now emits this code when the worker process surfaces an uncaught exception (e.g. a Node EventEmitter emitting `"error"` with no listener), so the run renders as a regular task failure in the dashboard rather than a system failure, while still routing through the engine's `lockedRetryConfig` lookup so the user's retry policy is honoured.
```

docs/troubleshooting.mdx

Lines changed: 49 additions & 0 deletions

@@ -278,6 +278,55 @@

If the above doesn't work, then we recommend you try increasing the machine size of your task. See our [machines guide](/machines) for more information.

### Uncaught exceptions

If you see a `TASK_RUN_UNCAUGHT_EXCEPTION` error, an exception escaped your task's `run()` function without being thrown through your `await` chain — the runtime caught it via Node's `process.on("uncaughtException")` handler. The dashboard surfaces this as a regular task failure (status `Failed`) and the run will retry according to your task's retry policy, but the exception still indicates a bug worth fixing.

The most common cause is a Node `EventEmitter` emitting an `"error"` event with no listener attached. When this happens, Node escalates the event into an `uncaughtException`. Long-lived clients like `node-redis`, `pg`, `kafkajs`, and `mongodb` all surface socket-level errors this way.

For example, a `node-redis` client with no error listener will fail your run with an `Error: read ECONNRESET` (or similar TCP error) the next time the socket is reset:

```ts
import { task } from "@trigger.dev/sdk";
import { createClient } from "redis";

export const myTask = task({
  id: "my-task",
  run: async () => {
    const client = createClient({ url: process.env.REDIS_URL });

    // BAD: no .on("error", ...) listener — a socket reset will crash the run
    // with an uncaught exception, even if the next .get() would have worked.
    await client.connect();
    return await client.get("foo");
  },
});
```

Fix it by attaching an `error` listener so the event has somewhere to go:

```ts
const client = createClient({ url: process.env.REDIS_URL });

// GOOD: the listener catches socket-level errors. The awaited command
// (e.g. .get) will still reject if the connection is broken, and that
// rejection propagates through your run() and fails the attempt cleanly.
client.on("error", (err) => {
  logger.warn("Redis client error", { err });
});

await client.connect();
return await client.get("foo");
```

The same fix applies to any library that emits `"error"` events. As a rule, attach an `.on("error", ...)` listener to every long-lived client you create inside a task.

<Note>

Unhandled promise rejections (e.g. `Promise.reject(...)` with no `.catch`) take the same path — Node routes them through `uncaughtException` by default, and the runtime treats them as `TASK_RUN_UNCAUGHT_EXCEPTION` for the same reasons. Make sure every promise either gets `await`ed or has a `.catch(...)` handler.

</Note>
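The Note above can be made concrete with a small sketch. The helper name `recordMetric` and its failure mode are made up for illustration; the point is that a fire-and-forget promise whose rejection is never observed takes the `unhandledRejection` path, while a `.catch(...)` keeps the error inside your run:

```typescript
// Hypothetical fire-and-forget helper inside a task's run().
async function recordMetric(): Promise<void> {
  throw new Error("metrics endpoint unreachable");
}

// BAD: calling `recordMetric();` bare leaves a floating promise — its
// rejection becomes an unhandledRejection and fails the attempt with
// TASK_RUN_UNCAUGHT_EXCEPTION.

// GOOD: observe the rejection so the run decides what happens next.
let lastError = "";
await recordMetric().catch((err: Error) => {
  lastError = err.message;
});
console.log(lastError); // "metrics endpoint unreachable"
```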
### Realtime stream error (`sendBatchNonBlocking` / `S2AppendSession`)

Errors mentioning `sendBatchNonBlocking`, `@s2-dev/streamstore`, or `S2AppendSession` (often with `code: undefined`) can occur when you close a stream and then await `waitUntilComplete()`, or when a stream runs for a long time (e.g. 20+ minutes). Wrap `waitUntilComplete()` in try/catch so Transport/closed-stream errors don't fail your task:

internal-packages/run-engine/src/engine/errors.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -19,6 +19,7 @@ export function runStatusFromError(
     case "TASK_INPUT_ERROR":
     case "TASK_OUTPUT_ERROR":
     case "TASK_MIDDLEWARE_ERROR":
+    case "TASK_RUN_UNCAUGHT_EXCEPTION":
       return "COMPLETED_WITH_ERRORS";
     case "TASK_RUN_CANCELLED":
       return "CANCELED";
```

packages/cli-v3/src/executions/taskRunProcess.test.ts

Lines changed: 35 additions & 1 deletion

```diff
@@ -1,6 +1,6 @@
 import { TaskRunProcess, type TaskRunProcessOptions } from "./taskRunProcess.js";
 import { describe, it, expect, vi } from "vitest";
-import { UnexpectedExitError } from "@trigger.dev/core/v3/errors";
+import { UncaughtExceptionError, UnexpectedExitError } from "@trigger.dev/core/v3/errors";
 import type {
   TaskRunExecution,
   TaskRunExecutionPayload,
@@ -118,4 +118,38 @@ describe("TaskRunProcess", () => {
       }
     });
   });
+
+  describe("parseExecuteError(UncaughtExceptionError)", () => {
+    it("returns INTERNAL_ERROR with TASK_RUN_UNCAUGHT_EXCEPTION + original message and stack", () => {
+      const error = new UncaughtExceptionError(
+        {
+          name: "Error",
+          message: "read ECONNRESET",
+          stack:
+            "Error: read ECONNRESET\n    at TCP.onStreamRead (node:internal/stream_base_commons:216:20)",
+        },
+        "uncaughtException"
+      );
+
+      const result = TaskRunProcess.parseExecuteError(error);
+
+      expect(result.type).toBe("INTERNAL_ERROR");
+      expect(result.code).toBe("TASK_RUN_UNCAUGHT_EXCEPTION");
+      expect(result.message).toBe("read ECONNRESET");
+      expect(result.stackTrace).toContain("TCP.onStreamRead");
+    });
+
+    it("uses the same code for unhandledRejection origin", () => {
+      const error = new UncaughtExceptionError(
+        { name: "TypeError", message: "boom" },
+        "unhandledRejection"
+      );
+
+      const result = TaskRunProcess.parseExecuteError(error);
+
+      expect(result.type).toBe("INTERNAL_ERROR");
+      expect(result.code).toBe("TASK_RUN_UNCAUGHT_EXCEPTION");
+      expect(result.message).toBe("boom");
+    });
+  });
 });
```

packages/cli-v3/src/executions/taskRunProcess.ts

Lines changed: 45 additions & 0 deletions

```diff
@@ -33,6 +33,7 @@ import {
   MaxDurationExceededError,
   UnexpectedExitError,
   SuspendedProcessError,
+  UncaughtExceptionError,
 } from "@trigger.dev/core/v3/errors";

 export type OnSendDebugLogMessage = InferSocketMessageSchema<
@@ -205,6 +206,18 @@ export class TaskRunProcess {
       },
       UNCAUGHT_EXCEPTION: async (message) => {
         logger.debug("uncaught exception in task run process", { ...message });
+
+        // The worker process reports uncaught exceptions and unhandled rejections via this
+        // event, but does not exit on its own. If we don't terminate the attempt here, run()
+        // hangs (the awaited promise that triggered the throw is orphaned) until maxDuration
+        // expires — surfacing as TIMED_OUT/MAX_DURATION_EXCEEDED with empty attempts. Reject
+        // any pending attempts now and gracefully terminate the worker so OTEL gets a flush
+        // window before SIGKILL.
+        this.#rejectPendingAttempts(
+          new UncaughtExceptionError(message.error, message.origin)
+        );
+
+        await this.#gracefullyTerminate(this.options.gracefulTerminationTimeoutInMs);
       },
       SEND_DEBUG_LOG: async (message) => {
         this.onSendDebugLog.post(message);
@@ -339,6 +352,23 @@ export class TaskRunProcess {
     logger.debug("child process error", { error, pid: this.pid });
   }

+  #rejectPendingAttempts(error: Error) {
+    for (const [id, status] of this._attemptStatuses.entries()) {
+      if (status !== "PENDING") {
+        continue;
+      }
+
+      this._attemptStatuses.set(id, "REJECTED");
+
+      const attemptPromise = this._attemptPromises.get(id);
+      if (!attemptPromise) {
+        continue;
+      }
+
+      attemptPromise.rejecter(error);
+    }
+  }
+
   async #handleExit(code: number | null, signal: NodeJS.Signals | null) {
     logger.debug("handling child exit", { code, signal, pid: this.pid });

@@ -559,6 +589,21 @@ export class TaskRunProcess {
       };
     }

+    if (error instanceof UncaughtExceptionError) {
+      // Dedicated INTERNAL_ERROR code so the engine handles retry via the
+      // existing crash-style lookup of run.lockedRetryConfig (same pathway as
+      // TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE etc.) and so the dashboard
+      // renders this as "Failed" rather than "System failure" — the exception
+      // was raised by user code (or a dependency the user controls, e.g. an
+      // EventEmitter "error" event with no listener), not a platform fault.
+      return {
+        type: "INTERNAL_ERROR",
+        code: TaskRunErrorCodes.TASK_RUN_UNCAUGHT_EXCEPTION,
+        message: error.originalError.message,
+        stackTrace: error.originalError.stack,
+      };
+    }
+
     return {
       type: "INTERNAL_ERROR",
       code: TaskRunErrorCodes.TASK_EXECUTION_FAILED,
```
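`#rejectPendingAttempts` relies on a deferred-promise map that pairs each attempt id with externally callable resolve/reject functions. A minimal standalone sketch of that pattern — the `rejecter` name follows the diff, while the `Deferred` type and map shapes are assumptions for illustration:

```typescript
type Deferred<T> = {
  promise: Promise<T>;
  resolver: (value: T) => void;
  rejecter: (err: Error) => void;
};

// Capture a promise's resolve/reject so other code paths can settle it later.
function createDeferred<T>(): Deferred<T> {
  let resolver!: (value: T) => void;
  let rejecter!: (err: Error) => void;
  const promise = new Promise<T>((res, rej) => {
    resolver = res;
    rejecter = rej;
  });
  return { promise, resolver, rejecter };
}

type AttemptStatus = "PENDING" | "RESOLVED" | "REJECTED";
const attemptStatuses = new Map<string, AttemptStatus>();
const attemptPromises = new Map<string, Deferred<unknown>>();

// Mirrors the loop in the diff: flip every PENDING attempt to REJECTED and
// reject its promise so the awaiting run() call fails immediately.
function rejectPendingAttempts(error: Error) {
  for (const [id, status] of attemptStatuses.entries()) {
    if (status !== "PENDING") continue;
    attemptStatuses.set(id, "REJECTED");
    attemptPromises.get(id)?.rejecter(error);
  }
}

// Usage: one in-flight attempt gets rejected with the uncaught exception.
const deferred = createDeferred<unknown>();
attemptStatuses.set("attempt_1", "PENDING");
attemptPromises.set("attempt_1", deferred);
rejectPendingAttempts(new Error("read ECONNRESET"));
const failure = await deferred.promise.catch((err: Error) => err.message);
console.log(failure); // "read ECONNRESET"
```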

packages/core/src/v3/errors.ts

Lines changed: 14 additions & 0 deletions

```diff
@@ -395,6 +395,7 @@ export function shouldRetryError(error: TaskRunError): boolean {
     case "TASK_EXECUTION_ABORTED":
     case "TASK_EXECUTION_FAILED":
     case "TASK_RUN_CRASHED":
+    case "TASK_RUN_UNCAUGHT_EXCEPTION":
     case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
     case "TASK_PROCESS_SIGTERM":
       return true;
@@ -425,6 +426,7 @@ export function shouldLookupRetrySettings(error: TaskRunError): boolean {
     case "TASK_PROCESS_EXITED_WITH_NON_ZERO_CODE":
     case "TASK_PROCESS_SIGTERM":
     case "TASK_PROCESS_SIGSEGV":
+    case "TASK_RUN_UNCAUGHT_EXCEPTION":
       return true;

     default:
@@ -722,6 +724,18 @@ const prettyInternalErrors: Partial<
       href: links.docs.troubleshooting.stalledExecution,
     },
   },
+  // Link only — we deliberately do NOT set `message`, so the original
+  // error message (e.g. "read ECONNRESET") is preserved in the dashboard.
+  // Common cause: an EventEmitter (node-redis, pg, etc.) emitted "error"
+  // with no listener attached, which Node escalates to uncaughtException.
+  // The docs page explains how to attach .on("error") listeners and how
+  // unhandled rejections route through the same path.
+  TASK_RUN_UNCAUGHT_EXCEPTION: {
+    link: {
+      name: "Read our troubleshooting guide",
+      href: links.docs.troubleshooting.uncaughtException,
+    },
+  },
 };

 const getPrettyTaskRunError = (code: TaskRunInternalError["code"]): TaskRunInternalError => {
```

packages/core/src/v3/links.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -15,6 +15,7 @@ export const links = {
   troubleshooting: {
     concurrentWaits: "https://trigger.dev/docs/troubleshooting#parallel-waits-are-not-supported",
     stalledExecution: "https://trigger.dev/docs/troubleshooting#task-run-stalled-executing",
+    uncaughtException: "https://trigger.dev/docs/troubleshooting#uncaught-exceptions",
   },
   concurrency: {
     recursiveDeadlock:
```

packages/core/src/v3/schemas/common.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -174,6 +174,7 @@ export const TaskRunInternalError = z.object({
   "GRACEFUL_EXIT_TIMEOUT",
   "TASK_RUN_HEARTBEAT_TIMEOUT",
   "TASK_RUN_CRASHED",
+  "TASK_RUN_UNCAUGHT_EXCEPTION",
   "MAX_DURATION_EXCEEDED",
   "DISK_SPACE_EXCEEDED",
   "POD_EVICTED",
```
