Modern applications increasingly require complex and long-running coordination between services, such as multi-step payment processing, AI agent orchestration, or approval processes awaiting human decisions. Building these traditionally required significant effort to implement state management, handle failures, and integrate multiple infrastructure services.
Starting today, you can use AWS Lambda durable functions to build reliable multi-step applications directly within the familiar AWS Lambda experience. Durable functions are regular Lambda functions with the same event handler and integrations you already know. You write sequential code in your preferred programming language, and durable functions track progress, automatically retry on failures, and suspend execution for up to one year at defined points, without paying for idle compute during waits.
AWS Lambda durable functions use a checkpoint and replay mechanism, known as durable execution, to deliver these capabilities. After enabling a function for durable execution, you add the new open source durable execution SDK to your function code. You then use SDK primitives like “steps” to add automatic checkpointing and retries to your business logic and “waits” to efficiently suspend execution without compute charges. When execution terminates unexpectedly, Lambda resumes from the last checkpoint, replaying your event handler from the beginning while skipping completed operations.
Getting started with AWS Lambda durable functions
Let me walk you through how to use durable functions.
First, I create a new Lambda function in the console and select Author from scratch. In the Durable execution section, I select Enable. Note that, durable function setting can only be set during function creation and currently can’t be modified for existing Lambda functions.

After I create my Lambda durable function, I can get started with the provided code.

Lambda durable functions introduces two core primitives that handle state management and recovery:
context.step() method adds automatic retries and checkpointing to your business logic. After a step is completed, it will be skipped during replay.context.wait() method pauses execution for a specified duration, terminating the function, suspending and resuming execution without compute charges.Additionally, Lambda durable functions provides other operations for more complex patterns: create_callback() creates a callback that you can use to await results for external events like API responses or human approvals, wait_for_condition() pauses until a specific condition is met like polling a REST API for process completion, and parallel() or map() operations for advanced concurrency use cases.
Building a production-ready order processing workflow
Now let’s expand the default example to build a production-ready order processing workflow. This demonstrates how to use callbacks for external approvals, handle errors properly, and configure retry strategies. I keep the code intentionally concise to focus on these core concepts. In a full implementation, you could enhance the validation step with Amazon Bedrock to add AI-powered order analysis.
Here’s how the order processing workflow works:
validate_order() checks order data to ensure all required fields are present.send_for_approval() sends the order for external human approval and waits for a callback response, suspending execution without compute charges.process_order() completes order processing.Here’s the complete order processing workflow with step definitions and the main handler:
import random
from aws_durable_execution_sdk_python import (
DurableContext,
StepContext,
durable_execution,
durable_step,
)
from aws_durable_execution_sdk_python.config import (
Duration,
StepConfig,
CallbackConfig,
)
from aws_durable_execution_sdk_python.retries import (
RetryStrategyConfig,
create_retry_strategy,
)
@durable_step
def validate_order(step_context: StepContext, order_id: str) -> dict:
"""Validates order data using AI."""
step_context.logger.info(f"Validating order: {order_id}")
# In production: calls Amazon Bedrock to validate order completeness and accuracy
return {"order_id": order_id, "status": "validated"}
@durable_step
def send_for_approval(step_context: StepContext, callback_id: str, order_id: str) -> dict:
"""Sends order for approval using the provided callback token."""
step_context.logger.info(f"Sending order {order_id} for approval with callback_id: {callback_id}")
# In production: send callback_id to external approval system
# The external system will call Lambda SendDurableExecutionCallbackSuccess or
# SendDurableExecutionCallbackFailure APIs with this callback_id when approval is complete
return {
"order_id": order_id,
"callback_id": callback_id,
"status": "sent_for_approval"
}
@durable_step
def process_order(step_context: StepContext, order_id: str) -> dict:
"""Processes the order with retry logic for transient failures."""
step_context.logger.info(f"Processing order: {order_id}")
# Simulate flaky API that sometimes fails
if random.random() > 0.4:
step_context.logger.info("Processing failed, will retry")
raise Exception("Processing failed")
return {
"order_id": order_id,
"status": "processed",
"timestamp": "2025-11-27T10:00:00Z",
}
@durable_execution
def lambda_handler(event: dict, context: DurableContext) -> dict:
try:
order_id = event.get("order_id")
# Step 1: Validate the order
validated = context.step(validate_order(order_id))
if validated["status"] != "validated":
raise Exception("Validation failed") # Terminal error - stops execution
context.logger.info(f"Order validated: {validated}")
# Step 2: Create callback
callback = context.create_callback(
name="awaiting-approval",
config=CallbackConfig(timeout=Duration.from_minutes(3))
)
context.logger.info(f"Created callback with id: {callback.callback_id}")
# Step 3: Send for approval with the callback_id
approval_request = context.step(send_for_approval(callback.callback_id, order_id))
context.logger.info(f"Approval request sent: {approval_request}")
# Step 4: Wait for the callback result
# This blocks until external system calls SendDurableExecutionCallbackSuccess or SendDurableExecutionCallbackFailure
approval_result = callback.result()
context.logger.info(f"Approval received: {approval_result}")
# Step 5: Process the order with custom retry strategy
retry_config = RetryStrategyConfig(max_attempts=3, backoff_rate=2.0)
processed = context.step(
process_order(order_id),
config=StepConfig(retry_strategy=create_retry_strategy(retry_config)),
)
if processed["status"] != "processed":
raise Exception("Processing failed") # Terminal error
context.logger.info(f"Order successfully processed: {processed}")
return processed
except Exception as error:
context.logger.error(f"Error processing order: {error}")
raise error # Re-raise to fail the execution
This code demonstrates several important concepts:
process_order step, exceptions trigger automatic retries based on the default (step 1) or configured RetryStrategy (step 5). This handles transient failures like temporary API unavailability.context.logger for the main handler and step_context.logger inside steps. The context logger suppresses duplicate logs during replay.Now I create a test event with order_id and invoke the function asynchronously to start the order workflow. I navigate to the Test tab and fill in the optional Durable execution name to identify this execution. Note that, durable functions provides built-in idempotency. If I invoke the function twice with the same execution name, the second invocation returns the existing execution result instead of creating a duplicate.

I can monitor the execution by navigating to the Durable executions tab in the Lambda console:

Here I can see each step’s status and timing. The execution shows CallbackStarted followed by InvocationCompleted, which indicates the function has terminated and execution is suspended to avoid idle charges while waiting for the approval callback.

I can now complete the callback directly from the console by choosing Send success or Send failure, or programmatically using the Lambda API.

I choose Send success.

After the callback completes, the execution resumes and processes the order. If the process_order step fails due to the simulated flaky API, it automatically retries based on the configured strategy. Once all retries succeed, the execution completes successfully.

Monitoring executions with Amazon EventBridge
You can also monitor durable function executions using Amazon EventBridge. Lambda automatically sends execution status change events to the default event bus, allowing you to build downstream workflows, send notifications, or integrate with other AWS services.
To receive these events, create an EventBridge rule on the default event bus with this pattern:
{
"source": ["aws.lambda"],
"detail-type": ["Durable Execution Status Change"]
}
Things to know
Here are key points to note:
Get started with AWS Lambda durable functions by visiting the AWS Lambda console. To learn more, refer to AWS Lambda durable functions documentation page.
Happy building!
— Donnie
Source: AWS News