
Prompting at Scale: What I Learned Running Generative AI Workloads on Serverless

This article was authored by Rajesh Pandey, Principal Engineer, Amazon Web Services

The adoption of Generative AI is transforming how we build applications, and the cloud, particularly serverless architectures, offers an enticing platform for these innovations. When teams first experiment with GenAI, it often starts with a proof of concept: a single prompt, a hardcoded response, and a quick integration. And serverless, with its promise of simplicity, elasticity, and a pay-per-use cost model, seems like the perfect launchpad.

But things change dramatically when you move from demos to resilient, production-grade systems. I’ve worked closely with teams deploying sophisticated LLM-powered features, everything from nuanced summarization tools and contextual chatbots to real-time code reviewers and automated content generators, on serverless stacks like AWS Lambda. And let me say this up front: invoking LLMs in production is nothing like querying a traditional database or calling a stateless API.

You’re dealing with models that can be comparatively slow, inherently expensive per call, functionally stateless (requiring careful context management), and prone to surprising, often unpredictable, behavior under the stress of real-world load and diverse inputs. The complexity doesn’t vanish with serverless; it shifts. Here’s what I’ve learned making GenAI reliably work and scale on serverless, and the essential guardrails I wish every architect and developer knew before shipping their first significant prompt-driven feature to production.

  1. Cold Starts Are Minor. Timeouts Are Killers.

Everyone new to serverless functions like AWS Lambda initially worries about cold starts. Yes, they matter, and for latency-sensitive synchronous interactions they need to be managed. But in many GenAI workflows, especially at production scale, where thousands of requests might be processed daily, the trickier and more damaging risk is your LLM call exceeding the configured timeout limit. Aggregated across many concurrent requests, even seemingly minor LLM response delays (perhaps the model taking a few extra seconds under load) can overwhelm default limits, especially when using API Gateway (often 30 seconds) or Step Functions as the entry point.

One team I worked with had a document summarization task that performed flawlessly in local testing. In production, however, a significant percentage of requests were being abruptly cut off mid-response. The LLM wasn’t consistently “slow” in an absolute sense; it was just frequently slower than the API Gateway’s default timeout. Worse, naive retries triggered by the gateway simply re-executed the expensive function, leading to double costs, often still timing out, and delivering fragmented, garbage outputs to the user or downstream systems. This not only impacts user experience but can also lead to data inconsistencies and a loss of trust in the feature.

Fix: For any LLM interaction that might exceed a few seconds, embrace Lambda’s asynchronous invocation model. Buffer incoming requests using a service like Amazon SQS for effective decoupling, load leveling, and durable, at-least-once asynchronous processing. Your synchronous entry point (e.g., an API Gateway endpoint) should validate the request, place it on the SQS queue, and return early with a tracking token or a polling URL. A separate Lambda function, triggered by SQS, processes the prompt. The result can then be pushed to a callback URL, streamed via WebSockets, or stored for later retrieval, with SQS dead-letter queues (DLQs) used to handle tasks that repeatedly fail even in async processing.
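
A minimal sketch of this decoupling, assuming a queue URL supplied via a PROMPT_QUEUE_URL environment variable and hypothetical call_llm and store_result helpers that stand in for your model client and result store:

import json
import os
import uuid

import boto3  # AWS SDK for Python, bundled with the Lambda Python runtimes

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["PROMPT_QUEUE_URL"]  # assumed environment variable


def enqueue_handler(event, context):
    """Synchronous entry point behind API Gateway: validate, enqueue, return early."""
    body = json.loads(event.get("body") or "{}")
    document_text = body.get("document")
    if not document_text:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'document'"})}

    job_id = str(uuid.uuid4())  # tracking token the client can poll with
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "document": document_text}),
    )
    # Respond well inside the API Gateway timeout; the slow LLM work happens elsewhere.
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}


def worker_handler(event, context):
    """SQS-triggered worker: performs the long-running LLM call off the synchronous path."""
    for record in event["Records"]:
        payload = json.loads(record["body"])
        summary = call_llm(payload["document"])   # placeholder for your actual model call
        store_result(payload["job_id"], summary)  # e.g., write to DynamoDB or S3 for later retrieval
    # Any unhandled exception lets SQS redeliver; after maxReceiveCount the message lands in the DLQ.

The key design choice is that the synchronous handler never waits on the model: its only jobs are validation, enqueueing, and handing back a token the client can poll.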

  2. Retry Logic Needs a Complete Rewrite for LLMs

Retries are a standard, necessary pattern in distributed serverless architectures to handle transient network issues or temporary unavailability of downstream services. But retrying an LLM call is fundamentally different from retrying a typical idempotent HTTP request. You risk:

  • Nonsensical or Contradictory Outputs: Due to the often non-deterministic nature of LLMs, a retry with the exact same input can yield a slightly different, or sometimes wildly different, response. This can be disastrous if the application expects consistency.
  • Inflated Costs and Duplicate Charges: If you’re using commercial LLM APIs, each retry is another billable invocation. A poorly designed retry loop can quickly burn through budgets.
  • Confusing and Frustrating User Experiences: Imagine a user asking for a summary, getting an error, trying again, and then receiving a completely different summary, or worse, a subtly altered one that contradicts the first.

In one financial analysis application, we observed retries generating hallucinated financial metrics that weren’t present in the original input data. The root cause? The second request, though using the same primary input, encountered a slightly different transient system state (e.g., a different internal routing path or a momentary variation in a supporting microservice’s response used to build context) when fetching auxiliary data. The LLM, trying to be helpful, “creatively filled in the blanks,” leading to plausible but entirely incorrect information that could have had serious consequences.

Fix: Implement a stringent retry budget (e.g., max 1-2 retries specifically for clearly defined transient errors, using exponential backoff with jitter, before routing to a dead-letter queue or a dedicated error handling workflow). If an LLM request fails, capture the detailed error state, the full input prompt, and any relevant context. Do not re-invoke the LLM immediately for anything other than network-level failures. For application-level errors or unexpected LLM responses, prioritize logging for manual review or using a fallback mechanism. Optionally, cache successful input-output pairs from first attempts so you can avoid re-processing when a downstream integration failed rather than the LLM call itself. Ensure your logging captures enough detail (trace IDs, input hashes, error messages, attempt counts) to make manual review effective.
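
A sketch of what such a retry budget can look like, assuming a hypothetical call_llm callable and treating generic network exceptions as "transient" (swap in your SDK's throttling and timeout error types):

import logging
import random
import time

logger = logging.getLogger(__name__)

TRANSIENT_ERRORS = (ConnectionError, TimeoutError)  # illustrative; use your client's throttling/timeout exceptions
MAX_RETRIES = 2                                     # strict retry budget
BASE_DELAY_SECONDS = 1.0


def invoke_llm_with_budget(call_llm, prompt, trace_id):
    """Retry only clearly transient failures, with exponential backoff and jitter."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call_llm(prompt)
        except TRANSIENT_ERRORS:
            if attempt == MAX_RETRIES:
                logger.error("retry budget exhausted trace_id=%s attempts=%d", trace_id, attempt + 1)
                raise  # let the queue's redrive policy route the message to the DLQ
            delay = BASE_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 0.5)  # backoff plus jitter
            logger.warning("transient failure trace_id=%s attempt=%d retrying in %.2fs", trace_id, attempt + 1, delay)
            time.sleep(delay)
        except Exception as exc:
            # Application-level or unexpected model errors: capture context, do not re-invoke the LLM.
            logger.error("non-retryable failure trace_id=%s prompt_len=%d error=%s", trace_id, len(prompt), exc)
            raise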

  3. Token Budgeting Is Your New Core Resource Constraint

Traditional serverless workflows make us hyper-aware of memory (MB) and CPU time (ms). When working with GenAI, tokens become your primary resource constraint and cost driver. Too many tokens in your input prompt (which includes the user query plus any context you provide), and your request will likely be rejected by the LLM. Too many tokens generated in the output, and your costs can spike unexpectedly and you may hit model-specific output limits.

We encountered this early in a project. A document processing pipeline designed to summarize uploaded content worked perfectly with short, single-page documents. However, it consistently crashed when users fed it 20-page PDFs. The Lambda function itself wasn’t failing due to memory or CPU; the underlying LLM was rejecting the requests because the constructed prompt (base instructions + entire PDF text as context) massively exceeded its maximum input token limit. This can lead to user-facing errors that are hard to debug if not anticipated.

Fix: Design prompts to be modular, allowing you to construct them dynamically based on complexity and available tokens (e.g., a base prompt template with slots for dynamically fetched, concise context). Actively use techniques like embeddings to perform semantic search and fetch only the most relevant snippets of context from larger documents, rather than sending entire texts. Set strict output size expectations via prompt engineering (e.g., “summarize in under 150 words”). Most importantly, track token usage (input tokens, output tokens, total tokens) per invocation as a first-class operational metric. Log this data, associate it with users or features, and monitor it closely to understand cost drivers and prevent abuse or runaway scenarios.
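
A minimal sketch of dynamic prompt assembly under a token budget; the limits and the characters-per-token heuristic here are assumptions, and production code should use the model's own tokenizer and the usage counts the API returns:

MAX_INPUT_TOKENS = 8000        # assumed model limit; check your model's documentation
RESERVED_OUTPUT_TOKENS = 500   # head-room so the response itself has space

BASE_PROMPT = (
    "You are a precise summarizer. Summarize the context below in under 150 words.\n\n"
    "Context:\n{context}"
)


def estimate_tokens(text):
    """Rough heuristic (~4 characters per token); replace with the model's tokenizer."""
    return len(text) // 4


def build_prompt(context_snippets):
    """Pack the most relevant snippets first and stop before the budget is exceeded."""
    budget = MAX_INPUT_TOKENS - RESERVED_OUTPUT_TOKENS - estimate_tokens(BASE_PROMPT)
    selected, used = [], 0
    for snippet in context_snippets:   # assumed pre-ranked by semantic relevance
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            break
        selected.append(snippet)
        used += cost
    prompt = BASE_PROMPT.format(context="\n---\n".join(selected))
    # Return the estimate alongside the prompt so token usage can be logged per invocation.
    return prompt, {"input_tokens_est": estimate_tokens(prompt), "snippets_used": len(selected)}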

  4. Observability Has to Go Deeper Than Standard Logs

You can’t optimize or debug what you can’t see. With the added complexity and “black box” nature of LLMs, standard serverless logs (e.g., basic Lambda execution logs) alone won’t cut it for production GenAI systems. You need deep, contextual observability into the entire lifecycle of a prompt. You must be able to answer questions like:

  • Which specific inputs or prompt variations are causing high latencies or errors?
  • What exact context (and how much of it) was injected into each prompt that led to a problematic output?
  • Where are the cost hotspots originating: which users, which features, which types of prompts?
  • How often are models outputting responses that are too short, too long, or potentially unsafe?

In our production systems, we log every prompt template used, the actual (anonymized where necessary) input, the generated output, token counts for both, measured latencies for the LLM call specifically, and any metadata about the model version. We use distributed tracing with unique trace IDs to follow a single conceptual request from its initial event ingestion, through any data enrichment steps, to the LLM invocation, and finally to the output rendering or storage. This structured, detailed observability, often ingested into platforms like OpenSearch or specialized APM tools with custom instrumentation, has been invaluable. It helped us catch performance regressions early, understand the nuances of prompt effectiveness, and avoid major surprises in our monthly cloud bills by pinpointing inefficient or overly expensive interactions.
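
As an illustration of the kind of per-invocation record this produces, here is a sketch that wraps a hypothetical call_llm client, measures the model-call latency, and emits one JSON object per invocation (a shape that CloudWatch and OpenSearch ingest cleanly):

import json
import logging
import time

logger = logging.getLogger("genai.observability")


def timed_llm_call(call_llm, prompt, trace_id, prompt_template_id, model_id):
    """Wrap the model call, measure its latency, and log one structured record."""
    start = time.perf_counter()
    response_text, usage = call_llm(prompt)  # hypothetical client returning (text, token-usage dict)
    latency_ms = int((time.perf_counter() - start) * 1000)

    record = {
        "trace_id": trace_id,                      # ties this call to the wider distributed trace
        "prompt_template_id": prompt_template_id,  # which template produced this prompt
        "model_id": model_id,                      # model and version invoked
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "llm_latency_ms": latency_ms,
        "output_chars": len(response_text),
        "timestamp": time.time(),
    }
    logger.info(json.dumps(record))  # one JSON object per line for easy downstream indexing
    return response_text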

Conclusion: Serverless GenAI – Immensely Powerful, When Engineered with Discipline

Despite these operational complexities, I remain incredibly bullish on running Generative AI workloads on serverless architectures. The fundamental benefits of fine-grained scaling without pre-provisioning commitment, isolating costs per request, and the ability to rapidly iterate on event-driven services align perfectly with the dynamic nature of many GenAI applications. You can indeed spike to thousands of concurrent prompts per second without managing a single server.

But achieving this potential reliably and cost-effectively takes discipline and a shift in mindset. It requires embracing:

  • Asynchronous-first design patterns to gracefully handle LLM latencies.
  • Rigorous token and retry governance to manage costs and ensure predictable behavior.
  • Full-funnel, prompt-centric observability to illuminate the “black box.”

Generative AI isn’t a magical solution that effortlessly scales itself. It’s a powerful new type of workload, but one that’s deceptively easy to underestimate in terms of its production demands. The journey is one of continuous learning and refinement, as the models themselves and the best practices for using them are still rapidly evolving.

If you treat your GenAI components like fragile toys, they will inevitably break under pressure. But if you engineer them like critical systems, with the right guardrails and operational diligence, they will scale and unlock remarkable new capabilities.
