BAA Building Agentic AI


Sharing My AWS Certified Generative AI Developer (AIP-C01) Prep Notes

Architecture-focused prep notes for the AWS Certified Generative AI Developer Professional (AIP-C01) exam: Bedrock, RAG, agents, guardrails, cost and latency, evaluation, and common exam traps.


Muhammad Arbab · 14 years building production AI

· 54 min read · Enterprise

Over the past few weeks (Jan 2026), while preparing for the AWS Certified Generative AI Developer - Professional (AIP-C01) exam, I ended up creating fairly detailed notes across all exam domains.

Instead of keeping them private, I decided to share them with the community, especially for anyone preparing for this certification or building responsible, production-grade GenAI applications on AWS.

I have used ChatGPT to organize, structure, and format these notes into something more readable. I am sharing them as-is, purely to help others validate their understanding and speed up their preparation.

⚠️ Important disclaimer: These are personal preparation notes, not official AWS material. Please always cross-check with the latest AWS exam guide and course outlines before relying on anything here. If you spot mistakes or have suggestions for improvement, I’d genuinely appreciate you reaching out.

What the notes cover (aligned to exam domains)

The content broadly maps to the official AIP-C01 blueprint and focuses on architecture decisions and trade-offs, not memorization.

Domain 1: Foundation Model Integration & Data Management

  • Selecting the right foundation model (Bedrock vs SageMaker, Titan vs Claude vs others)
  • Designing GenAI architectures on AWS
  • Retrieval-Augmented Generation (RAG)
  • Vector databases and chunking strategies
  • Data quality, ingestion, compliance, and residency considerations

Domain 2: Implementation & Integration Patterns

  • When to use Agents vs Orchestration
  • AWS Bedrock Agents and the ReAct pattern
  • Step Functions for deterministic workflows
  • Secure tool and API integration
  • Real-world integration patterns (Lambda, API Gateway, AppConfig, etc.)

Domain 3: AI Safety, Security & Governance

  • Bedrock Guardrails (content filtering, PII protection, grounding)
  • IAM, identity federation, and access control
  • Logging, auditing, and observability
  • Responsible AI practices, bias detection, and human-in-the-loop patterns

Domain 4: Operational Efficiency & Optimization

  • On-demand vs provisioned throughput in Bedrock
  • Latency optimization (streaming responses, caching)
  • Cost vs performance trade-offs
  • Scaling GenAI workloads and vector search systems

Domain 5: Testing, Validation & Troubleshooting

  • Prompt testing and prompt management
  • Model evaluation vs monitoring
  • Bias and drift detection
  • RAG troubleshooting (retrieval vs generation problems)

Additional high-yield topics included

Beyond the core domains, the notes also touch on:

  • GenAI customization techniques: Fine-tuning vs Continued Pre-Training (CPT) vs LoRA / adapters, and when each actually makes sense.
  • RAG & vector search best practices: Hybrid search, re-ranking, metadata filtering, chunking strategies, and common failure modes.
  • Prompt engineering & prompt management: Few-shot prompting, Chain-of-Thought, ReAct, prompt versioning, and evaluation.
  • Common exam traps & real-world shortcuts: Infrastructure vs prompt controls, guardrails vs IAM, retrieval issues vs generation issues, and “AWS-native over DIY” patterns.

How I recommend using these notes

  • Use them as a validation checklist, not a single source of truth
  • Pair them with hands-on practice (Bedrock, OpenSearch, Step Functions, etc.)
  • Always cross-reference with the latest AWS exam guide
  • Treat them as architecture thinking aids, not memorization material

The original notes were compiled from my personal preparation and the official AWS exam guide, then lightly structured for readability with the help of GenAI itself.

Domain 1: Foundation Model Integration & Data Management

Domain 1 covers planning and building the core of GenAI applications: choosing foundation models, integrating them into AWS solutions, handling data for model consumption, and ensuring compliance with data requirements. This domain represents the largest portion of the exam (31%) and includes Retrieval-Augmented Generation (RAG) techniques, vector databases, and data processing pipelines.

Designing GenAI Solutions & Selecting FMs

An AWS GenAI developer must analyze use-case requirements and design an architecture using appropriate foundation models and services. Key considerations include model capabilities, cost, latency, and integration ease. High-level strategies:

  • Foundation Models (FMs): Recognize different types of FMs (large language models, text-to-image models, embedding models, multimodal models, etc.) available on AWS. Amazon Bedrock offers a model catalog of pre-trained FMs (Amazon Titan, Anthropic Claude, Meta Llama, Cohere, etc.), accessible via API without managing infrastructure. SageMaker JumpStart allows deploying open-source models or custom models in your own environment when you need more control or custom fine-tuning. Choose Bedrock for a fully managed, serverless experience and easy integration, versus SageMaker for hosting models not available on Bedrock or when you require custom model weights and specific instance types.
  • Model Selection Strategy: For a given task, pick a model that aligns with requirements for context length, modality, and cost. For example, Amazon Titan is a family of generative models that are cost-effective for general tasks (Titan Text for text generation, Titan Embeddings for multilingual embeddings, Titan Image for image generation with responsible AI features like invisible watermarking). Anthropic Claude models (Haiku, Sonnet, Opus) trade off cost vs. intelligence: Claude Haiku is cheap and fast for simple tasks, while Claude Opus is powerful for complex reasoning but more expensive. Cohere Command models excel at enterprise search and RAG use cases, and Cohere also provides strong multilingual embedding models. If fine-tuning or custom weights are needed, open-source models like Meta Llama or Mistral can be fine-tuned and then imported into Bedrock or hosted on SageMaker (Llama for general-purpose text, Mistral for high performance at lower latency). An important exam skill is knowing which model to pick for a scenario: consider whether the task is text or image, whether it needs long context or multilingual support, and how cost-sensitive it is.
  • Dynamic Model Selection: In production, you may need the ability to switch or upgrade models without code changes. AWS AppConfig is a service that can store configuration flags or parameters (like which model ID or endpoint to use) that your application (running in Lambda, container, etc.) reads at runtime. By externalizing model selection in AppConfig, you can roll out model changes safely (feature flags, gradual deployments) without redeploying code. This pattern is mentioned in the exam for implementing “dynamic model selection without code modifications”: the expected answer is to use AppConfig (and potentially Lambda or API Gateway integration) to decouple the model choice from the code.
  • Resilience and Multi-Region: GenAI apps should be designed to handle model availability issues. For instance, Amazon Bedrock Cross-Region Inference allows you to route requests across multiple AWS Regions for certain FMs, which improves resilience and can balance load during peak times. This is useful if a model is only available in one region or if you want automatic failover for high availability. Cross-region inference profiles can raise available throughput (AWS describes up to double the default in-Region quotas) and provide regional redundancy without an extra routing charge, at the expense of data leaving the original region (consider the data residency impact). Other resilience patterns include using AWS Step Functions with a circuit breaker to halt an agent or generation loop that’s failing (to prevent infinite loops or runaway costs), e.g., detect if a loop exceeds an iteration count or error rate and break out safely.
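
The dynamic model selection pattern above can be sketched as a small lookup over an externalized configuration document. This is a minimal illustration, not the AppConfig API itself: the JSON schema (`default_model_id`, `routes`) and the function name are assumptions made up for this example, and fetching the document from AppConfig at runtime is left out.

```python
import json

# Hypothetical configuration document, shaped the way an application might
# store it in an AWS AppConfig hosted configuration profile. The schema is
# an assumption for illustration only.
SAMPLE_CONFIG = json.dumps({
    "default_model_id": "amazon.titan-text-express-v1",
    "routes": {
        "summarize": "anthropic.claude-3-haiku-20240307-v1:0",
    },
})


def select_model_id(config_json: str, task: str) -> str:
    """Return the model ID configured for a task, or the default."""
    config = json.loads(config_json)
    routes = config.get("routes", {})
    return routes.get(task, config["default_model_id"])


if __name__ == "__main__":
    # Changing the config document changes the model; the code stays the same.
    print(select_model_id(SAMPLE_CONFIG, "summarize"))
    print(select_model_id(SAMPLE_CONFIG, "classify"))
```

Because the application only reads a model ID from configuration, swapping or upgrading a model becomes a configuration deployment rather than a code deployment.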

Data Ingestion, Validation & Processing for FMs

Garbage in, garbage out: high-quality input data is crucial for reliable FM outputs. Exam Domain 1 expects knowledge of setting up data pipelines that validate and preprocess data for model consumption.

  • Data Validation: Use tools like AWS Glue Data Quality to enforce rules on your data before it’s used by a model. Glue Data Quality can run rule-based checks on datasets (e.g., ensure no empty values, correct schema, value ranges) and even generate data quality metrics. For example, before feeding documents into a vector embedding pipeline, you might use Glue Data Quality to verify required fields or remove corrupt records. Coupling Glue Data Quality with Amazon CloudWatch metrics allows you to monitor data quality trends over time and set alarms for anomalies. In exam scenarios that mention “validate data quality with rule-based checks and detect anomalies,” the expected solution is Glue Data Quality + CloudWatch (Glue provides the checks; CloudWatch tracks metrics/alerts).
  • Data Processing: Handling different data modalities often requires specialized AWS services:
      • Text: For plain text, you may need to clean or normalize it. If chunking text for a vector store, ensure consistent encoding (UTF-8) and handle special characters. For complex or large text, consider chunking (see RAG section below).
      • Documents (PDFs, etc.): Use Amazon Textract for OCR and form data extraction from PDFs or images, rather than writing custom OCR. Textract can provide structured text that you then embed or analyze. Bedrock Data Automation (BDA) also provides blueprints to extract structured data from unstructured docs, which can help parse PDFs and images before feeding into a GenAI workflow.
      • Speech/Audio: Use Amazon Transcribe to convert audio to text for downstream text-based FMs. Custom vocabulary and language models in Transcribe can improve accuracy for domain-specific terms.
      • Structured data (tables, CSV): Likely loaded via AWS Glue or placed directly into a prompt if small. Not a major focus unless combining with an LLM for analysis.
      • Images (for multimodal models): Ensure correct format (e.g., base64 encoding if passing via API). For generative image models (like Titan Image), consider size limitations and that Titan will watermark outputs invisibly for responsible AI.
  • Data Compliance & Residency: Many scenarios require that sensitive data be handled carefully. The exam explicitly notes that model training is out of scope for the candidate, but using models responsibly with data is in scope. Understand that Amazon Bedrock keeps data in-region by default (it does not move your payloads across regions). If you enable cross-region inference or use a multi-region solution, know that data will leave the region (though staying within a certain geographic boundary, e.g., US to US). For strict data residency, you might use AWS Outposts to host services on-premises or in-country: for example, pre-process or anonymize data on Outposts, then send sanitized data to Bedrock in an allowed region. This pattern appears if a question mentions an on-prem requirement or “cannot send data to the cloud”; Outposts can be an answer to keep data local while still using AWS services. Always consider using Service Control Policies (SCPs) to enforce that Bedrock APIs are only invoked in approved regions for compliance.
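
As a rough sketch of the SCP idea above, the policy document could be built like this. The `aws:RequestedRegion` condition key and the deny-unless-allowed shape are standard IAM policy features; the function name, `Sid`, and Region list are placeholders chosen for illustration. If cross-region inference is in use, the destination Regions of the inference profile would presumably need to be in the allowed list as well.

```python
import json


def build_region_scp(allowed_regions):
    """Build an SCP document that denies Bedrock actions outside approved Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBedrockOutsideApprovedRegions",  # illustrative name
                "Effect": "Deny",
                "Action": "bedrock:*",
                "Resource": "*",
                "Condition": {
                    # Deny whenever the request targets a Region not in the list.
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Example Regions only; substitute the Regions your compliance rules allow.
    print(json.dumps(build_region_scp(["eu-west-1", "eu-central-1"]), indent=2))
```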

Retrieval-Augmented Generation (RAG) and Vector Stores

A significant portion of Domain 1 (and exam questions) is about Retrieval-Augmented Generation (RAG), where a model’s prompts are augmented with relevant data retrieved from a knowledge source. RAG architecture involves storing embeddings of documents and retrieving the top relevant chunks to include in the model’s context, thereby grounding the model’s responses in factual data. Key topics include choosing vector databases, indexing strategies, chunking text, and improving retrieval relevance.

  • Vector Database Options: AWS offers multiple ways to store embeddings (vector representations):
      • Amazon OpenSearch Serverless with vector search: A popular and highly likely exam topic. To use OpenSearch for vectors, you must create a Vector Search collection (not a standard search or time-series collection). OpenSearch Serverless measures capacity in OCUs (OpenSearch Compute Units). The figure often quoted is a minimum of 4 OCUs (2 for indexing, 2 for search) for a basic setup, though AWS has lowered the entry point over time, so verify the current minimum. It stores vectors in memory for fast similarity search. A big advantage is managed scaling and integration (there’s a Bedrock Vector Connector for OpenSearch). Exam tip: If given a choice between using OpenSearch vs. building a custom vector store on, say, DynamoDB or S3, prefer OpenSearch if you need real-time vector similarity search. S3 is not a real-time vector DB by itself (S3 might store embedding files but cannot do similarity queries).
      • Amazon Bedrock Knowledge Bases: A fully managed RAG service that handles ingestion, chunking, embedding, and storage for you. Bedrock Knowledge Bases can use OpenSearch (serverless), Amazon Aurora PostgreSQL with pgvector, Amazon Neptune (for knowledge graphs), or S3-based vector indexes as the backing store. Essentially, it’s an “easy button” for RAG: you point it at data in S3 or use connectors (to SaaS like Confluence, etc.), and Bedrock manages creating embeddings and allows the model to RetrieveAndGenerate in one go. If the question mentions minimizing custom code for RAG or needing a managed solution, Bedrock Knowledge Bases is a strong candidate for the correct answer.
      • Amazon Aurora PostgreSQL with pgvector: Suited if you already use a relational database and want to add vector search to it. Aurora with the pgvector extension allows you to store and query embeddings alongside relational data, useful for extending an existing app’s database with semantic search. Not the first choice for large-scale vector search unless integration with relational data is needed.
      • Others: In some scenarios, Amazon DynamoDB + an external vector library could appear, or third-party solutions. But AWS now has native solutions, so the exam leans towards those. For in-memory conversation history (chatbot session memory), Amazon MemoryDB (Redis-compatible) can be used to quickly store recent interaction embeddings or text. It is not the default vector store for RAG (although MemoryDB has since added vector search), and it is mentioned mainly for short-term memory use cases. If the question scenario is about maintaining a conversation context or session state cheaply, a Redis-based solution might be indicated.
  • Data Ingestion and Sync: For any vector store, keeping it updated with the latest documents is important. With Bedrock Knowledge Bases, you ingest data from S3 via a StartIngestionJob API or automated triggers (e.g., an EventBridge event when new documents land in S3). A common exam point: Vector stores do not automatically stay in sync with changes in the source data: you must design a mechanism to re-ingest or update the index when new data arrives. For example, if new documents are added daily to S3, you might schedule a daily ingestion job or use event notifications. If timeliness matters, consider architectures for near-real-time updates (Lambda triggered on S3 put to add the new embedding).
  • Chunking Strategies: Breaking documents into chunks is crucial for effective retrieval. Strategies include:
      • Fixed-size Chunking: Split text into uniform chunks (e.g. 300 tokens each). Simple and works for documents that don’t have much internal structure.
      • Hierarchical Chunking (Parent-Child): For documents with sections/subsections (like legal contracts or articles), this approach creates small fine-grained chunks (children) but retains links to larger sections (parents). The retrieval can then return a larger section for context if a small chunk was relevant. This preserves context hierarchy and is likely to be mentioned if the question is about maintaining relationships in documents (e.g., an HTML article made of paragraphs).
      • Semantic Chunking: Instead of fixed sizes, use semantic boundaries, e.g., split when the topic shifts or based on meaning. This can be done by looking for paragraph/topic boundaries or even using an embedding similarity approach to decide chunk breaks. Useful for narrative text that isn’t uniform, so that each chunk is about a coherent subtopic.

Exam Tip: If a scenario points out that relevant info is being missed or that chunks are too large/small, think about adjusting the chunking strategy. For example, if irrelevant content is being retrieved, the chunks may be too large (containing mixed topics), and smaller or semantic chunks could help. If important context is split across chunks, a hierarchical approach might help ensure the model sees the broader context.
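
To make the fixed-size and hierarchical strategies above concrete, here is a minimal sketch. It is not how Bedrock Knowledge Bases implements chunking: it counts whitespace-separated words as a stand-in for tokens, and the function names, the `overlap` parameter, and the `parent_id` field are assumptions made for this example.

```python
def fixed_size_chunks(text, size=300, overlap=30):
    """Split text into chunks of `size` words, repeating `overlap` words
    between neighbouring chunks. Words approximate tokens here."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def hierarchical_chunks(sections, child_size=50):
    """Split each section into small child chunks that keep a pointer
    (`parent_id`) back to the larger section they came from."""
    children = []
    for parent_id, section_text in enumerate(sections):
        for child in fixed_size_chunks(section_text, size=child_size, overlap=0):
            children.append({"parent_id": parent_id, "text": child})
    return children


if __name__ == "__main__":
    print(fixed_size_chunks("a b c d e f g h i j", size=4, overlap=1))
    print(hierarchical_chunks(["one two three", "four five"], child_size=2))
```

With the parent pointer in place, retrieval can match on a small child chunk and then hand the model the whole parent section, which is the behaviour the hierarchical strategy is after.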

  • Improving Retrieval Relevance: There are advanced techniques to boost the quality of retrieved context (the exam may present a troubleshooting scenario where retrieved chunks are not improving the answer, and ask how to fix it without retraining the model):
      • Hybrid Search: Combine lexical keyword search (e.g., BM25) with vector similarity search. This captures exact matches (like specific error codes, IDs, or rare terms) via keywords and general semantic matches via vectors. OpenSearch supports hybrid searches natively. For instance, if the model is ignoring error codes because pure vector search doesn’t consider them, adding keyword search ensures those exact terms count. Hybrid search addresses cases where certain precise tokens are needed in results.
      • Metadata Filtering: Use metadata tags on documents (like date, author, category) to filter search results to only relevant subsets. For example, for a query about 2023 financial data, filter documents where year=2023. In AWS, if using Bedrock Knowledge Bases with S3, you can attach a JSON metadata file to each document in S3 (or store metadata in the OpenSearch index). Pre-filtering by metadata (e.g., only search within department=Legal documents if a legal query) can vastly improve precision. Exam Tip: Pre-filtering yields performance benefits if metadata is accurate, while post-filtering (filter results after retrieval) might be used if metadata isn’t perfectly indicative; pre-filter is generally preferred for efficiency.
      • Re-Ranking: Retrieve a larger set of candidates (say top 50 vectors), then use a second stage to rank them by relevance with a more precise model. For example, feed the top documents into a specialized re-ranker model (like a cross-encoder or Cohere Rerank) to score actual relevance. Return the top 3 to 5 to the LLM. This improves precision by using a slower but better method after the initial fast search.
      • Query Expansion (or Reformulation): Use an LLM to transform a vague user query into multiple, more specific queries. For instance, if the user asks about a “benefits package,” the system expands it to “health insurance coverage” and “401k matching policy” to search the KB. This increases recall (finding relevant info that the original phrasing might miss).
      • Hypothetical Document Embeddings (HyDE): A clever approach where the model first generates a hypothetical answer to the question, then that answer is embedded and used to query the vector store. The idea is that the model’s own guess might be semantically closer to the actual answer passages. Use this if user queries are very short or abstract, since it can improve retrieval hit rate.
      • Query Decomposition: Break complex multi-part queries into simpler sub-queries. For example, a question asks for comparison across years; an agent can split it into two queries (one per year) and then combine results. Bedrock Agents or Step Functions can implement this multi-step retrieval logic.

If a question describes a RAG scenario where the answers are irrelevant or incomplete even though embeddings and chunking seem correct, the likely fixes are one of the above: e.g., apply Hybrid search to capture missing keywords if numeric codes or exact phrases are being lost, or use Re-ranking if a lot of noise is in top results. Also remember the distinction: if the retrieved documents are wrong, fix retrieval (chunking, search technique); if the documents are right but the model’s answer is wrong, that’s a generation issue (prompting or model choice).
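
The hybrid search and metadata pre-filtering ideas above can be illustrated with a small, library-free sketch. Note the swap: this uses Reciprocal Rank Fusion (RRF), one common way to merge a keyword ranking with a vector ranking, and is not a reproduction of how OpenSearch scores hybrid queries. The document IDs, the `metadata` field, and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def prefilter(docs, **required):
    """Keep only documents whose metadata matches every required value."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == val for key, val in required.items())
    ]


if __name__ == "__main__":
    keyword_hits = ["doc_err42", "doc_a", "doc_b"]  # e.g. a BM25 ranking
    vector_hits = ["doc_a", "doc_c", "doc_err42"]   # e.g. a k-NN ranking
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

A document that ranks well in either list stays near the top of the fused list, which is why an exact-match hit such as an error code is not lost when the vector ranking alone would have buried it.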

Compliance & Governance in Data Management

Although Domain 3 focuses on security, Domain 1 also explicitly includes compliance aspects (e.g. data handling). Key points:

  • Data Encryption & Access: Ensure that any data stores (S3, OpenSearch, etc.) used in your GenAI solution have encryption at rest and in transit, and that access is controlled via IAM. Use VPC endpoints for services like Bedrock, S3, or OpenSearch if you need to ensure traffic doesn’t go over public internet (important for sensitive data scenarios).
  • Traceability: If asked how to track where training/reference data came from or how a model output was derived, consider services like AWS Glue Data Catalog to catalog data sources and transformations. Also, enabling Amazon Bedrock’s prompt/response logging (with PutModelInvocationLoggingConfiguration) can send full prompt and response payloads to S3 for auditing. Note that CloudWatch Logs truncates at 100KB, so S3 is needed for large payloads.
  • Data Privacy: Use tools to detect and handle sensitive data. Amazon Macie can scan S3 for PII/sensitive info: e.g., before indexing documents for RAG, run Macie to identify files with PII and treat them (mask or restrict access). Also know that Bedrock Guardrails can automatically detect PII in model outputs and either block it or redact it. For inputs, you might use Amazon Comprehend for PII detection if doing a custom pipeline (Comprehend has PII entity detection that can label and redact PII in text).
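
For the invocation logging point above, a hedged sketch of what the configuration might look like with boto3. The field names reflect my understanding of the `put_model_invocation_logging_configuration` request shape and should be checked against the current boto3 documentation; the bucket, prefix, log group, and role ARN are placeholders, and the client is passed in so the sketch has no hard dependency on AWS credentials.

```python
def build_logging_config(bucket, prefix, log_group=None, role_arn=None):
    """Build the loggingConfig payload: always log to S3, optionally also to
    CloudWatch Logs with oversized payloads spilling over to S3."""
    config = {
        "s3Config": {"bucketName": bucket, "keyPrefix": prefix},
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
    if log_group and role_arn:
        config["cloudWatchConfig"] = {
            "logGroupName": log_group,
            "roleArn": role_arn,
            # Payloads too large for CloudWatch Logs are delivered here instead.
            "largeDataDeliveryS3Config": {
                "bucketName": bucket,
                "keyPrefix": prefix + "/large",
            },
        }
    return config


def enable_invocation_logging(bedrock_client, logging_config):
    """Apply the configuration; `bedrock_client` would be boto3.client("bedrock")."""
    bedrock_client.put_model_invocation_logging_configuration(
        loggingConfig=logging_config
    )
```

In use, this would look something like `enable_invocation_logging(boto3.client("bedrock"), build_logging_config("my-audit-bucket", "bedrock-logs"))`, where the bucket name is hypothetical.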

By mastering the above topics in Domain 1, you’ll handle a substantial portion of the exam. Next, we’ll look at Domain 2, which builds on these foundations by focusing on implementation patterns and integration of GenAI into applications.

Domain 2: Implementation & Integration Patterns (Agents, Orchestration, APIs)

Domain 2 (26% weight) tests your ability to implement GenAI solutions using the right architectural patterns and AWS services. This includes building agent-based systems, orchestrating multi-step workflows, integrating with external APIs/tools, and deploying GenAI components in production. You will often need to decide between using a managed service vs. custom solution for integration tasks.

Agents vs. Orchestration: Choosing the Right Approach

A central theme is when to use an AI Agent (dynamic, reasoning to take actions) versus a fixed orchestration workflow (pre-defined steps).

  • Amazon Bedrock Agents: Bedrock provides a managed agents framework where an LLM can act on your behalf by calling external tools/APIs based on instructions. This uses the ReAct (Reason + Act) pattern: the agent reasons about what it needs, performs an action, observes results, and continues in a loop. With Bedrock, you define Action Groups (each grouping API calls the agent is allowed to use, e.g., AWS SDK calls, Google Drive API, etc.). The agent’s prompt is automatically augmented with a description of these actions. Bedrock runs the agent and handles the tool invocation securely, so you don’t have to implement the agent loop logic from scratch (Amazon Bedrock AgentCore is the related runtime for agents you build with your own framework). Use Bedrock Agents when the task is open-ended or requires adaptive decision-making, e.g., “Research this topic and create a report,” where the steps aren’t fixed and the agent might call different APIs depending on content. Agents shine for autonomous or semi-autonomous workflows where AI figures out which steps to take.
  • AWS Step Functions: A fully managed orchestration service (serverless state machine) that excels at pre-defined sequences of steps, especially when integrating with various AWS services or including human approvals. In GenAI context, Step Functions might coordinate a flow like: extract text -> summarize -> analyze sentiment -> store results. You hardcode the steps and use features like parallel Map states (e.g., summarizing thousands of documents in parallel) and error handling (retries, catch) to build robust workflows. Use Step Functions when the process is well-known and does not require the model to decide the next action: e.g., batch processing of a set of inputs, or a multi-step pipeline that always follows the same order (some exam questions refer to this as a “defined process” vs. an agent for an “open-ended task”).
  • Amazon Bedrock Flows (formerly Prompt Flows): A newer low-code tool specifically for chaining prompts and simple logic in GenAI apps. It provides a visual interface to create flows with blocks for prompts, knowledge base retrieval, conditions, and even calling Lambda functions. It’s like a lighter-weight, GenAI-focused orchestrator. If a question mentions a non-developer wanting to build a GenAI workflow visually or quickly prototype, Bedrock Flows might be the answer.
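
The ReAct loop described above (reason, act, observe, repeat) is what a managed agent runtime handles for you. A toy sketch of the loop shape follows, assuming a `decide` callable that stands in for the LLM and a dictionary of tools; all names here are hypothetical, and the `max_steps` cap plays the role of a simple circuit breaker against runaway loops.

```python
def run_react_loop(decide, tools, question, max_steps=5):
    """Run a Reason -> Act -> Observe loop until the policy returns an answer.

    `decide(question, history)` stands in for the model. It returns either
    ("act", tool_name, tool_input) or ("finish", answer).
    """
    history = []
    for _ in range(max_steps):
        decision = decide(question, history)            # Reason
        if decision[0] == "finish":
            return decision[1], history
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)      # Act
        history.append((tool_name, tool_input, observation))  # Observe
    raise RuntimeError(f"no answer after {max_steps} steps; stopping the loop")


def scripted_policy(question, history):
    """Toy stand-in for the model: look something up once, then answer."""
    if not history:
        return ("act", "lookup", question)
    return ("finish", f"Answer based on: {history[-1][2]}")


if __name__ == "__main__":
    tools = {"lookup": lambda query: f"stub result for {query!r}"}
    answer, trace = run_react_loop(scripted_policy, tools, "refund policy")
    print(answer)
```

The contrast with a fixed workflow is that the sequence of tool calls is chosen at run time by `decide`, not written into the code.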

Choosing Between Them: The exam may present a scenario and you need to decide between an agent solution or an orchestrated workflow (or both). Key distinctions:

  • If the task requires the AI to dynamically decide which tools or steps to use (especially external APIs), go with an Agent. E.g., an AI that autonomously queries different databases based on user request.
  • If the workflow is well-defined (even if complex) and especially if it involves many AWS service calls and error checks, use Step Functions. E.g., processing a mortgage application through a fixed series of steps (extract data, run queries, generate summary).
  • Sometimes a combination is best: Step Functions could orchestrate high-level steps and at a certain step invoke a Bedrock Agent to handle a subtask that requires reasoning. This hybrid approach might be implicit in some answers.
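
For the fixed-workflow side of that comparison, here is a rough sketch of a Step Functions definition (Amazon States Language) expressed as a Python dictionary: a Map state fans out over documents, each item is summarized by a Lambda task with retries, and a final task stores the results. The state names and the two Lambda ARNs are hypothetical; the overall structure follows the ASL fields as I understand them.

```python
import json


def build_summarize_state_machine(summarize_fn_arn, store_fn_arn, max_concurrency=10):
    """Return an ASL definition for: summarize each document in parallel, then store."""
    return {
        "Comment": "Fixed pipeline: summarize each document, then store results",
        "StartAt": "SummarizeEach",
        "States": {
            "SummarizeEach": {
                "Type": "Map",
                "ItemsPath": "$.documents",
                "MaxConcurrency": max_concurrency,
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "Summarize",
                    "States": {
                        "Summarize": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": summarize_fn_arn,
                                "Payload.$": "$",
                            },
                            "OutputPath": "$.Payload",
                            # Retry transient failures with exponential backoff.
                            "Retry": [{
                                "ErrorEquals": ["States.TaskFailed"],
                                "IntervalSeconds": 2,
                                "MaxAttempts": 3,
                                "BackoffRate": 2.0,
                            }],
                            "End": True,
                        }
                    },
                },
                "ResultPath": "$.summaries",
                "Next": "StoreResults",
            },
            "StoreResults": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": store_fn_arn, "Payload.$": "$"},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_summarize_state_machine(
        "arn:aws:lambda:us-east-1:123456789012:function:summarize",  # placeholder
        "arn:aws:lambda:us-east-1:123456789012:function:store",      # placeholder
    )
    print(json.dumps(definition, indent=2))
```

Every step and transition is spelled out ahead of time, which is exactly what makes this the better fit for a “defined process” and the worse fit for an open-ended task.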

Also, remember Strands Agents, an open-source SDK from AWS for building agents, including multi-agent systems (agents working together). If you see mention of “multiple agents collaborating”, “agent squads,” or needing more flexible interaction than a fixed Step Functions sequence, Strands can be the answer. Strands allows agents to dynamically communicate and assign tasks to each other, which is useful for complex autonomous systems (though this is a niche topic, early material suggests knowing it if agents come up).

Integration with External Tools and APIs

Generative AI apps often need to fetch data from or send actions to external systems (e.g., databases, SaaS applications) as part of the AI workflow. AWS’s solution is to integrate those tools securely and reliably rather than via naive prompt hacking. Key concepts:

  • Model Context Protocol (MCP): This is an emerging open protocol that Bedrock Agents use to connect to external tools in a standardized way. With MCP, a tool provider (e.g., a Google Drive API) implements an MCP Server that exposes certain operations. The Bedrock agent acts as an MCP Client and communicates via the AgentCore Gateway. The Gateway acts as a traffic cop and allows inserting Interceptors: e.g., a security interceptor could enforce that certain users only access certain data, or a schema interceptor could redact sensitive info from the tool’s response before it reaches the agent. The exam likely won’t ask detail on MCP, but understand that Bedrock Agents don’t directly call third-party APIs by scraping text; instead, they go through a structured interface (MCP or built-in AWS integrations). If you see a choice about “implement custom Lambda for tool integration” vs. “use AgentCore Gateway with MCP,” the latter is the modern, preferred approach.
  • AWS Lambda / API Gateway: Traditional integration method: e.g., if an LLM needs to get data from an internal system, one pattern is to have the LLM call an API (API Gateway + Lambda) that you’ve exposed. For example, an agent Action could be hitting your API Gateway endpoint which triggers Lambda to fetch from a database. This was common before MCP; the exam might still include scenarios where you use API Gateway+Lambda to let the AI system retrieve data safely (like retrieving data via a function calling mechanism). If a question mentions needing to let the model get data from a DB or call proprietary logic, wrapping that in a Lambda function and allowing the agent or workflow to call it is a secure way (don’t have the model directly query the DB).
  • Secure Authentication Flow (for Agents on user’s behalf): Bedrock supports an “agent acts as user” scenario. For instance, a user logs into an app and the agent then, with the user’s permission, accesses the user’s Google Drive files. The flow typically is:
      • The user authenticates (say via Cognito or another IdP) and the app gets a JWT/OIDC token for the user.
      • When the app calls InvokeAgent, it passes that user token.
      • Bedrock’s AgentCore validates the token and exchanges it for a Workload Identity Token via GetWorkloadAccessTokenForJWT (basically mapping the user to an IAM role or policy).
      • The agent then uses that token to access the tool’s API (so the tool sees an access token scoped to that user).

The key point: AWS services do not natively accept third-party JWTs for auth. You must integrate with Cognito (or a custom identity mapping) to translate an external identity into AWS credentials/permissions. If an exam question describes a scenario with “users from Azure AD (Entra ID) should use Bedrock agents,” the correct approach is likely to use Amazon Cognito Identity Pools to map the OIDC token to an IAM role, then call Bedrock with AWS SigV4 signing. Never assume Bedrock or any AWS service trusts a raw JWT from outside; it has to be exchanged. Also, Bedrock guardrails are not for authentication/authorization; guardrails filter content, not manage user identity.

  • Cross-Account Integrations: You might have Bedrock in one AWS account and tools or data in another. You can’t attach an IAM role from Account A to a Bedrock resource in Account B directly. Instead use resource-based policies (e.g., on the Bedrock Agent or Knowledge Base resource) to allow the external account access. This could come up if a question mentions multi-account setups with Bedrock.
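
The token exchange described above (external OIDC token in, temporary AWS credentials out) might look roughly like this with a Cognito Identity Pool. The two calls, `get_id` and `get_credentials_for_identity`, are the Cognito Identity operations as exposed by boto3; the function name, the identity pool ID, and the provider string are placeholders, and the client is injected so the sketch can be read and exercised without AWS access.

```python
def exchange_oidc_token(cognito_identity, identity_pool_id, provider, id_token):
    """Trade an external IdP token for temporary AWS credentials.

    `cognito_identity` would be boto3.client("cognito-identity"); `provider`
    is the login provider name configured on the identity pool.
    """
    logins = {provider: id_token}
    identity = cognito_identity.get_id(
        IdentityPoolId=identity_pool_id, Logins=logins
    )
    creds = cognito_identity.get_credentials_for_identity(
        IdentityId=identity["IdentityId"], Logins=logins
    )["Credentials"]
    # Returned under the keyword names boto3.client(...) accepts, so the
    # result can be splatted into a new, SigV4-signing client.
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

A caller could then create a scoped client with something like `boto3.client("bedrock-agent-runtime", **exchange_oidc_token(...))`, so that requests are signed with the IAM role mapped to that user rather than with a raw JWT.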

Deployment & Integration Architecture Patterns

Some integration patterns and best practices that often appear:

  • Chaining & Prompt Passing: Building multi-step pipelines where the output of one model or prompt goes into the next. If the exam asks about handling a complex query (e.g., “first summarize this doc, then translate it”), you might use either Step Functions or Bedrock Flows to chain prompts, or even have an agent orchestrate it. A concept called Chain-of-Thought (CoT) prompting is more about prompting technique (discussed later), but prompt chaining is literally feeding outputs from one stage to another. Step Functions can do this by storing the output in the state and passing it to the next Lambda or Bedrock call.
  • Parallel Inference: If you need to run many inferences simultaneously (e.g., generate summaries for 1000 documents), you can use Step Functions Map state or AWS Batch, or use Bedrock batch inference (the CreateModelInvocationJob API, which reads inputs from S3 and writes outputs back to S3 asynchronously). Bedrock’s batch job is ideal for large-scale offline inference (millions of records) to optimize throughput. Use cases: nightly batch processing or migrating a big dataset through a model.
  • API Integration: Many GenAI apps are exposed as APIs (e.g., a chat API). Amazon API Gateway can front your GenAI logic (whether it’s a Lambda that calls Bedrock or an EC2/ECS service). Handle streaming responses if needed: Bedrock’s InvokeModelWithResponseStream returns an event stream of tokens. If an API client (like a web app) expects real-time token streaming, your Lambda or container can call InvokeModelWithResponseStream and relay events back (API Gateway supports WebSockets or making chunked responses via Lambda). This reduces perceived latency (time-to-first-byte).
  • Integration vs. Custom Glue: There’s an exam tendency to prefer native integrations over building from scratch. For example, to get embeddings, you could call a Bedrock embedding model directly and store results in OpenSearch. Or you might consider using Kendra’s index. But if an AWS service provides a direct feature (like OpenSearch’s Bedrock neural integration plugin, or Bedrock Knowledge Base ingestion), that is usually the correct approach. In contrast, building a pipeline with Firehose -> Lambda -> custom code to do the same might be a distractor option. Always choose the simplest solution with the least custom code that meets requirements (unless the question explicitly states a constraint that prevents using the managed service).

For instance, a question about embedding a large volume of data might list: A) Use Amazon OpenSearch’s vector engine with Bedrock; B) Write a Lambda to call a model and store in DynamoDB; C) Use S3 and Athena to query embeddings (which doesn’t make sense real-time). The best answer: use OpenSearch with the Bedrock integration (no need for custom Lambda).
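The prompt chaining pattern from the list above (summarize, then translate) can be sketched in a few lines. This is a minimal illustration, not a Step Functions or Bedrock Flows definition: `run_chain`, `fake_invoke`, and the prompt wording are hypothetical names made up for this example, and `fake_invoke` stands in for whatever actually calls the model.

```python
from typing import Callable, List

# A "step" turns the previous stage's text into the prompt for the next stage.
PromptStep = Callable[[str], str]


def run_chain(text: str, steps: List[PromptStep], invoke: Callable[[str], str]) -> str:
    """Feed the output of each model call into the next prompt."""
    current = text
    for build_prompt in steps:
        current = invoke(build_prompt(current))
    return current


def fake_invoke(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (e.g., a Lambda wrapping Bedrock).
    return f"<model output for: {prompt}>"


steps = [
    lambda doc: f"Summarize this document: {doc}",
    lambda summary: f"Translate this summary into French: {summary}",
]

result = run_chain("Quarterly report text", steps, fake_invoke)
print(result)
```

In a Step Functions version, each step would be a state and `current` would travel in the state output; the data flow is the same.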

In summary, Domain 2 is about putting the pieces together: orchestrators like Step Functions for structured flows, agents for dynamic action, and various AWS services (API Gateway, Lambda, AppConfig, etc.) to glue into existing apps. A strong theme is using the right tool for the job and minimizing operational overhead by leveraging managed services.

Domain 3: AI Safety, Security & Governance

Domain 3 (20%) focuses on designing GenAI solutions that are secure, compliant, and responsible. This includes user/data security (IAM, networking), content safety (to avoid inappropriate or harmful model outputs), and governance controls (monitoring, human review, audit trails). Many questions in this domain revolve around applying guardrails or restrictions to meet regulatory or ethical requirements.

Content Safety and Guardrails

Generative models can produce undesirable outputs (hate speech, biased content, PII leakage, etc.). AWS provides Bedrock Guardrails as the primary tool to enforce runtime safety policies. Know the key guardrail types:

  • Content Filtering: Blocks or tags outputs containing categories like hate, violence, profanity, sexual content, etc. Bedrock’s built-in filters include detection of prompt injection attacks and unsafe content.
  • Sensitive Data Protection: PII/PHI filters that detect personal data (names, emails, phone numbers, SSN, health info) in model inputs or outputs and either mask it or prevent it from being returned. For example, if a model tries to output someone’s address from training data, a guardrail can redact it.
  • Denied Topics: You can define custom forbidden topics or phrases that the model should not discuss. If the model’s response hits these, it will refuse or sanitize.
  • Word / Regex Filters: Custom word lists or regex patterns to block (e.g., company-internal code names or classified terms).
  • Grounding & Reasoning Checks: Contextual grounding checks verify that the model’s answer is actually based on the retrieved documents (for RAG scenarios); if the model introduces information not found in the context, the response can be flagged or blocked, which helps prevent hallucinations. Automated Reasoning checks can additionally validate responses against logical rules you define.

In exam questions, if they ask how to “prevent the model from giving legal advice” or “ensure answers are only from approved sources”, the answer is to use Bedrock guardrails with appropriate policies (like a denied topic or a grounding check) rather than just hoping the model will follow instructions. Do not rely on prompt instructions alone for critical safety. An option like “tell the model not to do X” is a trap, because a cleverly worded user prompt can still get the model to do it. Always choose an infrastructure/managed safety control (guardrails, IAM, etc.) for hard guarantees.
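To make the guardrail types above concrete, here is a rough sketch of a guardrail definition for the “no legal advice” scenario. The field names follow the shape of the boto3 `create_guardrail` request as I understand it; please verify them against the current API reference. The guardrail name, topic definition, messages, and threshold are all placeholders I made up.

```python
import json

# Hypothetical guardrail definition; every value here is a placeholder.
guardrail_config = {
    "name": "no-legal-advice",
    "blockedInputMessaging": "Sorry, I can't help with that request.",
    "blockedOutputsMessaging": "Sorry, I can't provide that information.",
    # Denied topic: blocks the subject itself, not just specific words.
    "topicPolicyConfig": {
        "topicsConfig": [{
            "name": "LegalAdvice",
            "definition": "Guidance on legal rights, liabilities, or court proceedings.",
            "examples": ["Can I sue my landlord?"],
            "type": "DENY",
        }]
    },
    # Content filters, including prompt-attack detection on the input side.
    "contentPolicyConfig": {
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    # Sensitive data protection: mask emails instead of returning them.
    "sensitiveInformationPolicyConfig": {
        "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}]
    },
    # Grounding check for RAG answers (threshold is an arbitrary example value).
    "contextualGroundingPolicyConfig": {
        "filtersConfig": [{"type": "GROUNDING", "threshold": 0.75}]
    },
}

print(json.dumps(guardrail_config, indent=2))
# With AWS credentials configured, the dict could be passed as keyword arguments:
#   boto3.client("bedrock").create_guardrail(**guardrail_config)
```

The point for the exam is that each requirement maps to a policy block in the guardrail, not to a sentence in the prompt.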

Also note: if the requirement is to detect problematic content but not block it (for auditing), guardrails can be configured in a “monitoring only” mode. Or you might use SageMaker Clarify for bias or trend detection on batches of outputs (discussed later).

Identity, Access Management & Isolation

Security for GenAI apps follows standard AWS best practices:

  • IAM Controls: Use IAM policies to restrict Bedrock usage. For example, a company might want to allow only specific foundation models (say, only Amazon Titan and Anthropic Claude, nothing else). You can scope the Resource element of the policy for the Bedrock invoke actions to the approved foundation-model ARNs, so that calls to any other model are denied. Also use Service Control Policies (SCPs) at the organization level to, say, deny Bedrock usage outside specific accounts or Regions. The exam might include a scenario such as “ensure only Titan and Claude are used, and log all interactions”; the solution is an SCP or IAM policy denying other models, plus CloudTrail (for logging).
  • Logging and Auditing: AWS CloudTrail captures Bedrock API events (e.g., CreateModelCustomizationJob, and InvokeModel if logged) and is crucial for auditing who invoked what model when. CloudWatch Logs can capture the actual prompts and responses (if enabled via the Bedrock model invocation logging configuration), but remember the 100KB limit per log event. For full payload capture, logging directly to S3 is recommended. Also, use AWS X-Ray to trace end-to-end requests (especially if multiple services are involved) and pinpoint latency or failures across a distributed GenAI app. If a question asks how to troubleshoot where latency is occurring (e.g., is it the Bedrock call or a pre-processing step?), X-Ray is the answer since it shows the breakdown by segment. CloudWatch Logs tell you what happened; X-Ray tells you where time was spent.
  • Network Security: If an enterprise requires GenAI calls not traverse the public internet, use VPC endpoints for Bedrock (Interface VPC Endpoint) and for related services (OpenSearch, S3, etc.). Also, if deploying custom models on SageMaker or hosting an API, ensure they run in a VPC with security groups locking down access.
  • Data Encryption: Enable KMS encryption for sensitive data at rest (S3 buckets with embeddings or training data, OpenSearch collections, etc.). Bedrock-managed data (like model artifacts) are encrypted by AWS, but your usage might involve storing prompt logs or results: secure those.
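As a sketch of the “only Titan and Claude” control from the IAM bullet above, one option is to scope the policy's Resource element to the approved foundation-model ARNs. The ARN patterns and statement IDs below are illustrative assumptions, so check the model IDs (and any inference profile ARNs your account uses) before relying on something like this.

```python
import json

# Illustrative ARN patterns for the approved model families (assumed, not exhaustive).
ALLOWED_MODEL_ARNS = [
    "arn:aws:bedrock:*::foundation-model/amazon.titan-*",
    "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
]
INVOKE_ACTIONS = ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowApprovedModels",
            "Effect": "Allow",
            "Action": INVOKE_ACTIONS,
            "Resource": ALLOWED_MODEL_ARNS,
        },
        {
            # An explicit deny wins even if another policy grants broader access.
            "Sid": "DenyAllOtherModels",
            "Effect": "Deny",
            "Action": INVOKE_ACTIONS,
            "NotResource": ALLOWED_MODEL_ARNS,
        },
    ],
}

print(json.dumps(policy, indent=2))
```

The explicit Deny with NotResource is the part that gives the hard guarantee; the Allow alone would not stop a second, more permissive policy.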

Responsible AI & Bias Mitigation

Beyond technical security, the exam may cover ensuring fairness and transparency:

  • Amazon SageMaker Clarify: A service for detecting bias in models and data, and explaining model predictions. In GenAI context, Clarify can run analyses on your model’s outputs to see if there is bias across groups (e.g., consistently giving different responses based on gender or race cues). An exam question could describe a need to evaluate model responses for subtle bias across demographics; the correct solution is to use Clarify’s bias detection in combination with representative test data. Clarify can also be part of a model monitoring solution to detect drift in model behavior over time (e.g., performance degrading or bias creeping in with new data).
  • Human Review (Augmented AI): For high-risk use cases, incorporate human-in-the-loop review. Amazon Augmented AI (A2I) is the AWS service that can route model outputs to human reviewers based on rules. For example, if a Bedrock output confidence is low or if the content is sensitive, A2I can create a task for a person to approve/edit it. If a question scenario is about “ensuring a human checks all outputs for medical advice before they go to users,” Amazon A2I is the service to use.
  • Model Evaluation vs. Monitoring: One common confusion the exam tests is distinguishing one-time evaluations (during development) from ongoing monitoring (in production). Bedrock Model Evaluation is typically a pre-deployment or offline step to compare models or check quality using datasets (more in Domain 5): it’s not a live monitoring tool. SageMaker Model Monitor and Clarify are used for continuous monitoring of deployed models (drift, bias over time). If a question asks how to measure a model’s bias across different user groups before launch, that’s an evaluation (Clarify with a test set). If it asks how to catch if a model’s responses become biased over time, that’s monitoring (Model Monitor with Clarify integration).

Governance & Compliance

  • Regulatory compliance: If a scenario involves regulations (HIPAA, GDPR, etc.), ensure data handling aligns (e.g., PHI is protected, PII not stored unencrypted, etc.), and consider specialized services: for healthcare data, AWS offers Comprehend Medical for NLP on medical text (which identifies PHI and medical entities). For financial or other domains, just show you know to follow the highest security (like not using a third-party model that sends data externally if that’s disallowed).
  • Generative AI Governance Frameworks: The exam guide references concepts like the Generative AI Security Scoping Matrix and OWASP Top 10 for LLMs. While you don’t need to memorize these, be aware of the general best practices: validate user inputs to avoid prompt injections; keep humans in the loop for critical decisions; monitor model outputs for policy compliance; implement least privilege for any actions the model/agent can take (so a buggy or malicious prompt can’t, say, delete databases, because IAM and scope limits prevent it); and provide transparency in generative content (e.g., watermarking images; Titan Image Generator automatically adds an invisible watermark to generated images for this purpose).

If a question asks about preventing misuse of the model or tracking its decisions, answers might include requiring explainability (though LLMs are black boxes, you might log the chain-of-thought or use smaller interpretable models as monitors) or using approval workflows for model changes (SageMaker Model Registry has an approval step for model deployment which is a governance control in MLOps).

Key takeaway: For any scenario that sounds like a security/safety risk or compliance need, think of the robust AWS service or feature that addresses it. That could be guardrails (content), IAM/SCP (access), VPC (network isolation), encryption, CloudTrail (audit), Clarify (bias), or A2I (human check). It’s rarely correct to “just rely on prompt engineering” or “manually review occasionally”; use the purpose-built service or automation.

Domain 4: Operational Efficiency & Optimization

Though only 12% of the exam, Domain 4’s topics are critical for selecting solutions that are cost-effective, performant, and scalable. Many questions frame trade-offs like on-demand vs. provisioned infrastructure, caching vs. real-time calls, or multi-region designs to handle load. Key areas include latency optimization, cost management, scalability, caching, and performance monitoring.

Cost Optimization: On-Demand vs. Provisioned Throughput

A unique concept in AWS GenAI is Provisioned Throughput for Amazon Bedrock, which is purchased in model units (MUs) and which these notes abbreviate as PTUs. On Bedrock, using models in “on-demand” mode means you pay per request and share capacity, whereas provisioned mode reserves dedicated capacity (model units) for a specific model:

  • On-Demand Inference: Default, pay-per-use model invocation. Suited for unpredictable or spiky workloads and for development/testing because you have no fixed costs. However, if the region’s capacity is strained or you make too many concurrent requests, you might see throttling (HTTP 429). Also on-demand doesn’t guarantee consistent low latency if many users share the service.
  • Provisioned Throughput: You purchase capacity for a specific model version in a specific region. This comes in units (MUs), where each unit allows a certain throughput of tokens per minute for that model. Benefits: (1) guaranteed throughput and more consistent latency, since you aren’t competing with other customers for shared capacity; (2) it is the usual way to serve custom models: a model you fine-tune or continue pre-training in Bedrock has traditionally had to be served through Provisioned Throughput rather than the shared on-demand pool (weights brought in through Custom Model Import are served and billed separately); (3) it suits predictable high-traffic periods (e.g., a known daily peak or big event), because it ensures your app can handle the load.

Drawbacks: you pay for it hourly whether you use it or not (commitment options of 1 or 6 months reduce cost). Also, PTUs do not auto-scale; you have to monitor usage and manually add capacity if needed. You’d watch CloudWatch metrics such as InvocationThrottles, Invocations, and token counts to know when to scale up.

Exam guidance: If a scenario mentions “predictable high load”, “peak season like Black Friday”, or “strict latency requirements for a custom model”, Provisioned Throughput is likely the answer. If it mentions “infrequent or periodic batch jobs”, on-demand is more cost-effective. Also, if they describe an idle model server (e.g., only used 1 day a week), on-demand is better to avoid paying for idle PTUs. They often test that “more expensive doesn’t always mean better”. What you want is right-sizing: provisioned is best when utilization will be high; otherwise it’s wasteful.
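The “provisioned is best when utilization will be high” rule is really a break-even calculation. Here is a back-of-the-envelope sketch; the two prices are made-up numbers purely for illustration (real Bedrock prices vary by model, region, and commitment term), and it ignores the input/output token price split.

```python
HOURS_PER_MONTH = 730

# Made-up prices for illustration only; look up real pricing for your model.
PRICE_PER_1K_TOKENS = 0.01   # on-demand, USD per 1,000 tokens (assumed)
PRICE_PER_MU_HOUR = 20.0     # provisioned, USD per model unit per hour (assumed)


def monthly_on_demand_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1000 * PRICE_PER_1K_TOKENS


def monthly_provisioned_cost(model_units: int = 1) -> float:
    # Paid for every hour of the month, whether or not traffic arrives.
    return model_units * PRICE_PER_MU_HOUR * HOURS_PER_MONTH


def cheaper_option(tokens_per_month: float, model_units: int = 1) -> str:
    on_demand = monthly_on_demand_cost(tokens_per_month)
    provisioned = monthly_provisioned_cost(model_units)
    return "on-demand" if on_demand <= provisioned else "provisioned"


for tokens in (100e6, 5e9):
    print(f"{tokens:>15,.0f} tokens/month -> {cheaper_option(tokens)}")
```

With these assumed prices, a single model unit only pays off at well over a billion tokens per month, which is why a model used one day a week is an on-demand answer.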

  • Async Workloads: If an application can handle async processing, you might not need as much provisioned capacity. For example, if results can be returned after some delay, you could queue requests and smooth out spikes rather than provisioning for peak. But be careful: if there are strict SLAs on responsiveness, capacity must still meet peak demand.
  • Compute Choices (SageMaker vs. Bedrock): If deploying an FM on SageMaker, you choose instance types. For cost optimization, right-size the instance type, or use serverless endpoints for sporadic traffic (SageMaker Serverless Inference scales to zero, though it is CPU-only and so only fits smaller models). Note that Amazon Elastic Inference, which used to attach fractional GPU acceleration to instances for infrequent workloads, is no longer available to new customers. In GenAI, large models usually need GPUs continuously, so Bedrock is often simpler.

Performance & Latency Optimization

Users expect responsive AI apps, so reducing latency is key:

  • Model Response Streaming: Using InvokeModelWithResponseStream allows the user to start receiving the model’s output token by token, rather than waiting for the full completion. This dramatically improves time to first token and user-perceived latency. The total time to complete may be the same, but streaming lets the client see partial results immediately (good for chatbots, real-time interactions). To implement: the client (or Lambda) reads from the response stream iterator and flushes output to the user incrementally.
  • Caching: For repetitive queries or prompts, caching results can save time and cost. This could be at the application layer (e.g., store recent Q&A pairs in memory or DynamoDB). Amazon API Gateway also has caching for endpoints if identical requests repeat. If you see scenarios of many identical or similar requests, mention caching to reduce model calls.
  • Scaling Vector Search: When dealing with large vector indexes (millions of embeddings), performance tuning matters. In OpenSearch, prefer fewer, larger shards for vector workloads rather than many small shards. Vector search is memory-heavy, so make sure each node has enough RAM to hold the vectors of the shards it hosts. If you see an issue like “OutOfMemory errors or slow OpenSearch vector queries”, the fix is usually to give the index more memory: larger or additional data nodes, with shards sized and distributed so that each node’s share fits in RAM. Wrong answers might mention things like UltraWarm storage (which is for infrequently accessed data and not suitable for real-time vector search). Also, avoid disabling replicas for performance; replicas help with read throughput. HNSW (Hierarchical Navigable Small World graph) parameter tuning in OpenSearch is unlikely to be needed in detail; just know that vector search typically uses HNSW under the hood and that memory is the constraint.
  • Multi-Region and Edge: For globally distributed users, consider deploying models in multiple regions or using AWS Wavelength for ultra-low latency to 5G/mobile devices (if scenario mentions 5G or sub-10ms latency at cell towers). Wavelength zones bring compute to telecom edge: might come up as a solution if the use case is, say, AR/VR or real-time generation on mobile networks.
  • Concurrency Limits: Bedrock and SageMaker have limits on concurrent requests. If you hit them, you can scale horizontally (more endpoints) or, in Bedrock’s case, use provisioned throughput, which generally raises the available capacity. Also, ensure the client implements exponential backoff and retries on 429 throttle responses. This is an AWS best practice for any service: if a question shows frequent throttling, the answer should include exponential backoff with jitter to retry gracefully instead of flooding the service with requests.
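For the streaming bullet above, the relay loop is simple: iterate over the event stream and forward each text fragment as soon as it arrives. The sketch below uses a fake stream so it runs without AWS access. The event envelope (`{"chunk": {"bytes": ...}}`) matches what I have seen from InvokeModelWithResponseStream, but the inner payload shape is model-specific; the Anthropic-style `content_block_delta` format used here is an assumption for this example.

```python
import json
from typing import Dict, Iterable, Iterator


def relay_text(events: Iterable[Dict]) -> Iterator[str]:
    """Yield text fragments as they arrive instead of waiting for the full completion."""
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue  # skip events that carry no payload chunk
        payload = json.loads(chunk["bytes"])
        # Assumed Anthropic-style delta shape; other models use different fields.
        if payload.get("type") == "content_block_delta":
            yield payload["delta"].get("text", "")


def fake_stream() -> Iterator[Dict]:
    # Hypothetical stand-in for response["body"] from a real streaming call.
    for piece in ["Hello", ", ", "world"]:
        body = {"type": "content_block_delta", "delta": {"type": "text_delta", "text": piece}}
        yield {"chunk": {"bytes": json.dumps(body).encode("utf-8")}}
    yield {"chunk": {"bytes": json.dumps({"type": "message_stop"}).encode("utf-8")}}


pieces = list(relay_text(fake_stream()))
print("".join(pieces))
```

Against the real service, `relay_text` would be fed `response["body"]` from `invoke_model_with_response_stream`, and each yielded fragment would be written to the WebSocket or HTTP response immediately.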

Cost Monitoring & Trade-offs

  • Cost Allocation: Tag resources (e.g., SageMaker endpoints, Step Function workflows) with project or application identifiers to track GenAI costs. Some exam scenarios may ask how to measure cost per request: you might use CloudWatch metrics (Bedrock publishes usage metrics by model) or AWS Cost Explorer with resource tags.
  • Business Value vs. Cost: A recurring exam mantra: don’t optimize cost at the expense of the solution not meeting needs. In other words, the cheapest solution isn’t always correct if it fails quality or SLAs. The exam expects you to choose an option that balances cost with effectiveness. E.g., if one solution uses a much smaller, cheaper model but clearly can’t handle the task, it’s wrong to pick it just because of cost. They explicitly caution that quality and business outcome trump raw cost savings. So look for the most cost-effective option that still meets requirements, rather than simply the lowest cost.
  • Detecting Idle/Underutilized Resources: If using SageMaker or ECS to host models, CloudWatch metrics (CPU/GPU utilization, invocation count) can reveal low usage. Use auto-scaling or schedule to shut down endpoints outside business hours if needed. For Bedrock PTUs, since they don’t auto-scale, you could proactively reduce capacity if consistently underutilized (though that’s manual in current state).

Domain 4 summary: Always consider both performance and cost. E.g., streaming vs. batch, provisioned vs. on-demand, multi-region vs. single region. The best answer will often mention an optimization that directly addresses the scenario’s pain point (e.g., user latency complaints -> use streaming and possibly local region deployment; high cost concern -> switch to batch processing or smaller model if it still meets needs).
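The application-layer caching mentioned under latency optimization also cuts cost, since a cache hit means no model call at all. Below is a minimal in-memory sketch; `ResponseCache` and `fake_invoke` are hypothetical names, and a production setup would more likely keep entries in DynamoDB or ElastiCache with the same keying idea.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class ResponseCache:
    """Tiny in-memory cache keyed by model ID plus normalized prompt text."""

    def __init__(self, invoke: Callable[[str, str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._invoke = invoke
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}
        self.misses = 0  # number of times the model was actually called

    @staticmethod
    def _key(model_id: str, prompt: str) -> str:
        # Lowercase and collapse whitespace so trivially different prompts share an entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model_id}\n{normalized}".encode("utf-8")).hexdigest()

    def get(self, model_id: str, prompt: str) -> str:
        key = self._key(model_id, prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: no model call
        self.misses += 1
        answer = self._invoke(model_id, prompt)
        self._store[key] = (now, answer)
        return answer


def fake_invoke(model_id: str, prompt: str) -> str:
    # Hypothetical stand-in for a Bedrock call.
    return f"answer from {model_id}"


cache = ResponseCache(fake_invoke)
first = cache.get("example-model", "What is RAG?")
second = cache.get("example-model", "  what is   rag? ")
print(first == second, cache.misses)
```

This only catches exact (normalized) repeats; matching merely similar questions would need a semantic cache built on embeddings, which is a bigger design.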

Domain 5: Testing, Validation & Troubleshooting

Domain 5 (11%) covers ensuring your GenAI solutions work correctly and continue to do so over time. This includes evaluating model outputs (both offline testing and online monitoring), validating prompts, and general troubleshooting techniques for when things go wrong (e.g., model not responding as expected, errors in pipelines).

Prompt Testing and Evaluation Methodologies

Before deploying, you should test prompts and models to verify they produce the desired results:

  • Prompt Engineering Iteration: Use tools like Amazon Bedrock Prompt Builder (part of Bedrock’s Prompt Management) to interactively test prompt variations. You can input sample prompts, try them on different models (Titan vs Claude, etc.), and refine wording. The exam may not go deep into UI tools, but conceptually, know that prompt management systems allow versioning and testing of prompts in a controlled way.
  • Prompt Management Service: Amazon Bedrock Prompt Management is a service to centrally manage prompt templates. It provides: (1) version control for prompts, so you can track changes and roll back if a new version performs worse; (2) parameters/variables in prompts, e.g., placeholders like {{customer_name}} that get filled in at runtime; (3) testing and comparison of prompt versions (A/B testing variants); and (4) approval workflows for prompt changes (like code review for prompts).

If the exam scenario reads “needs a system to templatize prompts with variables, enforce style guidelines (no emojis etc.), track usage and require approval for changes”, the answer is Bedrock Prompt Management with appropriate configuration. This was literally a practice question scenario. Prompt Management can enforce style by having prompts and guardrails work together (guardrails can reject outputs with a disallowed style, or the prompt itself can include instructions and format rules).
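To illustrate what “templatize prompts with variables” and “version control” mean in practice, here is a small local sketch of the idea. It is not the Bedrock Prompt Management API; `render`, `PROMPT_VERSIONS`, and the template text are hypothetical, and only the `{{customer_name}}` placeholder style comes from the notes above.

```python
import re
from typing import Dict

# Matches placeholders such as {{customer_name}} (optional inner spaces allowed).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: Dict[str, str]) -> str:
    """Fill every {{name}} placeholder, failing loudly if a variable is missing."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing prompt variable: {name}")
        return variables[name]
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical version history: version 2 adds a style rule and a second variable.
PROMPT_VERSIONS = {
    1: "Write a reply to {{customer_name}}.",
    2: "Write a concise, emoji-free reply to {{customer_name}} about {{topic}}.",
}

prompt = render(PROMPT_VERSIONS[2], {"customer_name": "Dana", "topic": "a late delivery"})
print(prompt)
```

Keeping versions addressable by number is what makes rollback and A/B comparison possible; the managed service adds storage, access control, and integration on top of the same idea.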

  • Automated Model Evaluation: Amazon Bedrock has a Model Evaluation feature where you can supply a dataset of prompts and expected answers (ground truth) to automatically score model outputs. It can measure metrics like: Accuracy (or more generally, quality) by comparing to a reference answer (e.g., using F1 score, BLEU for text, etc.). Robustness by checking how outputs change with slight input perturbations. Toxicity by detecting hate/offensive content rates.

You provide a JSONL file in S3 with entries like {"prompt": "…", "referenceResponse": "…"}, and the service generates evaluation metrics. This is useful to compare two models or a model before/after fine-tuning.
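A quick sketch of producing such a JSONL file: one JSON object per line, each with the `prompt` and `referenceResponse` keys described above. The two sample records and the `to_jsonl` helper are invented for illustration; the upload to S3 is left out.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"prompt", "referenceResponse"}

# Invented sample records; a real test set would come from your domain experts.
records: List[Dict[str, str]] = [
    {"prompt": "What is the refund window?", "referenceResponse": "30 days from delivery."},
    {"prompt": "Do you ship internationally?", "referenceResponse": "Yes, to most countries."},
]


def to_jsonl(rows: List[Dict[str, str]]) -> str:
    """Serialize rows as JSON Lines, rejecting rows that lack a required key."""
    lines = []
    for index, row in enumerate(rows):
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"row {index} is missing {sorted(missing)}")
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

Validating the keys before upload saves a failed evaluation job later, which is the kind of small operational detail that is easy to skip.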

  • If question: “How to evaluate which model performs better on your domain-specific questions before deployment?”: answer: Use Bedrock’s model evaluation on a curated test set.
  • If question: “How to ensure a fine-tuned model hasn’t regressed or introduced toxicity?”: answer: Run automated evaluations (including toxicity metric) on a validation set.
  • Human Evaluation (A/B testing): Sometimes humans need to rate outputs for qualities like “helpfulness” or adherence to brand voice, which are hard to quantify. Bedrock supports human evaluation jobs where you either bring your own reviewers (via a private workforce portal) or pay for an AWS-managed workforce. Use cases: comparing two models’ answers side by side and having people pick which is better. This is more for completeness and is likely less tested than automated means, but if you see subjective criteria, human evaluation might be the answer.
  • RAG-specific Metrics: When your solution uses retrieval + generation, you should evaluate three things. Faithfulness: does the model’s answer stick to the retrieved documents, or is it hallucinating? Relevance: is the answer actually answering the user’s question? (It could be correct information but not relevant.) Recall and precision in retrieval: did we retrieve all the needed information (recall), and are the top results actually relevant (precision)? Metrics like Context Recall (did we miss a document?) and Context Precision (was the key doc buried too low in the ranking?) are considered.

If the question is about evaluating a RAG system’s performance, mention checking these aspects. For example, if users report the AI gives incorrect or out-of-context answers, you might first measure Faithfulness, ensuring answers are grounded only in provided context (no hallucinations). Solutions to improve those tie back to RAG techniques (better retrieval or more strict prompting).
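The retrieval-side metrics are easy to compute once you have labeled which documents are relevant for a query. The sketch below uses simplified, ID-based definitions of context recall and precision at k; evaluation frameworks often compute these with an LLM judge instead, so treat this as an illustration of the idea only. The document IDs are made up.

```python
from typing import List, Set


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that were retrieved at all."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Made-up example: the retriever found doc1 (ranked third) but missed doc2.
retrieved_ids = ["doc3", "doc7", "doc1", "doc9"]
relevant_ids = {"doc1", "doc2"}

print(context_recall(retrieved_ids, relevant_ids))       # 0.5
print(precision_at_k(retrieved_ids, relevant_ids, k=2))  # 0.0
print(precision_at_k(retrieved_ids, relevant_ids, k=4))  # 0.25
```

Here the low recall says a needed document never reached the prompt, and the zero precision at 2 says the one relevant hit was buried below irrelevant chunks; both point at retrieval rather than at the model.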

Troubleshooting & Common Issues

When GenAI solutions misbehave, consider these areas:

  • Model Output Issues: The usual symptoms are hallucinations, irrelevant answers, or refusals to answer. Hallucinations: fix by providing better grounding (use the RetrieveAndGenerate API to force grounding, add a guardrail for grounding, or switch to a model known for factuality if possible); also make sure the context window isn’t exceeded (if the context is too large, part of it may be dropped). Irrelevant answers: these could be retrieval issues (the model got the wrong context), so fix retrieval first (as discussed in RAG best practices); if retrieval is fine, the question may not have been understood, so improve the prompt (rephrase the query or add few-shot examples). Refusals or overly safe completions: possibly a guardrail triggered falsely or the prompt was too vague, so check whether a guardrail or policy is blocking the answer.
  • Errors & Exceptions: If you hit Bedrock API rate limits (429 errors), implement exponential backoff on retries and consider provisioned throughput if sustained load is needed. If Step Functions or Lambda flows fail, use their native error handling (catch failures, send to a DLQ, etc.). For agents, use CloudWatch Logs and the agent’s trace output to see the reasoning steps. For memory errors in vector indexing (OpenSearch OOMs), add memory (bigger or more nodes) and revisit shard sizing. For timeouts, make sure your timeouts are set appropriately: GenAI calls can take many seconds, so if you use API Gateway or an ALB with Lambda, raise the timeout to accommodate them (API Gateway’s default integration timeout limit is 29 seconds).
  • Monitoring in Production: Set up CloudWatch Alarms on unusual metrics: e.g., a sudden spike in Bedrock InvocationServerErrors or InvocationThrottles, or a drop in Invocations, might alert you to an outage or model issue. Similarly, if using SageMaker, monitor CPU/GPU and memory; a spike to 100% could degrade performance, indicating a need for autoscaling or a bigger instance.
  • Upgrading Models: When a new model version is released (e.g., moving from Claude 3 to a later Claude generation), you might want to A/B test it. A safe deployment would use either a SageMaker shadow deployment or a canary release (e.g., 10% of traffic to the new endpoint, 90% to the old) and compare metrics. Bedrock doesn’t natively do canary traffic splitting (as of writing), so you might implement it at the app level or use Prompt Management variants to test offline.
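Since exponential backoff with jitter comes up in both Domain 4 and the troubleshooting list above, here is a minimal sketch of the pattern. `ThrottledError` is a hypothetical stand-in for whatever exception your client raises on a 429 (for boto3 that would be a `ClientError` whose error code is `ThrottlingException`), and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 / throttling error from the service."""


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Retry a throttled call, doubling the delay cap and sleeping a random fraction of it."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # "full jitter": anywhere between 0 and the cap
    raise RuntimeError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_call() -> str:
    # Fails twice with throttling, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError()
    return "ok"


print(call_with_backoff(flaky_call, sleep=lambda seconds: None))
```

The jitter matters as much as the exponential part: without it, many throttled clients retry at the same instants and get throttled again. The AWS SDKs also ship built-in retry modes, so in practice you often configure those rather than hand-rolling this loop.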

In exam scenarios, troubleshooting often ties back to earlier content: if a summary is irrelevant, maybe chunking is wrong (Domain 1); if an agent loops infinitely, add a circuit breaker or iteration limit in the orchestration, for example in Step Functions (Domain 2; note this is a different kind of safeguard from the content guardrails in Domain 3); if costs are high, check the Domain 4 strategies.

That wraps up the domain-specific sections. Next, we’ll highlight some GenAI customization techniques that may appear across domains, followed by focused best practices for RAG and prompt engineering.

GenAI Customization Techniques (Fine-Tuning, CPT, LoRA)

Often, you will need to adapt a foundation model to better suit your specific dataset or task. The exam expects you to distinguish between major model customization methods and know when each is appropriate. Key techniques:

  • Fine-Tuning: This refers to taking a pre-trained FM and updating its model weights by training on a labeled dataset for a specific task. Fine-tuning can significantly improve accuracy on niche tasks (like legal document QA), but it is resource-intensive (requires GPU training), time-consuming, and can risk overfitting if the dataset is small. Use fine-tuning when you have a supervised dataset of input-output pairs and the goal is to directly improve performance on those tasks (e.g., a chatbot fine-tuned on company Q&A pairs). In AWS, fine-tuning can be done via SageMaker JumpStart (which supports fine-tuning certain models) or in Amazon Bedrock for the models that support customization (e.g., some Titan models). Fine-tuning produces a new model, so you typically need to host it (which in Bedrock has usually meant purchasing Provisioned Throughput for the custom model).
  • Continued Pre-Training (CPT): Also called Domain Adaptation or Adaptive Pre-training. Instead of supervised learning, CPT uses unlabeled domain-specific text data to further train the model on that domain. It extends the model’s knowledge in a certain area without focusing on a single task. For example, taking 3 TB of company internal documents and doing a continued pre-training on that so the model gets familiar with company jargon and topics. CPT doesn’t require labels; it’s often unsupervised (next token prediction on new corpus). Use CPT when you have a large corpus in the target domain and want the model to “read” it to improve its general understanding, which can then benefit multiple tasks in that domain. The scenario in the exam might be: “model’s performance is suboptimal on a niche topic and we have a huge trove of unlabelled text in that topic”: the answer: Continued Pre-Training on that data. Fine-tuning wouldn’t apply because we lack labeled Q&A or tasks, and just retrieval might not embed the knowledge into the model itself.
  • Low-Rank Adaptation (LoRA): A parameter-efficient tuning method. LoRA freezes the original weights and adds a small number of trainable parameters (pairs of low-rank matrices) alongside selected weight matrices, typically in the attention layers. Training adjusts only these small matrices, which is much faster and uses far less memory than full fine-tuning. At inference, the LoRA matrices modify the model’s computations to achieve an effect similar to a full fine-tune. Use LoRA when you want to fine-tune a large model quickly and cheaply on modest hardware, and possibly maintain multiple sets of LoRA weights for different tasks. AWS supports LoRA in some contexts (SageMaker JumpStart for certain models, or you can manually apply LoRA and deploy on SageMaker). The exam likely expects this: if you see mention of “small adaptation without full retraining” or a need to frequently update the model for new data, LoRA is a suitable approach. It’s also often correct where fine-tuning appears too heavy for the scenario.
  • Adapters: Similar to LoRA, adapters are small bottleneck layers inserted into the model, trained on the new task. They achieve a similar goal of parameter-efficient tuning. The exam guide specifically mentions adapter techniques as in-scope. You can treat these conceptually like LoRA variants: both fall under parameter-efficient tuning.

To summarize usage:

  • Fine-tune: have labeled data for a new task (e.g., classification or Q&A) and need maximum performance.
  • CPT: have a ton of unlabeled text in the domain, want the model to absorb that knowledge.
  • LoRA/Adapters: have limited compute or need to frequently update tuning (e.g., quickly adapt model to each client’s data): lower cost and you can merge/unmerge these weights as needed.
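To see why LoRA is so much cheaper than full fine-tuning, it helps to count parameters for a single weight matrix. For a matrix of shape d_out x d_in, full fine-tuning updates d_out * d_in values, while a rank-r LoRA update trains two small matrices with r * (d_in + d_out) values in total. The dimensions below (4096 and rank 8) are just typical-looking example numbers.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values in the two low-rank factors (d_out x r and r x d_in)."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 4096, 4096, 8  # example sizes, not tied to a specific model

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tune: {full:,} trainable parameters")   # 16,777,216
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")  # 65,536
print(f"ratio: {lora / full:.2%}")                         # 0.39%
```

Because each adapter is this small, keeping one set of LoRA weights per client or per task is practical, which is the “merge/unmerge as needed” point above.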

Important: Model training from scratch or deep ML theory (like backpropagation details) is out of scope for this exam. If an answer choice delves into training optimizers or building a model architecture from zero, it’s likely incorrect. The focus is on using pre-built models and lightly customizing them for applications, not creating new models from nothing. So an option like “train a new model in PyTorch from random initialization” is not what AWS expects here; instead they want you to pick a Bedrock or JumpStart model and perhaps fine-tune or do CPT if needed.

Also, know that SageMaker Model Registry can be used to manage versions of fine-tuned models or LoRA variants, with an approval workflow for moving to prod. If a question mentions governing the lifecycle of custom models, the model registry is the right service (track model versions, who approved them, etc.).

Retrieval-Augmented Generation (RAG) & Vector Search Best Practices

We already covered much of RAG in Domain 1, but this dedicated section highlights best practices and common pitfalls that are worth reviewing for the exam (since RAG is so central to GenAI on AWS):

  • Prefer Managed Solutions: If available, use Bedrock Knowledge Bases or Kendra for managed ingestion and search, rather than building a custom pipeline from scratch. Managed services reduce operational burden. Kendra, for instance, is an enterprise search with built-in connectors to many data sources and can be used as a retrieval engine for RAG (it’s not vector-based by default, but has semantic capabilities and document QA features).
  • Vector DB Sizing: Ensure your vector index is sized for your data. For OpenSearch, three things matter. Memory: vector search keeps data in memory, so with millions of vectors you need sufficient RAM or you’ll get OutOfMemory errors. In the exam example, the fix for OOM was to add shards or nodes (scale out) to better distribute vectors in memory; wrong suggestions included UltraWarm (not for vectors) and “lazy loading”, which isn’t an actual feature. Shard count: as noted, too many tiny shards cause overhead, while too few huge shards might exceed node memory. The “fewer, larger shards” guidance holds unless your data is so large that one shard can’t fit, in which case you must add shards while aiming for a sensible balance. Multi-index segmentation: for very diverse datasets, you might split into multiple indexes (e.g., one per domain or data source) and query them separately to improve relevance. This was mentioned as a solution for scaling and segmentation.
  • Chunk Quality: Garbage chunks lead to garbage answers. Always ensure chunks have enough context to be meaningful but not so much that they dilute relevance. If an exam scenario says the retrieved text is “noisy” or includes a lot of irrelevant info, smaller chunks or better chunk logic could be the answer. Conversely, if important context is being cut off, chunk bigger or use overlap between chunks.
  • Embedding Strategy: Use a good embedding model for your data. By default, Amazon Titan Embeddings is a solid choice: it supports multilingual data and a large context length. If the exam question is about multi-language search, mention using an embedding model that supports those languages (Cohere and Titan both do). Also, embed the right things: e.g., for code search, use a model tuned for code embeddings; for images, you’d need a different approach (but likely out of scope).
  • Maintenance: RAG systems require upkeep: when documents are updated, their vectors need re-indexing. Set up pipelines (maybe with AWS Glue or custom scripts) to reprocess data periodically or upon changes. This likely won’t be a correct answer but can be a consideration in scenario reasoning.
  • Direct vs. Indirect Grounding: The simplest RAG uses direct retrieval (fetch doc, stuff into prompt). But consider if a more structured approach is needed: e.g., if the user asks a question that requires two pieces of info from different docs, a single retrieval might not suffice. An agent approach could retrieve piece A, then ask a follow-up for piece B. Or, ensure your index is built to handle such multi-part queries (possibly via query decomposition as mentioned).
  • Fallbacks: If vector search fails to find anything (no relevant context), how should your system respond? Many real-world systems either 1) answer “I don’t know” or 2) fall back to just the model’s knowledge (which might hallucinate). For production, it’s safer to respond with a default or escalate to a human if nothing was found. This detail is probably not tested, but it is something to recall for design questions.
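
To make the “Chunk Quality” point concrete, here is a minimal sketch of chunking with overlap. It splits on whitespace and counts words rather than tokens, which is a simplification; the function name `chunk_text` and its default sizes are illustrative assumptions, not anything prescribed by Bedrock Knowledge Bases.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration: real pipelines usually count tokens
    and respect sentence or section boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Smaller `chunk_size` values target the “noisy retrieval” symptom, while a larger `chunk_size` or `overlap` targets the “context cut off” symptom described above.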

In summary, mastering RAG involves both understanding the AWS services and general IR (information retrieval) techniques. Pay attention to clues in question prompts about what’s going wrong (e.g., “model outputs irrelevant corporate marketing fluff” could indicate the retrieval is pulling general sections like an “About Us” page rather than the answer, which could be solved by a re-ranking step focusing on answer relevance).
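
As a rough illustration of the re-ranking and fallback ideas above, the sketch below scores retrieved chunks by how many query terms they contain, drops anything below a threshold, and signals a fallback when nothing survives. Plain term overlap is only a stand-in for a real re-ranker model, and every name here (`rerank`, `build_grounded_prompt`, `FALLBACK_ANSWER`, the 0.2 threshold) is a hypothetical choice for the example.

```python
import re

FALLBACK_ANSWER = "I don't know based on the available documents."


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query, chunks, min_score=0.2):
    """Return (score, chunk) pairs at or above min_score, best first.

    Score = fraction of query terms that appear in the chunk.
    """
    query_terms = tokenize(query)
    if not query_terms:
        return []
    scored = [(len(query_terms & tokenize(c)) / len(query_terms), c) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def build_grounded_prompt(query, chunks, min_score=0.2):
    """Build a grounded prompt, or return None when no chunk is relevant enough."""
    ranked = rerank(query, chunks, min_score)
    if not ranked:
        return None  # caller should answer with FALLBACK_ANSWER or escalate
    context = "\n".join(f"- {chunk}" for _, chunk in ranked)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

A caller would send the prompt to the model when one is returned and otherwise reply with `FALLBACK_ANSWER` or hand off to a human, instead of letting the model answer from its own knowledge.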

Prompt Engineering & Prompt Management

While prompt engineering is woven throughout the domains, here we consolidate key techniques and tools:

  • Zero-Shot vs Few-Shot Prompting: Zero-shot: You just instruct the model without examples (e.g., “Translate this to French.”). It relies purely on pre-training. Simple but might not yield the best results for specialized tasks. Few-shot: You provide examples in the prompt (“Text: .. -> Sentiment: Positive” a few times, then ask for the new one). Few-shot learning can greatly improve performance on specific formats or tasks by priming the model with patterns.
  • Chain-of-Thought (CoT): Instructing the model to think step-by-step or showing an example of reasoning steps. This helps for complex problems where the model should not just spit out an answer but reason (e.g. math, logical puzzles). CoT can reduce mistakes and even hallucination because the model exposes its reasoning which you can verify.
  • ReAct prompting: This is the format behind agent prompts: the model outputs a “Thought” then an “Action” (tool call) and “Observation,” iteratively. In Bedrock, you don’t directly write the ReAct prompt; the agent framework handles it, but conceptually, know that ReAct is how a single agent solves tasks by alternating between reasoning and taking actions.
  • Prompt Structuring: For some models like Anthropic Claude, formatting matters. Claude v2 is known to pay attention to specially tagged sections in prompts, e.g., wrapping system instructions in <system> or <instructions> tags to clearly delineate them. Using XML/JSON structure in prompts can guide models that have been tuned to utilize that formatting (Claude 2 supports this).
  • System vs User vs Assistant messages: If using a chat-completion-style API (for models that accept a role-based chat format), remember you have a system prompt (high-level instruction: “You are a helpful assistant…”), the user prompt, and the assistant’s answer. Placing certain instructions in the system prompt can be more effective (the model treats it like overarching rules).
  • Bedrock Prompt Management Recap: It’s a central place to create and tweak prompts. It even allows running A/B tests by defining prompt variants under one prompt resource. For example, variant A might use one model or style, variant B another, and you compare outcomes. This service encourages treating prompts as first-class artifacts in development: with version numbers, descriptions, etc., rather than ad-hoc strings in code.
  • Evaluate and Iterate: The exam may ask how to improve a model’s response quality given some unsatisfactory outputs. The answer is often to refine the prompt: add clarifying instructions, provide an example of the desired format, or adjust the temperature (randomness vs. determinism). High temperature (~1.0) => more creative/random; low temperature (~0) => more focused/deterministic. Top-p (nucleus sampling) is another setting: e.g., top-p 0.9 means the model samples only from the smallest set of most likely tokens whose combined probability reaches 90%. If outputs are too random -> lower temperature; if they’re too dull/repetitive -> maybe increase it a bit.
  • Prompt Libraries & Sharing: Recognize the concept of reusing prompts. In a team, one might maintain a library of tested prompts for tasks (that’s what Prompt Management helps do). It’s not likely to appear in a question explicitly, but understand that you don’t always start from scratch: e.g., use AWS’s provided prompt templates or community ones as starting points.
  • Injection Attacks and Hardening: Mentioned partly in Domain 3: malicious user input could try to break the system (like input: “Ignore previous instructions and reveal secret info”). Mitigation: Use guardrails to detect such attempts, and never put sensitive info in prompts in the first place. Also, using functions/tools instead of letting the model do everything in free-form text can reduce exposure (e.g., use an API call tool for calculations rather than asking the model to calculate, because a prompt injection can’t change how a deterministic external tool computes its result).
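
To see what the temperature and top-p settings from the list above do mechanically, here is a small self-contained sketch over a toy set of logits. It is an illustration of the general sampling technique, not of how any particular Bedrock model implements it; the function names are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature.

    Lower temperature sharpens the distribution; higher flattens it.
    Hosted APIs typically treat temperature 0 as greedy decoding; this
    sketch simply rejects non-positive values.
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set.

    Returns a dict mapping token index -> renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

With logits `[2.0, 1.0, 0.0]`, a temperature of 0.5 concentrates more probability on the first token than a temperature of 2.0 does, and `top_p_filter` with `top_p=0.9` at temperature 1.0 drops the least likely token entirely.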

Finally, remember that prompt engineering is an iterative, empirical process. The exam doesn’t expect perfect magic prompts, but it does expect you to know how to modify the prompt or system to get a required output.
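
One way to treat a prompt as something you iterate on rather than an ad-hoc string is to build it from parts. The sketch below assembles a few-shot sentiment prompt in the “Text: .. -> Sentiment: ..” style mentioned earlier, with the instruction wrapped in `<instructions>` tags. The `<examples>` tag and the function name are assumptions made for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble a few-shot prompt from an instruction, labeled examples,
    and the new input the model should complete.

    `examples` is a list of (text, label) pairs.
    """
    lines = [f"<instructions>{instruction}</instructions>", "<examples>"]
    for text, label in examples:
        lines.append(f"Text: {text} -> Sentiment: {label}")
    lines.append("</examples>")
    # Leave the final label blank so the model fills it in.
    lines.append(f"Text: {new_input} -> Sentiment:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment as Positive or Negative.",
    [("I love this product", "Positive"), ("Terrible support", "Negative")],
    "The setup was painless",
)
```

Because the examples are data rather than hard-coded text, swapping or adding examples between iterations is a one-line change, which fits the variant-comparison workflow described under Prompt Management.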

Common Exam Traps, Shortcuts & Real-World Scenarios

To conclude, here are some frequently tested pitfalls and patterns distilled from practice questions and real-world experience. These are high-yield tips to avoid falling for distractors:

  • Infrastructure vs. Prompt for Safety or Reliability: If something must not fail, do not rely on the model’s reasoning alone. For example, to prevent infinite loops or excessive retries, implement a Step Functions circuit breaker or timeout: don’t just prompt the model “please don’t loop”. Similarly, to enforce access control, use IAM and network boundaries; don’t rely on the model to decide what it should access. In short: Operational issues (loops, timeouts) -> fix with orchestration/infrastructure; Security boundaries -> fix with IAM/VPC.
  • Guardrails vs. Prompts: Bedrock guardrails are for content filtering and safety. They are not a substitute for fixing logical errors or controlling access to tools. And guardrails intervene at runtime; they can block or redact content but won’t “teach” the model. So if the question is about preventing certain content (hate speech, PII) or forbidden topics, guardrails are the answer. But if it’s about, say, ensuring the model doesn’t call an API it shouldn’t, that’s IAM or not giving the agent that tool at all.
  • Retrieval Issues vs. Generation Issues: Diagnose where the failure is. If the model’s answer is wrong because it didn’t have the info, that’s a retrieval problem: improve your search (chunking, hybrid search, etc.). If the docs were there but the model still answered incorrectly, that’s a generation problem: maybe the prompt needs work, or you need a better model or more grounding. The exam expects you to choose solutions accordingly. A helpful memory aid: “Wrong docs -> retrieval fix; Wrong final answer -> generation fix.”
  • Prefer Native Integrations: If AWS has a built-in way, it’s probably right. E.g., to generate embeddings and store them for search, OpenSearch Neural with Bedrock is simpler than managing a Lambda to do it. To limit model usage to certain accounts, use a service control policy (SCP) instead of a complex custom solution. They want to see that you know AWS has specific features for GenAI (Knowledge Bases, Agent actions, etc.) and you won’t reinvent the wheel.
  • JWT and Identity: As noted, external identity tokens must be translated. So any answer suggesting directly trusting a Google or Azure AD token in Bedrock is wrong. Always involve Cognito or STS to get AWS credentials for that user if they need to interact with AWS services.
  • Evaluation vs Monitoring vs Cost Analysis: Model Evaluation (e.g., comparing two models’ accuracy or bias) happens before deployment or at design time. Use Bedrock evaluation or Clarify for bias eval. Monitoring (e.g., detecting drift or performance degradation) happens on a deployed model over time. Use Model Monitor, CloudWatch. Cost analysis is part of architecture decisions but not the only factor. The exam might have a trick option like “choose the solution because it’s cheapest on the AWS Pricing Calculator”: this alone is not a good justification if it compromises quality. Also, if a question asks “how to compare two models’ performance on actual traffic without affecting users?”, the answer is a shadow deployment: send production traffic to the new model in parallel, but only log its outputs rather than returning them to the user. (An A/B test, by contrast, does return the new model’s output to a share of users.) A shadow deployment is essentially an offline evaluation with live data.
  • Observability Tools: Connect symptoms to the right tool: Need to debug latency or trace a request path? -> X-Ray. Need to see model input/output content? -> CloudWatch Logs (and S3 for full logs). Need to attribute who did what? -> CloudTrail (shows which user or role invoked the model, when). Need to troubleshoot vector search accuracy? -> perhaps use Logs or metrics on relevance; if supported, use OpenSearch Trace or Kendra diagnostics.
  • Scaling GenAI Infrastructure: Many GenAI components don’t scale automatically (Bedrock Provisioned Throughput requires manual adjustment; OpenSearch Serverless scales OCUs automatically, but only up to the maximum capacity you configure). If a question describes saturating resources, the answer might be to scale out (more parallelism or units) rather than trying an unsupported feature. E.g., one distractor might be “enable Auto Scaling for Bedrock”, which currently doesn’t exist; you’d instead set CloudWatch alarms and use the SDK/CLI to adjust provisioned throughput manually or in some automated fashion. Or if hitting throughput limits in one region, use Cross-Region Inference to increase available capacity by leveraging another region.
  • Serverless First: AWS often favors serverless options in questions unless there’s a reason not to. For instance, if choosing between Amazon OpenSearch Serverless vs. self-managed OpenSearch on EC2, lean towards Serverless for simplicity and managed scaling, unless the scenario specifically requires a custom plugin only available in self-managed (and even then, note Bedrock’s integration is with OpenSearch Neural on managed service). The general rule: if an answer involves maintaining EC2 servers or a complicated self-built system where a managed service exists, it’s probably a distractor except in edge cases.
  • Tuning vs. Training: If an answer uses terms like “backpropagation, build a model from scratch, custom PyTorch training loop”, it’s likely wrong for this exam. Look instead for answers involving fine-tuning or using LoRA/adapters on an existing model.
  • Step Functions vs. Agents Summarized: Another quick rule from field notes: Step Functions for defined, repeatable processes; Agents for open-ended tasks. If the question scenario is basically a workflow pipeline (like processing a document through several fixed steps), Step Functions is typically the right choice. If it’s more like an AI assistant that can decide what to do (e.g., “find relevant info and draft a report, possibly using tools”), that’s where an Agent is better.
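
The first tip in this list (fix loops and timeouts in infrastructure, not in the prompt) can be sketched in plain Python. In a real deployment this role would typically be played by Step Functions retry limits and timeouts; the wrapper below is only an in-process analogue, and `run_with_circuit_breaker`, `LoopLimitExceeded`, and the default limits are hypothetical names and values chosen for the example.

```python
import time


class LoopLimitExceeded(RuntimeError):
    """Raised when the step or time budget is exhausted."""


def run_with_circuit_breaker(step, max_steps=5, timeout_seconds=30.0,
                             clock=time.monotonic):
    """Call step(attempt) until it returns a non-None result.

    Hard caps on iterations and wall-clock time are enforced here, outside
    the model, so a confused agent cannot loop forever no matter what the
    prompt says.
    """
    deadline = clock() + timeout_seconds
    for attempt in range(1, max_steps + 1):
        if clock() > deadline:
            raise LoopLimitExceeded(f"timed out after {timeout_seconds}s")
        result = step(attempt)
        if result is not None:
            return result
    raise LoopLimitExceeded(f"no result after {max_steps} steps")
```

Here `step` stands in for one agent iteration (for example, one model call plus one tool call). The caller catches `LoopLimitExceeded` and returns a default response or escalates, which mirrors the “operational issues -> orchestration/infrastructure” rule.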

Finally, real-world scenario awareness: many exam questions describe something a real company might face (e.g., handling sensitive financial data with GenAI, implementing a chatbot for customer support, summarizing legal documents with citations). As you read the question, identify the core requirements and constraints (e.g., “must cite sources” -> needs retrieval and grounding; “sensitive data” -> use guardrails and encryption; “high concurrency 1000 req/s” -> needs scalability, perhaps provisioned throughput or Step Functions for concurrency control). The correct answer will address all key points in the scenario in a balanced way.

Use this handbook as a reference to revisit topics you feel least confident about. Combine it with hands-on practice (e.g., trying out Bedrock in a demo environment, or answering practice questions) to solidify your understanding. With thorough preparation on these domains and concepts, you’ll be well-equipped to ace the AWS Certified Generative AI Developer, Professional exam. Best of luck on your certification journey!


Originally published on LinkedIn on January 23, 2026: Sharing My AWS Certified Generative AI Developer Prep Notes.
