AI Lexicon

Plain-English definitions for the vocabulary behind LLMs, RAG, prompt engineering, agents, harnesses, evals, tools, and production AI workflows.

Start here

Model language

Learn the basics: models, LLMs, tokens, context, inference, and hallucinations.

Build systems

Retrieval and agents

Connect models to knowledge, tools, state, permissions, and human review.

Ship it

Harnesses and evals

Wrap AI in tests, schemas, traces, guardrails, and production feedback loops.

AI model

A trained program that turns inputs into predictions, text, images, code, audio, or actions.

A model learns patterns from data during training. In day-to-day products, you usually use the finished model through an app or API.

GPT, Claude, Gemini, and image generators are all models with different strengths and input types.

LLM

A large language model that predicts and generates text one token at a time.

LLMs are useful for writing, summarizing, coding, planning, extracting data, and calling tools because they can follow language instructions.

When you ask ChatGPT to draft an email, an LLM is generating the next tokens that form the answer.

Token

A small chunk of text the model reads or writes, often part of a word.

Models do not process text exactly as words. They process tokens, and token count affects context limits, latency, and cost.

The sentence "Build an app" might be split into several tokens before it reaches the model.
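
As a rough illustration of that split, here is a toy greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for this sketch; production tokenizers use learned vocabularies and will split text differently.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
TOY_VOCAB = {"Build", " an", " app"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


tokens = toy_tokenize("Build an app")
```

Text outside the toy vocabulary falls back to single characters, which hints at why unusual words tend to cost more tokens.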

Context window

The maximum amount of text, files, tool results, and prior messages a model can consider at once.

A larger context window lets you include more source material, but it does not guarantee the model will use every detail well.

A coding agent may need repository files, the task, error logs, and recent edits inside the context window.
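
A minimal sketch of keeping a conversation inside a context budget, assuming a crude estimate of about four characters per token. Both `estimate_tokens` and `fit_to_window` are illustrative names; a real system would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text):
    # Assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def fit_to_window(messages, max_tokens):
    """Keep the newest messages whose estimated token total fits in max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 estimated tokens each
recent = fit_to_window(history, max_tokens=25)
```

Dropping the oldest messages first is only one possible policy; summarizing them instead is a common alternative.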

Inference

The moment a trained model runs and generates an answer for a specific request.

Training creates or improves the model. Inference is the everyday use of that model in a chat, app, agent, or API call.

Every time a chatbot streams a response, it is doing inference.

Multimodal model

A model that can work across more than one type of input or output, such as text, image, audio, or video.

Multimodal models let AI systems inspect screenshots, understand documents, generate images, or reason over spoken content.

A support agent can look at a user's screenshot and explain which setting to change.

Hallucination

A confident answer that is false, unsupported, or invented.

Hallucinations happen because models generate plausible continuations, not guaranteed truth. Grounding, retrieval, and evals reduce the risk.

A model cites a policy that sounds real but does not exist in the company handbook.

Prompt engineering

Writing instructions, context, examples, and constraints so a model reliably does the job.

Good prompts define the task, audience, inputs, output format, boundaries, and success criteria. For complex systems, prompts become part of the product architecture.

Instead of "summarize this", ask for three risks, two decisions, and owners in JSON.

System prompt

The high-priority instruction that defines the AI system's role, rules, style, and boundaries.

A system prompt is normally hidden from the end user. It sets durable behavior such as "answer as a support agent" or "never reveal secrets".

A coding assistant's system prompt may require tests before finalizing code changes.

Few-shot prompting

Including a few examples of desired input and output so the model can copy the pattern.

Few-shot examples are useful when the output style or classification rule is hard to explain in abstract instructions.

Show three customer messages and their labels before asking the model to label a new one.
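
One way to assemble such a prompt is sketched below. The example messages, the labels, and the `build_few_shot_prompt` helper are invented for illustration.

```python
# Hypothetical labeled examples for a support-ticket classifier.
EXAMPLES = [
    ("My card was charged twice.", "billing"),
    ("The app crashes when I log in.", "bug"),
    ("How do I export my data?", "how-to"),
]


def build_few_shot_prompt(examples, new_message):
    """Place labeled examples before the new message so the model can copy the pattern."""
    lines = ["Label each customer message with one category.", ""]
    for message, label in examples:
        lines.append(f"Message: {message}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Message: {new_message}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(EXAMPLES, "I was billed after cancelling.")
```

Ending the prompt with an open `Label:` line nudges the model to answer with just the label.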

Structured output

A response constrained to a predictable shape, usually JSON that matches a schema.

Structured outputs make model responses easier to validate, store, render, and pass into downstream code.

Ask for `{ "summary": string, "priority": "low" | "medium" | "high" }` instead of prose.

JSON schema

A formal description of the fields, types, and allowed values expected in JSON.

Schemas help apps reject malformed model output and make integrations safer than parsing free-form text.

A lead-scoring model can be forced to return a number, reasons, and a next action.
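
The sketch below shows the idea with a hand-rolled check standing in for a real JSON Schema validator. The field names in `LEAD_SCHEMA` and the `parse_lead` helper are assumptions for this example, not part of any standard.

```python
import json

# Hypothetical, simplified schema: field name -> accepted Python type(s).
LEAD_SCHEMA = {
    "score": (int, float),
    "reasons": list,
    "next_action": str,
}


def parse_lead(raw):
    """Parse model output; raise ValueError if it does not match LEAD_SCHEMA."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in LEAD_SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {field}")
    return data


lead = parse_lead('{"score": 82, "reasons": ["budget confirmed"], "next_action": "book demo"}')
```

Rejecting bad output at this boundary keeps malformed data out of downstream code.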

Context engineering

Designing what information enters the model context and in what order.

Context engineering includes selecting files, summaries, tool results, examples, memories, and instructions while avoiding noise.

A code agent performs better when it sees the failing test, relevant files, and project rules instead of the whole repo.

Prompt injection

An attack or accidental input that tries to override the AI system's instructions.

Prompt injection matters when models read untrusted text, web pages, emails, tickets, or documents and then take actions.

A webpage tells the browsing agent to ignore previous instructions and leak private data.

RAG

Retrieval-augmented generation: fetch relevant knowledge first, then ask the model to answer using it.

RAG can reduce hallucinations and keep answers current without retraining the model. It depends on good retrieval, chunking, ranking, and citations.

A company chatbot searches the policy docs before answering an HR question.
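
A minimal sketch of the retrieve-then-answer flow, using toy keyword matching in place of a real search index. `retrieve`, `build_rag_prompt`, and the policy sentences are hypothetical; the final model call is left out.

```python
def retrieve(question, documents, k=2):
    """Toy keyword retrieval: rank documents by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question, chunks):
    """Number the retrieved chunks so the answer can cite them."""
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them like [1]. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


POLICY_DOCS = [
    "Employees accrue vacation days monthly.",
    "Expense reports are due by the fifth of each month.",
    "Remote work requires manager approval.",
]
question = "How many vacation days do employees accrue?"
prompt = build_rag_prompt(question, retrieve(question, POLICY_DOCS, k=1))
```

The resulting prompt would then be sent to a model; the retrieval step decides what evidence the model gets to see.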

Embeddings

Numeric representations of text, images, or other data that capture meaning for similarity search.

Embeddings let systems find related content even when the exact same words are not used.

A search for "refund" can find a document that says "money back guarantee".
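
Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors for illustration only.
refund = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

related = cosine_similarity(refund, money_back)
unrelated = cosine_similarity(refund, weather)
```

With these toy values, `related` comes out far higher than `unrelated`, which is the property semantic search relies on.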

Vector database

A database optimized for storing embeddings and finding the nearest matches.

Vector databases are common in RAG systems because they make semantic search over large knowledge bases practical.

Store one embedding per document chunk, then retrieve the closest chunks for each question.
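
The store-and-retrieve pattern can be sketched with a brute-force in-memory class. `TinyVectorStore` and its vectors are hypothetical; production vector databases typically add persistence, filtering, and approximate nearest-neighbor indexes.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Brute-force store: compares the query against every stored embedding."""

    def __init__(self):
        self._rows = []  # list of (chunk_text, embedding) pairs

    def add(self, chunk, embedding):
        self._rows.append((chunk, embedding))

    def search(self, query_embedding, k=2):
        ranked = sorted(
            self._rows,
            key=lambda row: _cosine(query_embedding, row[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


store = TinyVectorStore()
store.add("Refunds are issued within 14 days.", [0.9, 0.1, 0.0])
store.add("Our office is closed on Sundays.", [0.0, 0.2, 0.9])
store.add("Money back guarantee applies to annual plans.", [0.8, 0.3, 0.1])
closest = store.search([1.0, 0.0, 0.0], k=2)
```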

Chunking

Splitting large documents into smaller pieces before indexing or sending them to a model.

Chunk size affects retrieval quality. Chunks that are too small lose context; chunks that are too large bring irrelevant material into the answer.

A 40-page PDF can be split by headings so each policy section becomes a searchable chunk.
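
A minimal sketch of heading-based chunking, assuming the document has already been converted to text in which each heading line starts with `# `. That marker and the `chunk_by_headings` helper are assumptions of this example.

```python
def chunk_by_headings(text, marker="# "):
    """Split text into one chunk per heading, keeping the heading as the title."""
    chunks = []
    title = None
    body = []

    def flush():
        content = "\n".join(body).strip()
        if title is not None or content:
            chunks.append({"title": title, "text": content})

    for line in text.splitlines():
        if line.startswith(marker):
            flush()  # close the previous section before starting a new one
            title = line[len(marker):].strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Leave\nStaff get 20 days.\n\n# Expenses\nSubmit receipts monthly.\n"
chunks = chunk_by_headings(doc)
```

Text that appears before the first heading is kept as a chunk with no title rather than discarded.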

Reranking

A second pass that reorders retrieved results by relevance before the model sees them.

Reranking improves RAG when the first search returns many plausible but uneven matches.

Retrieve 30 chunks from a vector search, then rerank and keep the best 6.
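
The second pass can be sketched as a re-sort with a stricter scoring function. The `overlap_score` function below is a toy stand-in; real rerankers are usually dedicated models, and both function names are hypothetical.

```python
def overlap_score(question, chunk):
    """Toy relevance score: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rerank(question, candidates, keep=6, score=overlap_score):
    """Reorder first-pass candidates by score and keep only the best few."""
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:keep]


candidates = [
    "refund policy for annual plans",
    "office opening hours",
    "how to request a refund",
]
best = rerank("how do I request a refund", candidates, keep=2)
```

Passing the scorer as a parameter makes it easy to swap the toy function for a model-based one later.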

Grounding

Tying an answer to supplied evidence instead of letting the model rely only on memory.

Grounded systems tell the model which sources to use and often require quotes, citations, or extracted facts.

Answer only from the uploaded contract and say when the answer is not present.

Citations

References that show which source supports a claim in the model's answer.

Citations are only useful when they point to retrieved evidence actually used by the answer, not just nearby documents.

A research assistant links each bullet to the report section it came from.

AI agent

An AI system that can decide steps, use tools, and work toward a goal across multiple turns.

Agents combine a model with instructions, tools, memory or state, permissions, and a control loop.

A coding agent reads a bug report, edits files, runs tests, and summarizes the fix.
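
The control loop at the center of an agent can be sketched as follows. A scripted function, `scripted_decide`, stands in for the model, and the action format is an assumption of this example rather than any framework's API.

```python
def run_agent(decide, tools, goal, max_steps=5):
    """Loop: ask the decider for an action, run the tool, record the result."""
    history = [{"role": "goal", "content": goal}]
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "finish":
            return action["answer"], history
        tool = tools[action["tool"]]
        result = tool(**action.get("args", {}))
        history.append({"role": action["tool"], "content": result})
    return None, history  # step budget exhausted without an answer


# Scripted stand-in for a model: run the tests once, then report.
def scripted_decide(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "run_tests", "args": {}}
    return {"type": "finish", "answer": f"Tests result: {history[-1]['content']}"}


TOOLS = {"run_tests": lambda: "2 passed"}
answer, history = run_agent(scripted_decide, TOOLS, "Fix the bug")
```

The `max_steps` cap is a simple safeguard against an agent that never decides to finish.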

Tool calling

Letting a model request a specific tool, such as search, database lookup, code execution, or email drafting.

The app decides whether to run the tool, passes results back to the model, and controls permissions.

The model calls a calendar tool to find open meeting slots.

Function calling

A structured way for a model to choose a named function and provide arguments.

Function calling is often the implementation detail behind tool calling. It turns model intent into code your app can validate.

The model returns `create_ticket({ title, priority })`, and your app creates the ticket.
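
A minimal sketch of the app-side handling, assuming the model's request arrives as a JSON string with `name` and `arguments` keys. That wire format, the `create_ticket` stub, and the argument checks are illustrative; each provider defines its own format.

```python
import json


def create_ticket(title, priority):
    # Stand-in for a real ticketing call.
    return {"id": 1, "title": title, "priority": priority}


TOOLS = {"create_ticket": create_ticket}
ALLOWED_PRIORITIES = {"low", "medium", "high"}


def handle_function_call(raw_call):
    """Validate a model-proposed call before running it."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    # Argument checks here are specific to create_ticket.
    if set(args) != {"title", "priority"}:
        raise ValueError("unexpected or missing arguments")
    if args["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError("invalid priority")
    return TOOLS[name](**args)


ticket = handle_function_call(
    '{"name": "create_ticket", "arguments": {"title": "Login fails", "priority": "high"}}'
)
```

The model only proposes the call; the app decides whether the name and arguments are acceptable before anything runs.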

MCP

Model Context Protocol: a standard way for AI apps to connect to external tools and data sources.

MCP servers expose capabilities such as files, databases, issues, calendars, or design tools through a common interface.

A coding assistant uses an MCP server to read GitHub issues without a custom integration.

Planner

The part of an agent that breaks a goal into steps and decides what to do next.

Some agents plan explicitly with visible steps. Others plan implicitly through the model prompt and control loop.

An agent first inspects the repo, then edits, then runs checks, then reports back.

Memory

Information an AI system stores or retrieves across turns, sessions, or users.

Memory can be a chat summary, user preference, project fact, vector store, or database record. It should be scoped and editable.

A tutor remembers that the learner prefers examples in Python.

Human in the loop

A design where a person reviews, approves, edits, or escalates important AI actions.

Human review is essential for high-impact actions, uncertain outputs, irreversible changes, or sensitive workflows.

An agent drafts a refund email, but a support lead approves it before sending.

Permissions

Rules that define what data and actions an AI system is allowed to access.

Good agents separate reading, writing, deleting, spending money, and external communication into explicit permission levels.

A code agent can read files automatically but must ask before pushing to production.
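
Explicit permission levels can be sketched as a small allow-list check. The action names and the `authorize` helper are hypothetical.

```python
# Hypothetical action categories for an agent.
AUTO_ALLOWED = {"read"}
NEEDS_APPROVAL = {"write", "delete", "spend", "send_external"}


def authorize(action, approved_by=None):
    """Allow reads automatically; require a named approver for riskier actions."""
    if action in AUTO_ALLOWED:
        return True
    if action in NEEDS_APPROVAL:
        return approved_by is not None
    return False  # unknown actions are denied by default


can_read = authorize("read")
can_write_unapproved = authorize("write")
can_write_approved = authorize("write", approved_by="support-lead")
```

Denying unknown actions by default means a newly added tool does nothing until someone assigns it a level.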

Workflow

A repeatable sequence of steps that turns inputs into a useful output.

Not every AI system needs to be a free-form agent. Many production systems are reliable workflows with a few model calls.

Upload invoice, extract fields, validate totals, route exceptions, and save to accounting.

Orchestration

Coordinating model calls, tools, retries, routing, memory, and human review.

Orchestration is the application logic around the model. It decides when to call which model or tool and what happens next.

A support system routes billing questions to RAG, bug reports to Linear, and account changes to approval.

Harness

The wrapper around a model or agent that supplies inputs, tools, checks, logs, and outputs.

A harness makes AI behavior testable and repeatable. It is where prompts, schemas, tool definitions, and validation meet code.

A customer-email harness loads the email, asks the model for a category, validates JSON, and logs the result.
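
That customer-email harness might look roughly like this. `fake_model` is a canned stand-in for a real model call, and the category list and record fields are assumptions of this sketch.

```python
import json

CATEGORIES = {"billing", "bug", "other"}


def fake_model(prompt):
    # Stand-in for a real model call; always returns the same JSON string.
    return '{"category": "billing"}'


def run_email_harness(email_text, model=fake_model, log=None):
    """Build the prompt, call the model, validate the JSON, and log the result."""
    prompt = f"Classify this email as one of {sorted(CATEGORIES)}. Reply in JSON.\n\n{email_text}"
    raw = model(prompt)
    record = {"email": email_text, "raw": raw, "category": None, "error": None}
    try:
        data = json.loads(raw)
        category = data.get("category") if isinstance(data, dict) else None
        if not isinstance(category, str) or category not in CATEGORIES:
            raise ValueError("category missing or not allowed")
        record["category"] = category
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        record["error"] = str(exc)
    if log is not None:
        log.append(record)
    return record


log = []
result = run_email_harness("I was charged twice this month.", log=log)
```

Because the model is passed in as a parameter, the same harness can run against a stub in tests and a real API in production.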

Eval harness

A test setup that runs AI tasks against examples and scores the outputs.

Eval harnesses catch regressions when prompts, models, retrieval, or tools change. They are the AI version of product tests.

Run 200 support tickets through the new prompt and compare accuracy before deploying.
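
A minimal sketch of that comparison, using exact-match scoring. The two classifier functions are toy stand-ins for an old and a new prompt, and the labeled examples are invented.

```python
def run_eval(classify, examples):
    """Run every example through the classifier and report accuracy plus failures."""
    failures = []
    correct = 0
    for text, expected in examples:
        predicted = classify(text)
        if predicted == expected:
            correct += 1
        else:
            failures.append((text, expected, predicted))
    accuracy = correct / len(examples) if examples else 0.0
    return accuracy, failures


# Invented labeled examples; a real set would hold hundreds of tickets.
EXAMPLES = [
    ("I was charged twice", "billing"),
    ("App crashes on login", "bug"),
    ("Refund my payment", "billing"),
    ("Button does nothing", "bug"),
]


def old_prompt(text):  # stand-in for the current prompt
    return "billing" if "charged" in text else "bug"


def new_prompt(text):  # stand-in for the candidate prompt
    return "billing" if ("charged" in text or "Refund" in text) else "bug"


old_accuracy, old_failures = run_eval(old_prompt, EXAMPLES)
new_accuracy, new_failures = run_eval(new_prompt, EXAMPLES)
```

Returning the failures alongside the score matters in practice, since the failing cases usually say more than the number does.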

API

A programmatic interface that lets one app call another service.

Most AI products call model APIs to send messages, upload files, define tools, stream responses, and receive structured outputs.

Your app sends a user question to a model API and streams the answer back into the UI.

SDK

A software development kit that provides libraries and helpers for using an API.

SDKs reduce repetitive code for authentication, streaming, file uploads, retries, and tool definitions.

A JavaScript AI SDK can stream model output into a Svelte or React component.

Webhook

A URL another service calls when an event happens.

Webhooks are common in AI automation because they start workflows from external events like a new ticket, payment, or message.

A new form submission triggers a webhook that asks an AI model to qualify the lead.

State

The saved facts that describe where a workflow, user, or agent currently is.

State keeps systems from treating every step like a brand-new conversation. It can include status, history, decisions, and pending actions.

An agent remembers that tests failed once and should inspect the failing file next.

Evaluation

Measuring whether an AI system produces useful, correct, safe, and consistent results.

Evaluation can be manual review, automated tests, model-graded scoring, exact-match checks, or real user feedback.

Score summaries for accuracy, missing facts, tone, and whether they cite the right source.

Benchmark

A shared test or leaderboard used to compare models or systems.

Benchmarks are useful for direction, but your own evals matter more because they match your data, users, and failure modes.

A model wins a coding benchmark but still performs poorly on your private codebase conventions.

Guardrails

Rules and checks that constrain what an AI system can say or do.

Guardrails can be prompts, schema validation, content filters, permission checks, human approval, or post-processing.

Reject model output that asks for a wire transfer without manager approval.

Temperature

A setting that controls how varied or conservative model outputs are.

Lower temperature usually gives more consistent answers. Higher temperature can add variety but may increase mistakes.

Use low temperature for invoice extraction and higher temperature for brainstorming campaign names.
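
Temperature is commonly applied by dividing the model's raw scores (logits) by the temperature before converting them to probabilities. The sketch below shows that arithmetic with made-up logits; providers may implement the setting differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
low = softmax_with_temperature(logits, 0.5)   # more peaked
high = softmax_with_temperature(logits, 2.0)  # flatter
```

At the lower temperature the top token takes a larger share of the probability, which is why outputs become more consistent.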

Top-p

A sampling setting that limits generation to the smallest set of most likely next tokens whose combined probability reaches a threshold p.

Top-p is another control for output variety. Many teams tune either temperature or top-p, not both at once.

Reducing top-p can make responses less surprising and more repeatable.
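
The filtering step can be sketched as follows, using an invented next-token distribution. `top_p_filter` is an illustrative name; the actual sampling happens inside the model provider.

```python
def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: prob / total for token, prob in kept}


# Invented distribution over four candidate next tokens.
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
narrow = top_p_filter(next_token_probs, 0.4)
wider = top_p_filter(next_token_probs, 0.75)
```

A smaller p removes the unlikely tail of candidates, so surprising tokens can no longer be sampled.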

Rate limit

A cap on how many requests or tokens can be used in a time period.

Rate limits protect providers and your own budget. Production apps need queues, retries, and user-visible fallback behavior.

Your app may be allowed 10,000 tokens per minute before requests are slowed or rejected.
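
A client-side budget check can be sketched with a fixed-window counter. The `TokenRateLimiter` class is hypothetical, and providers enforce their own limits with their own algorithms; this only helps an app avoid sending requests it expects to be rejected.

```python
import time


class TokenRateLimiter:
    """Fixed-window limiter: at most `limit` tokens per `window` seconds."""

    def __init__(self, limit=10_000, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self.window_start = clock()
        self.used = 0

    def try_spend(self, tokens):
        """Return True and record usage if the request fits the current window."""
        now = self.clock()
        if now - self.window_start >= self.window:
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


fake_time = [0.0]
limiter = TokenRateLimiter(limit=10_000, window=60.0, clock=lambda: fake_time[0])
first = limiter.try_spend(6_000)
second = limiter.try_spend(5_000)  # would exceed 10,000 in this window
fake_time[0] = 61.0
third = limiter.try_spend(5_000)   # a new window has started
```

When `try_spend` returns False, the app can queue the request, retry later, or show a fallback message.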

Observability

The logs, traces, metrics, and dashboards that show what an AI system is doing.

AI observability tracks prompts, tool calls, retrieved sources, model versions, token use, latency, errors, and user feedback.

When an answer is wrong, observability shows which documents were retrieved and what prompt was sent.

Tracing

Recording each step in an AI workflow so you can debug what happened.

A trace can include the user request, model calls, tool calls, retrieved chunks, outputs, errors, and timings.

A trace reveals that an agent used the wrong tool before producing a bad answer.
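
A minimal sketch of step recording using a context manager. The `Trace` class and its field names are assumptions of this example, not any tracing product's API.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects one entry per step, including timing and any error."""

    def __init__(self):
        self.steps = []

    @contextmanager
    def step(self, name, **details):
        start = time.perf_counter()
        entry = {"name": name, "details": details, "error": None}
        try:
            yield entry  # the caller can attach more details while the step runs
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            entry["seconds"] = time.perf_counter() - start
            self.steps.append(entry)


trace = Trace()
with trace.step("retrieve", query="refund policy") as entry:
    entry["details"]["chunks_found"] = 3

try:
    with trace.step("call_tool", tool="calendar"):
        raise RuntimeError("tool timed out")
except RuntimeError:
    pass
```

Failed steps are still recorded, which is usually the part of the trace you need most when debugging.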

Fine-tuning

Additional training that adapts a base model to a specific task, style, or domain.

Fine-tuning is useful when examples are stable and prompts are not enough. It does not replace good product design or retrieval.

Fine-tune a smaller model to classify internal support tickets in your company taxonomy.

LoRA

Low-Rank Adaptation, a lightweight way to adapt a model without retraining every parameter.

LoRA creates small adapter weights that can be cheaper to train and swap than full fine-tuning.

An image model uses a LoRA adapter to learn a specific illustration style.

Distillation

Training a smaller model to imitate a larger or stronger model.

Distillation can reduce cost and latency when the task is narrow enough for a smaller model to handle.

Use a top model to label examples, then train a cheaper classifier on those examples.

Latency

How long a user waits for an AI system to respond.

Latency depends on model choice, input size, output length, tool calls, network time, caching, and streaming.

A search-and-answer workflow may feel slow because retrieval, reranking, and generation all happen before the final answer.

Cost per token

The price charged for model input and output tokens.

AI cost is usually tied to how much context you send, how much text the model generates, and which model you use.

Summarizing long transcripts with a premium model costs more than classifying a short message with a smaller model.
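
The arithmetic behind that comparison is sketched below. All prices and token counts are invented for illustration; check your provider's current pricing.

```python
def request_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost of one request, given separate prices per million input and output tokens."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical prices in dollars per million tokens.
transcript_summary = request_cost(50_000, 1_000, input_price_per_million=10.0, output_price_per_million=30.0)
short_classification = request_cost(200, 5, input_price_per_million=0.5, output_price_per_million=1.5)
```

With these made-up numbers the transcript summary costs about 0.53 dollars per request, while the short classification costs a small fraction of a cent.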

Caching

Reusing previous work so the system does not repeat the same expensive operation.

Caching can store prompts, retrieved results, embeddings, generated summaries, or full model responses when inputs are stable.

Cache embeddings for uploaded documents instead of recomputing them for every question.
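
A minimal sketch of an embedding cache keyed by a hash of the text. `EmbeddingCache` and the fake embedding function are hypothetical; a production cache would usually live in a database or key-value store rather than in memory.

```python
import hashlib


class EmbeddingCache:
    """Computes each text's embedding once and reuses it afterwards."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times the expensive function actually ran

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Stand-in for a paid embedding API call.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
first = cache.get("Refund policy section")
second = cache.get("Refund policy section")  # served from the cache
```

Hashing the content means an edited document gets a new key, so stale embeddings are not reused.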

Next move

Use the terms inside the course

The lexicon is the vocabulary layer. The course sections show where these ideas appear in real prompting, automation, research, and AI-assisted work.
