prompt engineer · hermosa beach, ca

> Marcus Hobbs

#AI Innovation Engineer — building the meta-tools engineering teams use to ship AI features safely

I design the systems engineering teams use to ship AI features safely — multi-agent code-change workflows, subagent contracts, schema-first prompts, evaluation harnesses. My job is to make the AI engineering SDLC itself a designed product.

4×

production apps shipped

8 → 3

weeks to ship

443

ablation runs · 3 model tiers

// marcushobbs@mac.com · linkedin.com/in/marcushobbs

##about

I build reliable AI coding agents at enterprise scale. The answer isn't better models — it's better prompt engineering and systematic evaluation.

My methodology: let friction surface the gaps. Every time I correct an agent twice, that correction becomes a persistent constraint. The result is behavioral specifications across the full Claude Code toolkit — CLAUDE.md, Skills, Rules, Hooks, Commands, Memory — that make agents reliable from the first prompt.

This approach has shipped 4 production applications with compounding velocity: each project shipped faster than the last (8 → 6 → 4 → 3 weeks) as learnings accumulated.

I'm now focused one level up: from making a single agent reliable to making it feel safe for a whole engineering team to ship with AI. That means multi-agent orchestration with hard tool-call budgets, file-domain isolation between subagent teams, JSON-schema-enforced outputs, risk-driven mode selection, and durable task-log contracts between planning and execution. The work shifts from "how do I prompt this well" to "how do I design a workflow where the model can't go off the rails, and where the cost of being wrong is bounded by the contract, not by hope."

##methodology

A systematic approach to making AI coding agents reliable at enterprise scale.

"I design systems that trigger their own improvement cycles. Friction detection, learning, and replanning happen continuously — not when I remember to ask."

the L3 thesis — self-improving systems

Every principle below serves this goal: removing the human as the bottleneck for planning, learning, and iteration.

[01] friction

##friction-driven

Bootstrap the loop by letting friction surface what's missing. Don't over-engineer prompts upfront — let failures reveal the gaps, then encode the fixes into persistent context.

[02] loop

##compound-learning

Every solved problem documented in searchable format. Agents search past solutions before planning. The system accumulates intelligence across sessions — knowledge compounds automatically.

[03] meta

##meta-prompting

Claude generates the prompts that Claude executes. Design specs, implementation plans, behavioral constraints — all authored by AI, curated by humans. The engineer designs the system that writes itself.

[04] tree

##context-hierarchy

Full Claude Code toolkit: CLAUDE.md for behavioral specs, Skills for capabilities, Rules for constraints, Hooks for automation, Memory for persistence. Context inheritance flows from project root through the directory tree.

[05] eval

##empirical-prompts

Prompts are hypotheses, not products. Ablation testing reveals load-bearing components. Cross-model validation proves generalization. LLM-as-judge scales evaluation. If you can't measure it, you're guessing.

[06] schema

##schema-first-prompts

The prompt's job is to populate a schema, not to produce prose. Define the JSON output shape first; prompts converge faster, downstream consumers don't need to parse intent, and disagreement becomes typeable.
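
A minimal sketch of the idea, not the production suite: the output shape is declared first, and anything the model returns is rejected unless it fits. The field names, the severity values, and `parse_review` are hypothetical, and the hand-rolled check stands in for a real JSON Schema validator.

```python
import json

# Hypothetical output shape for a review subagent: field name -> required type.
REVIEW_SCHEMA = {
    "severity": str,
    "file": str,
    "fixed": bool,
    "summary": str,
}
ALLOWED_SEVERITIES = {"low", "medium", "high"}  # assumed tiers


def parse_review(raw: str) -> dict:
    """Parse model output and reject anything that does not match the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError, so prose fails here
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REVIEW_SCHEMA.keys() - data.keys()
    extra = data.keys() - REVIEW_SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in REVIEW_SCHEMA.items():
        if not isinstance(data[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError("severity must be one of low/medium/high")
    return data
```

Because every failure is a `ValueError`, a caller can treat "the model wrote prose" and "the model invented a field" the same way: as a contract violation, not as something to interpret.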

[07] budget

##budget-the-loop

Hard tool-call budgets, retry caps, and phase boundaries are not pessimism — they're the contract that makes a multi-agent system debuggable. Without budgets, "the agent kept trying" is indistinguishable from progress.
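
One way to picture a hard budget, as a sketch under assumptions: `ToolBudget`, `BudgetExceeded`, and `run_with_retries` are hypothetical names, and a real orchestrator would count calls wherever tools are actually dispatched.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend more than it was given."""


class ToolBudget:
    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.used = 0

    def spend(self, tool_name: str) -> None:
        """Record one tool call, or halt the loop if the budget is gone."""
        if self.used >= self.max_calls:
            raise BudgetExceeded(
                f"{tool_name}: budget of {self.max_calls} calls exhausted"
            )
        self.used += 1


def run_with_retries(step, budget: ToolBudget, max_retries: int):
    """Call step() until it returns a non-None result or a cap is hit."""
    for _ in range(max_retries):
        budget.spend("step")  # every attempt costs one call
        result = step()
        if result is not None:
            return result
    raise BudgetExceeded(f"gave up after {max_retries} retries")
```

Either cap ends the loop with an exception that names which limit was hit, so a stuck agent shows up in the logs as a stop with a reason.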

##experience

[2025 ─ present]

Irvine, CA

~/planet-dds

AI Innovation Engineer

Innovate the AI engineering SDLC. Operate in a high-risk-tolerance greenfield product that integrates with low-risk-tolerance brownfield production systems. The throughline: treat the SDLC as behavioral engineering, not artifact production. The work is shaping agent reasoning through ablation, evaluation, and survey; the suites, agents, and playbooks engineering teams use are systems designed to recursively self-improve.
Design and continuously evolve the multi-agent prompt suites engineering teams use to ship production code. Multi-phase workflows (plan → work → review) with subagent teams, file-domain isolation between backend/frontend implementers, hard tool-call and retry budgets, JSON-schema-enforced subagent outputs, and task-log files as the durable contract between phases. Risk-driven Light/Heavy mode selection routes changes by what surfaces they touch — schemas, migrations, auth, services, API contracts. The suites get sharper through incident-driven ablation: every production failure becomes a hypothesis about where the agent's reasoning broke, and the refinement encodes the answer back into the contract.
Design and ship the org's registered subagents to a strict contract — frontmatter metadata, restricted tools, JSON output schemas, categorical probes grounded in actual production incidents, active-fixer (not passive-reporter) under hard tool-call budgets. Evaluate agent behavior continuously against the contract, not just the output: did it fix the right severity tier, did it stay within budget, did it bounce when it should have fixed? The code-review agent fixes high-severity issues in-place; the same contract scaffolds every agent the org ships.
Run an ongoing cross-team pattern-extraction program. Survey teams' prompt suites (slash commands, subagent templates, accumulated tooling) for what generalizes vs what's load-bearing on local context. The empirical test: does the pattern survive translation to a different codebase's conventions? Pull what travels into the shared toolkit; publish teammate-facing playbooks. Recent cycle: ~6,800 lines / 21 slash commands / 5 subagent templates surveyed; five transferable patterns adopted into a different team's greenfield suite; report published so other AI engineers can pilot the patterns without re-doing the archaeology.
LLM-driven automation outside the IDE: health-check pipelines delivering adaptive cards via launchd-scheduled diagnostics, multi-line prompt substitution via perl + env vars, self-locating data files for scripts deployed outside the repo.

mytooth.io Agentic Marketplace (AI Confirmation Agent)

[2023 ─ present]

~/mastermind-alliance

Mastermind Alliance · open-source AI persona dialogue + ablation research

Open-source AI persona dialogue system with prompt ablation research — public on GitHub.
Prompt Ablation Study — systematic research testing which system prompt components are load-bearing vs decorative for famous persona embodiment (Nietzsche, Aurelius, Watts). 443 experimental runs with transcripts. Key finding: minimal prompts work surprisingly well for famous figures — model priors are strong enough that explicit style guidance may be redundant.
Mastermind tab — multi-persona roundtable discussions; select 2–5 historical/philosophical figures and watch them debate your question with authentic voice.
Context engineering applied — dynamic persona hierarchy via CLAUDE.md inheritance. Hot-swappable AI providers (OpenAI, Anthropic, OpenRouter). Skip logic prevents duplicate experimental runs.
Stack: Next.js 15, React 19, TypeScript, Vercel AI SDK, Server-Sent Events.
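
The skip logic mentioned above can be sketched as follows. This is an illustration in Python, not the repository's TypeScript code: `run_key`, `run_once`, the file layout, and the `generate` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def run_key(persona: str, variant: str, model: str) -> str:
    """Stable identifier for one experimental configuration."""
    payload = json.dumps([persona, variant, model])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_once(results_dir: Path, persona: str, variant: str, model: str, generate) -> bool:
    """Run generate() unless a transcript for this configuration already exists.

    Returns True if a new run happened, False if it was skipped.
    """
    out = results_dir / f"{run_key(persona, variant, model)}.json"
    if out.exists():
        return False  # already have this run; do not pay for it again
    transcript = generate(persona, variant, model)
    record = {"persona": persona, "variant": variant, "model": model,
              "transcript": transcript}
    out.write_text(json.dumps(record))
    return True
```

Keying on the full configuration means an interrupted batch can be restarted from the top and only the missing runs execute.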

github.com/marcus-w-hobbs/mastermind-alliance

[2020 ─ present]

~/wilsonic

Freelance · C++ audio plugin

C++ cross-platform audio plugin for Mac, Windows, Linux. Audio processing + UI on JUCE.
Real-time microtonal scale design. Interactive exploration of mathematical objects via psychoacoustics. Parameters automatable in the DAW.

github.com/marcus-w-hobbs/Wilsonic-MTS-ESP · wilsonic.co

[2023 ─ 2024]

~/brilliant

Senior Engineering Manager

Led full-stack delivery across web and mobile. React, Swift, Kotlin, Node.js, GraphQL.

brilliant.org

[2019 ─ 2023]

San Francisco, CA

~/credit-karma

Engineering Manager II

Managed full-stack teams shipping financial products. React + Node + GraphQL + MongoDB. Secure auth, PII handling, consistent revenue and engagement tracking across products.

creditkarma.com

[2014 ─ 2019]

El Segundo, CA

~/att

Associate Director, Software Engineering

Led full-stack and mobile development of AT&T's streaming video service. Brand refresh, React Native migration, multi-vendor global delivery.
Led the Innovation Lab — ML prototypes for sentiment analysis of social media.

[2011 ─ 2014]

El Segundo, CA

~/directv

Senior Software Engineer

Migrated codebase from Objective-C to Swift. Shipped flagship iPad and Apple TV streaming apps.

##featured-projects

##prompt-engineering

Systematic approach to reliable AI agent development

A comprehensive methodology and toolkit for making AI coding agents reliable at scale. Demonstrates measurable acceleration in project delivery (8→6→4→3 weeks) through friction-driven context refinement, hierarchical CLAUDE.md specifications, and autonomous verification loops.

// key features

Friction-Driven Refinement: Converting agent failures into persistent context rules
Hierarchical Context Inheritance: Directory-based CLAUDE.md specifications
Autonomous Verification: Agents validate their own work via CDP tooling
Compound Learning System: Self-reinforcing solution documentation that forces agents to search past failures before planning — knowledge compounds across sessions
Production Implementation Prompts: 1500+ line Figma-to-code specifications
Parallel Worktree Orchestration: Multi-agent execution with deterministic port management
MCP Integration: Figma Remote MCP for design-to-code automation

[01]

4 production applications

[02]

50% faster delivery cycles

[03]

30+ compound solutions in week one

// tech stack

CLAUDE.md · Skills · Slash Commands · Rules · Memory · Hooks · CDP · MCP · Figma Integration

impact

Demonstrates how systematic context management transforms agents into reliable executors at scale. Behavioral constraints in CLAUDE.md encode values as persistent rules rather than per-prompt instructions; friction-driven refinement works like reinforcement learning at the workflow level, with each correction acting as the feedback signal. Directly applicable to any team scaling AI coding agents in production.


##mastermind-alliance

AI persona dialogue system with prompt ablation research

Open-source platform for multi-persona philosophical dialogues, paired with systematic prompt engineering research. The ablation study tests which system prompt components are load-bearing vs decorative when embodying famous historical figures — including cross-model validation across Haiku, Sonnet, and Opus.

// key features

Prompt Ablation Study: Systematic removal of prompt components (tone, framework, rhetoric, themes, constraints) to measure impact on persona authenticity
Cross-Model Validation: Same prompts tested across Haiku (weak), Sonnet (mid), and Opus (strong) to prove findings generalize
Key Research Finding: Minimal prompts work across ALL model tiers — priors are in training data, not gated by capability
Mastermind Tab: Multi-persona roundtable discussions with 2-5 historical/philosophical figures debating user questions
Backrooms Tab: Stream-of-consciousness AI dialogues exploring liminal spaces of thought
LLM-as-Judge Evaluation: Automated scoring framework for persona embodiment quality
Interactive Playground: Explore results, compare outputs, and see the raw data
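
To make the ablation idea concrete, here is a sketch that builds prompt variants by dropping one component at a time. Only the component names and the base prompt come from this page; the placeholder texts, the `ablation_variants` helper, and the leave-one-out design are assumptions and may not match the study's actual five variants.

```python
BASE = "You are Friedrich Nietzsche."

# Placeholder texts: the real component wording is not reproduced here.
COMPONENTS = {
    "tone": "<tone guidance>",
    "framework": "<framework guidance>",
    "rhetoric": "<rhetoric guidance>",
    "themes": "<themes guidance>",
    "constraints": "<constraints guidance>",
}


def ablation_variants(base: str, components: dict[str, str]) -> dict[str, str]:
    """Return the full prompt, one leave-one-out prompt per component,
    and the bare base prompt."""
    variants = {"full": "\n".join([base, *components.values()])}
    for dropped in components:
        kept = [text for name, text in components.items() if name != dropped]
        variants[f"no-{dropped}"] = "\n".join([base, *kept])
    variants["minimal"] = base
    return variants
```

If a judge scores `no-tone` about the same as `full`, the tone component is a candidate for "decorative"; a large drop marks it as load-bearing.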

[01]

443 experimental runs

[02]

3 model tiers (Haiku → Sonnet → Opus)

[03]

3 personas × 5 variants × cross-model

// tech stack

Next.js 15 · React 19 · TypeScript · Vercel AI SDK · SSE Streaming · Anthropic Claude · OpenAI

impact

Research contribution to prompt engineering: for well-known personas, minimal prompts work as well as elaborate ones — and this holds across model capability tiers. Even Haiku produces recognizable Nietzsche with just "You are Friedrich Nietzsche." The finding has implications for prompt optimization, context-window efficiency, and how teams reason about what their system prompts are actually buying.


##case-studies

Production work — written up in industry-portable terms.

[01] suite

##multi-agent-code-change-workflow-suite

A production prompt suite for multi-phase code changes with subagent teams, hard budgets, and schema-enforced contracts.

The suite runs every code change through three phases (plan, work, review). Subagent teams do the work under hard tool-call and retry budgets, and each phase hands off to the next through a schema-enforced output contract.

Each change is sized into a Light or Heavy track up front. Light mode handles isolated UI tweaks, copy edits, and refactors that don't touch schemas or contracts. Heavy mode triggers when a change touches one of the danger surfaces: models, migrations, auth, services, or API contracts. The mode determines which subagents are invoked, what their tool-call budgets are, and what's required in the review phase.
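
The routing rule can be sketched in a few lines. This is a simplified illustration: it assumes each danger surface maps to a directory name, and `DANGER_SURFACES` and `select_mode` are hypothetical names.

```python
from pathlib import PurePosixPath

# Hypothetical directory names standing in for the danger surfaces.
DANGER_SURFACES = {"models", "migrations", "auth", "services", "api"}


def select_mode(changed_paths: list[str]) -> str:
    """Return "heavy" if any changed file sits under a danger surface."""
    for path in changed_paths:
        if DANGER_SURFACES & set(PurePosixPath(path).parts):
            return "heavy"
    return "light"
```

A single risky file is enough to escalate the whole change, which keeps the rule conservative: a UI tweak bundled with a migration is treated as a migration.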

Subagents are file-domain-isolated. The backend implementer can only touch files matching the backend domain; the frontend implementer can only touch the frontend domain. Both write to a shared task log — a structured JSON file that acts as the durable contract between phases. The plan phase populates the task log; the work phase reads from and writes to it; the review phase verifies against it.
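
A minimal sketch of the write check, assuming domains are expressed as path patterns. The `DOMAINS` table, the task-log filename, and `may_write` are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical domains: which path patterns each implementer may write to.
DOMAINS = {
    "backend": ["backend/*"],
    "frontend": ["frontend/*"],
}
TASK_LOG = "task-log.json"  # assumed name; shared by both implementers


def may_write(agent: str, path: str) -> bool:
    """True if the path is the shared task log or inside the agent's domain."""
    if path == TASK_LOG:
        return True
    return any(fnmatch(path, pattern) for pattern in DOMAINS[agent])
```

In `fnmatch`, `*` also matches `/`, so `backend/*` covers nested directories. A real enforcement layer would normalize paths first so that something like `backend/../frontend/App.tsx` cannot slip through.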

Output is schema-enforced. Subagents don't return prose — they return JSON conforming to a per-subagent schema, validated by the orchestrator before the next phase starts. Disagreement becomes typeable; downstream consumers don't have to parse intent.

// outcome

A multi-agent workflow engineering teams can use without constantly babysitting the loop. The system fails closed: when a budget is exceeded or a schema doesn't validate, the workflow halts and escalates instead of silently degrading. The meta-tool, not the agent.

[02] survey

##cross-team-prompt-pattern-survey

Surveyed another team's prompt suite, identified five patterns that travel, adopted them into a different greenfield codebase, and wrote the playbook.

The premise: prompt engineering is converging across teams, but nobody's harvesting what travels. I surveyed a brownfield team's prompt suite — roughly 6,800 lines, 21 slash commands, 5 subagent templates, and a year of accumulated tooling — and read it the way you'd read another engineer's library: looking for what's generalizable versus what's load-bearing on their specific context.

Five patterns travelled. (1) Risk-driven mode selection — sizing a change up front based on what surfaces it touches, then routing through different workflows. (2) File-domain isolation between subagents — preventing the backend agent from rewriting the frontend during a "while I'm here" detour. (3) Task-log files as durable phase contracts — JSON-on-disk as the source of truth between plan and work phases, surviving context window resets. (4) Active-fixer review agents — agents that resolve high-severity issues in-place under a budget, not just file a report. (5) Frontmatter-registered subagents — discoverable, restricted, with declared tool sets.

I adopted all five into a different team's greenfield suite — adapting them to the new codebase's conventions and constraints — and wrote a teammate-facing report so other AI engineers could pilot the same patterns without re-doing the survey.

// outcome

Multi-team capability lift from one engineer's archaeology. The meta-skill is pattern extraction across codebases — knowing which conventions survive translation and which are scaffolding.

##artifacts

Actual prompt-engineering work and methodology artifacts.

[01] prompt

##implementation-prompts

Multi-step Figma-to-React implementation specifications with 1500+ line prompts including component inventories, state machines, and acceptance criteria. Generated via Claude + Figma MCP iteration, executed by Claude Code.

[02] spec

##CLAUDE.md-examples

Frontend behavioral specifications with "NEVER" constraints, service layer architecture enforcement, autonomous debugging workflows, and agent delegation triggers that make agents deterministic executors.

[03] contract

##subagent-contracts

Registered subagents with frontmatter metadata, restricted tool sets, JSON output schemas, and active-fixer (not passive-reporter) contracts under hard tool-call budgets. A subagent's contract is its API — what it can call, what it can return, what counts as done.

[04] insight

##key-insight

The goal isn't to remove agent thinking — it's to shape it. Deterministic specs handle the 80% where execution matters. Behavioral constraints handle the 20% where agents must reason. The art is knowing which is which.

##philosophy

"We are not prompting anymore. We are orchestrating." // the L3 thesis

Context engineering is greater than prompting. The CLAUDE.md is where the magic lives — it's the constitution that makes agents reliable.

Each project teaches you where humans are doing work agents could do. Find the bottleneck. Give the agent eyes and hands. Encode the learnings. Compound.

The next frontier isn't better prompts — it's designing agent behavior. Not "do X" but "think this way before deciding." L2 prompting tells agents what to do. L3 engineering shapes how they reason. The difference: one makes agents execute, the other makes them reliable at novel tasks.

The unit of design isn't the prompt — it's the workflow. Multi-agent systems with hard budgets, schema-enforced outputs, and durable task-log contracts move prompt engineering from craft to discipline. The prompt is one component of a system; the system has invariants you can engineer for.

##skills

###prompt-engineering

Designing the prompts and the systems that hold them.

[pe-01]

##claude-code-mastery

CLAUDE.md · Skills · Rules · Hooks · Commands · Memory

System prompt design, evaluation suites, and full-stack context engineering across the complete Claude Code toolkit. Behavioral specs, capability modules, constraint systems, automation hooks, and persistent memory — orchestrated for reliable agent execution.

[pe-02]

##multi-agent-orchestration

plan → work → review · file-domain isolation · JSON schema enforcement

Multi-phase workflows with subagent teams. File-domain isolation, hard tool-call and retry budgets, JSON-schema-enforced outputs, risk-driven mode selection, durable task-log contracts between phases. Registered subagents with frontmatter, restricted tools, and active-fixer contracts.

[pe-03]

##schema-first-prompt-design

JSON Schema · output contracts · typed disagreement

The prompt populates a schema; the schema is the artifact. JSON-schema-enforced subagent outputs make multi-agent disagreement typeable and downstream consumption parser-free.

[pe-04]

##friction-driven-refinement

iterative agent calibration

Converting conversation friction into persistent context. Building "compressed histories of agent failures" that evolve from project learnings.

[pe-05]

##behavioral-pattern-analysis

agent failure studies · constraint design

Studying how agents fail in order to understand why they fail. Converting failure patterns into behavioral constraints that prevent entire categories of errors.

[pe-06]

##counter-argument-design

steelman prompts · challenge loops

Building structured disagreement into agent workflows. Forcing agents to argue against their conclusions before presenting them — L3 behavioral engineering.

###agent-tooling

The infrastructure agents work inside of — local CLI, MCP, parallel orchestration.

[at-01]

##ai-sdlc-tooling

slash commands · registered subagents · output design

Designing the workflows engineering teams use to ship AI features safely. Slash commands, registered subagents, output design as a first-class concern, schema-bounded contracts. The meta-tools, not the agents.

[at-02]

##cdp-self-verification

Puppeteer · Chrome DevTools Protocol

Enabling agents to see and interact with the UI. Local CLI tools have far less overhead than MCP for autonomous debugging workflows.

[at-03]

##parallel-worktree-orchestration

port management · CDP integration

Spinning up isolated worktrees for parallel AI agents with deterministic port allocation. Agents start and work reliably by design.
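
One possible shape for deterministic port allocation, as a sketch: hash the worktree name into a reserved range, then resolve collisions in a fixed order. The range constants and both function names are assumptions.

```python
import hashlib

PORT_BASE = 3000   # assumed start of the range reserved for agent dev servers
PORT_RANGE = 1000  # assumed size of that range


def port_for(worktree: str) -> int:
    """Map a worktree name to the same port on every run and every machine."""
    digest = hashlib.sha256(worktree.encode("utf-8")).digest()
    return PORT_BASE + int.from_bytes(digest[:4], "big") % PORT_RANGE


def assign_ports(worktrees: list[str]) -> dict[str, int]:
    """Give each worktree a distinct port; collisions probe to the next slot.

    Assumes fewer worktrees than PORT_RANGE, otherwise probing never ends.
    """
    taken: set[int] = set()
    ports: dict[str, int] = {}
    for name in sorted(worktrees):  # sorted so input order never matters
        port = port_for(name)
        while port in taken:
            port = PORT_BASE + (port - PORT_BASE + 1) % PORT_RANGE
        taken.add(port)
        ports[name] = port
    return ports
```

`hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would give each agent a different answer.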

[at-04]

##MCP-server-integration

Remote MCP · tool exposure

Exposing tools to agents as MCP servers. Figma Remote MCP for design integration, custom CLI tools for validation and verification.

[at-05]

##llm-driven-automation

launchd · perl · structured prompt pipelines

Health-check pipelines delivering structured notifications via launchd-scheduled diagnostics. Multi-line prompt substitution via perl + env vars. Self-locating data files for scripts deployed outside the repo. Autonomous PM-style workflows.

###production-stack

What ships the AI features once the agent's work is done.

[ps-01]

##frontend

React 19 · Next.js · Vite 7 · TypeScript

Building responsive and interactive user interfaces with modern React features, Next.js for optimal performance, and Vite for fast development.

[ps-02]

##backend

FastAPI · Python 3.12 · Node.js · ASP.NET Core

Creating robust server-side applications with FastAPI microservices, async programming, and clean architecture.

[ps-03]

##async-python-at-scale

asyncpg · asyncio.gather · SQLAlchemy AsyncSession · Celery

asyncpg, asyncio.gather concurrency-safety, SQLAlchemy AsyncSession lifecycle. Familiar with the asymmetries that matter in production — what SQLite tolerates that asyncpg crashes on, where Loguru silently swallows extra= kwargs, when a bound async session can and can't be shared.

[ps-04]

##databases

PostgreSQL 16 · SQLAlchemy 2.0 · Supabase · SQL Server

Designing and implementing efficient database schemas with async ORMs, migrations (Alembic), and secure data management.

[ps-05]

##cross-boundary-debugging

Pydantic / SQLAlchemy / FastAPI / React / TypeScript

Production root-cause work across Pydantic, SQLAlchemy, FastAPI, React, TypeScript. Multi-layer defenses for schema invariants (Alembic clamp + DB constraint + Pydantic + read-time helper). Backend/frontend parser parity (snake_case ↔ camelCase, JSON-escape normalization). Migration safety on existing rows.
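
Parser parity is easy to state and easy to break. The sketch below shows a key converter pair and the round-trip property worth testing; the helper names are hypothetical, and the regex is deliberately simple.

```python
import re


def to_camel(name: str) -> str:
    """snake_case -> camelCase, e.g. tool_call_budget -> toolCallBudget."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_snake(name: str) -> str:
    """camelCase -> snake_case by inserting "_" before each capital letter."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def convert_keys(value, convert):
    """Apply a key converter recursively through dicts and lists."""
    if isinstance(value, dict):
        return {convert(k): convert_keys(v, convert) for k, v in value.items()}
    if isinstance(value, list):
        return [convert_keys(v, convert) for v in value]
    return value
```

Round-tripping a payload through both converters should return the original. With this simple pair it does for keys like `created_at`, but not for `line_2`, which comes back as `line2`. Digit and acronym keys are where backend and frontend parsers tend to drift apart, so they are worth an explicit test.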

[ps-06]

##ui-ux-systems

Tailwind CSS 4 · Radix UI · Storybook · Figma Code Connect

Crafting beautiful and intuitive user interfaces with modern design principles, component libraries, and bidirectional design system sync.

[ps-07]

##testing-quality

pytest · Vitest · React Testing Library · MSW

Comprehensive testing strategies including unit, integration, and API mocking for reliable software delivery.

###higher-order

Dispositions — how I think about engineering teams, not just code.

[ho-01]

##cross-team-pattern-extraction

survey · identify · encode

Surveys other teams' prompt suites and tooling, identifies what travels and what's team-specific, encodes shared patterns into reusable tooling. The meta-skill is knowing which patterns survive contact with a different codebase.

[ho-02]

##ai-sdlc-innovation

meta-tools, not agents

Builds the meta-tools engineering teams use to ship AI features safely. Treats the SDLC itself as a product surface — multi-agent workflows, subagent contracts, output design, schema-first prompts.

[ho-03]

##output-design-first-class

decomposition before polish

Treats agent output presentation as a design decision downstream of work decomposition, not a polish layer. What the agent says is determined by how the work is broken up; the prompt's structure is the user interface.

[ho-04]

##bridging-creative-systems-modes

creator's default ↔ engineer's default

Comfortable in both modes — "creator's default" (conversational, exploratory, narrating) and "engineer's default" (state machines, budgets, schemas, silence-until-result). Knows which mode each task calls for.

##education

[degree-01]

##bachelors-degree-in-mathematics-and-computer-science

University of California, Riverside

##contact

// get in touch

[email]: marcushobbs@mac.com
[github]: github.com/marcus-w-hobbs
[linkedin]: linkedin.com/in/marcushobbs
[phone]: +1 (818) 802-4085
[loc]: Hermosa Beach, CA

// open to AI engineering roles · prompt engineering · agent orchestration