Platform-native adapters, one stubborn rule: stop before you build.

Claude Code, Codex, OpenCode, Gemini CLI, Copilot, Cline

The ladder

Ponytail Skill is not “be brief”. It is a stop rule.

Before writing code, the agent climbs a boring decision ladder. Boring is the point: fewer dependencies, fewer abstractions, fewer clever mistakes.

1. Does this need to exist?

If the feature is speculative, skip it. If the option has one caller, inline it. If nobody asked for a framework, do not invent one.

2. Does the platform already do it?

Standard library first. Native browser controls first. Existing dependencies before new dependencies. Reach for boring tools before code.

3. Write the smallest safe thing

Only after the earlier rungs fail, implement the minimum that works. Validation, accessibility, tests, and security are not optional cuts.
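The three rungs can be read as an ordered series of early returns. The sketch below is only an illustration of that ordering: the `Request` fields and the returned labels are hypothetical names made up for this example, not part of Ponytail Skill itself.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical description of a proposed change (illustrative fields only)."""
    requested: bool        # did someone actually ask for this?
    callers: int           # how many call sites would use it
    platform_has_it: bool  # stdlib, native control, or installed dependency covers it


def climb_ladder(req: Request) -> str:
    """Return the label of the first rung that resolves the request."""
    # Rung 1: does this need to exist?
    if not req.requested:
        return "skip"
    if req.callers <= 1:
        return "inline"
    # Rung 2: does the platform already do it?
    if req.platform_has_it:
        return "use-platform"
    # Rung 3: write the smallest safe thing (validation and tests stay in).
    return "write-minimum"
```

New code is only reached when every earlier check has failed, which is the point of calling it a stop rule.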

What changes

The problem is not bad code. It is too much code.

Ponytail Skill changes the agent’s default move from “build a small system” to “look for the boring answer first.”

Deletes overbuilding pressure

No wrapper component when an HTML element works. No adapter layer when there is one implementation. No config nobody sets.

Prefers stdlib and native features

The ladder checks the language, runtime, browser, and current dependency graph before letting the agent write new code.
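As a rough illustration of checking what is already available before writing anything new, the sketch below asks the current Python environment whether a module can be imported. The helper names `already_available` and `first_available` are assumptions for this example; only `importlib.util.find_spec` is a real standard-library call.

```python
import importlib.util
from typing import Iterable, Optional


def already_available(module_name: str) -> bool:
    """True if a top-level module can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


def first_available(candidates: Iterable[str]) -> Optional[str]:
    """Return the first importable module from an ordered preference list."""
    for name in candidates:
        if already_available(name):
            return name
    return None  # nothing on hand: only now consider new code or a new dependency
```

For example, `first_available(["tomllib", "tomli"])` would prefer the standard-library TOML parser where it exists and fall back to an installed package before any custom parser is considered.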

Keeps the safety rails

The rule is not code golf. Ponytail Skill explicitly protects validation, error handling, accessibility, security, and one useful smoke test.

Leaves an upgrade path

Deliberate shortcuts can be marked as ponytail debt, with the ceiling and trigger that tell future you when to revisit them.

Measures real agent work

Benchmarks count git diff added lines from real Claude Code sessions, not chat completions padded with prose.

Works where agents work

Claude Code, Codex-style agents, OpenCode, Gemini CLI, Copilot, Cline, and plain skill files can all carry the same discipline.

Modes

Four ways to keep an agent honest

Ponytail Skill works as a workflow: code smaller, review complexity, audit cuttable code, and track deferred shortcuts.

Skip what does not need to exist

YAGNI is the first rung, not an afterthought.

Use boring primitives

Stdlib, native controls, and installed dependencies come before new code.

Protect the guardrails

Validation, accessibility, security, and smoke tests survive the cut.

Stop when lean enough

If there is nothing useful to cut, the right answer is: Ship.

Code

Persistent Ponytail Skill mode applies the ladder before every implementation.


Delete

Find dead code, unused flexibility, and speculative features.

Replace

Name the standard library or native platform feature that should replace custom code.

Score

End with a net line count so the review has a concrete target.

Stay in scope

Complexity review does not pretend to be a security review.

Review

Complexity review hunts abstractions, dependencies, and dead flexibility.


Mark the shortcut

Use a ponytail comment that names what was simplified.

Name the trigger

Record the exact condition that justifies revisiting the shortcut.

Harvest the ledger

/ponytail-debt collects markers so later does not mean never.

Keep it honest

A no-trigger shortcut is flagged as rot risk.
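To make the marker-and-ledger idea concrete, here is a minimal sketch of a harvester. The comment syntax it parses (`# ponytail: <what> | ceiling: <limit> | trigger: <condition>`) is an assumption invented for this example; the real marker format and the behavior of /ponytail-debt may differ.

```python
import re

# Assumed marker shape (not taken from the Ponytail Skill docs):
#   # ponytail: <what was simplified> | ceiling: <limit> | trigger: <condition>
MARKER = re.compile(r"#\s*ponytail:\s*(?P<body>.+)$")


def parse_marker(line: str):
    """Return a dict for a marker line, or None if the line has no marker."""
    match = MARKER.search(line)
    if not match:
        return None
    parts = [p.strip() for p in match.group("body").split("|")]
    entry = {"what": parts[0], "ceiling": None, "trigger": None}
    for part in parts[1:]:
        key, _, value = part.partition(":")
        key = key.strip().lower()
        if key in ("ceiling", "trigger") and value.strip():
            entry[key] = value.strip()
    return entry


def harvest(source: str):
    """Collect markers from source text and flag entries that lack a trigger."""
    ledger = []
    for number, line in enumerate(source.splitlines(), start=1):
        entry = parse_marker(line)
        if entry:
            entry["line"] = number
            entry["rot_risk"] = entry["trigger"] is None
            ledger.append(entry)
    return ledger
```

A marker such as `# ponytail: no backoff` would land in the ledger with `rot_risk` set, because it names the shortcut but not the condition for revisiting it.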

Debt

Every deliberate shortcut gets a ceiling and an upgrade trigger.


Benchmarks

Benchmarked against real agent work, not chat completions

The fair baseline is the same coding agent doing real edits with no skill. Ponytail Skill is compared against that, a terse-prose control, and a one-liner YAGNI prompt.

Metric	No-skill baseline	Caveman (terse prose)	Ponytail (decision ladder)
Added LOC vs baseline	100%	-20%	-54%
Token usage	100%	+7%	-22%
Run cost	100%	+3%	-20%
Elapsed time	100%	+2%	-27%
Adversarial safety score	100%	100%	100%
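The non-baseline columns in the table read as signed changes relative to the no-skill arm. A minimal sketch of that arithmetic follows; the raw line counts are hypothetical values picked only so the results match the table's percentages, not figures from the benchmark.

```python
def percent_vs_baseline(value: float, baseline: float) -> int:
    """Signed percentage change of value relative to baseline, rounded."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round((value - baseline) / baseline * 100)


# Hypothetical raw counts, chosen only to reproduce the table's deltas.
baseline_added_lines = 500
caveman_added_lines = 400
ponytail_added_lines = 230

caveman_delta = percent_vs_baseline(caveman_added_lines, baseline_added_lines)
ponytail_delta = percent_vs_baseline(ponytail_added_lines, baseline_added_lines)
```

Under these made-up counts, `caveman_delta` is -20 and `ponytail_delta` is -54.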

Principles

What Ponytail Skill optimizes for

Not vibes. Not cleverness. A short list of habits that keep diffs reviewable.

"Native beats clever. If the browser ships the control, the agent should have to justify building one."

The date picker rule

HTML before dependency

"A shortcut without a trigger is just debt pretending to be taste. Mark the ceiling, then move on."

The debt ledger rule

Deferred work with an upgrade path

"If a reviewer cannot tell what changed in one screen, the agent probably built more story than software."

The review rule

Small diffs over heroic abstractions

Method

A benchmark that tries not to flatter itself

The current benchmark is more credible because it admits what the first one got wrong and fixes the measurement.

Old run

Released

Single-shot completions looked too good

The early benchmark counted whole answers, including prose and options, so a chatty baseline inflated the savings.

Single shot, Critique

Fix

Released

Measure real agent edits instead

Each arm runs as a fresh headless Claude Code session against a pinned FastAPI + React repo, and the score is the git diff it leaves behind.
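One way to turn "the git diff it leaves behind" into a number is to sum the added column of `git diff --numstat`. The sketch below is an assumption about how such a count could be taken, not the benchmark harness itself; the function names and the choice of `HEAD` as the comparison point are illustrative.

```python
import subprocess


def added_lines_from_numstat(numstat: str) -> int:
    """Sum the 'added' column of `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added = parts[0]
        if added == "-":
            continue  # binary files report "-" for both counts
        total += int(added)
    return total


def added_lines(repo_path: str, base: str = "HEAD") -> int:
    """Run git in repo_path and count lines added relative to base."""
    result = subprocess.run(
        ["git", "diff", "--numstat", base],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return added_lines_from_numstat(result.stdout)
```

Note that a plain `git diff` does not report untracked files, so a harness built this way would need to stage or commit new files before measuring.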

Claude Code, Git diff

Control

Released

Compare against terse prose and YAGNI prompts

Caveman tests whether the effect is just shorter communication. A seven-word YAGNI prompt tests whether the skill is overkill.

Controls, Isolation

Safety

Released

Run adversarial guard checks

The safety tier checks whether smaller code drops validation and edge-case handling. Ponytail Skill kept the full safety score.
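A guard check of this kind can be sketched as a small function plus a list of hostile inputs it must reject. Everything below is hypothetical: `parse_quantity`, the input list, and the scoring helper are illustrations of the idea, not the benchmark's actual checks.

```python
def parse_quantity(raw: str) -> int:
    """Smallest safe version: parse a quantity from 1 to 1000, reject the rest."""
    value = int(raw.strip())  # raises ValueError on non-numeric input
    if not 1 <= value <= 1000:
        raise ValueError("quantity out of range")
    return value


# Hypothetical adversarial inputs: empty, non-numeric, out of range, injection-like.
ADVERSARIAL_INPUTS = ["", "abc", "-1", "0", "1001", "1e3", "5; DROP TABLE"]


def safety_score(func, bad_inputs) -> float:
    """Fraction of bad inputs that func rejects with ValueError."""
    rejected = 0
    for raw in bad_inputs:
        try:
            func(raw)
        except ValueError:
            rejected += 1
    return rejected / len(bad_inputs)
```

Cutting the range check to save two lines would let `"-1"`, `"0"`, and `"1001"` through, which is exactly the kind of regression a safety tier is meant to catch.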

Validation, Safety

More agents, same public method

The harness can keep adding models and agent surfaces without changing the claim: count what lands in the repository.

Reproduce, Extend

Install

Use the same discipline wherever your agent lives

Ponytail Skill ships as skills, hooks, plugin adapters, and plain instruction files. Pick the native path for your tool instead of inventing another workflow.

Claude Code

Hooks and skills that persist the Ponytail mode inside Claude Code sessions.

Agents

Codex CLI

Drop the skill into agent instructions and keep diffs small from the terminal.

Agents

OpenCode

Plugin adapter for OpenCode command and system prompt integration.

Agents

Gemini CLI

Extension manifest plus shared Ponytail rules for Gemini-driven coding sessions.

Agents

GitHub Copilot

Instruction files and command copies for editor-native assistant workflows.

Editors

Cline

Portable rules for code assistants that read project-level instruction files.

Editors

Promptfoo

Reproduce the old single-shot benchmark and inspect the newer agentic harness.

Proof

Git diff

The main metric is the code the agent leaves behind, not how confident it sounded.

Proof

Writeups

Read the receipts

Good developer marketing shows the method, the caveats, and the failure modes. Start with the benchmark notes.

Benchmark

Jun 18, 2026 12 min read

Agentic benchmark: does Ponytail Skill cut code without cutting safety?

A rebuilt benchmark using real Claude Code sessions, a pinned open-source repo, isolated arms, and git diff added lines instead of chat output.

Ponytail

Safety

Jun 17, 2026 7 min read

Agentic safety: shorter code still has to survive bad input

The separate safety tier checks whether minimization quietly removes validation, error handling, and guard behavior.

Ponytail

Cost

Jun 17, 2026 6 min read

Cost verification: smaller diffs are not automatically cheaper

A careful note on why token totals, thinking behavior, wall-clock time, and model pricing need to be measured instead of assumed.

Ponytail

FAQ

The uncomfortable questions first

Ponytail Skill is intentionally small, so the claims should be precise.

Is Ponytail Skill just a prompt that says “write less code”?

No. Terse prose is a control arm in the benchmark and does not produce the same result. Ponytail Skill is a persistent decision ladder: skip, stdlib, native platform, existing dependency, one line, then minimum implementation.

Does it make agents less safe?

The published agentic benchmark includes adversarial safety scoring. Ponytail Skill kept a 100% safety score while cutting code, because it explicitly refuses to cut validation, error handling, accessibility, security, or a useful smoke test.

Where do the numbers come from?

From real headless Claude Code sessions editing a pinned FastAPI + React open-source repo. The metric is git diff added lines, plus token, cost, and time totals. The benchmark is written to be reproducible and to address critique of the older single-shot numbers.

Will it always reduce code by 54%?

No. That is the mean in the benchmark, not a promise for every repository. The cut is largest when an agent would overbuild and near zero when the existing answer is already lean.

What happens when a shortcut needs to grow up?

Use a ponytail debt marker: name the ceiling and the trigger. The /ponytail-debt command harvests those markers into a ledger so deliberate shortcuts do not quietly rot.

Who is this for?

Developers who use coding agents and care about small diffs, boring dependencies, reviewable changes, and being able to explain why a feature did not become a subsystem.

Make your AI agent code like a lazy senior engineer

Install the boring senior engineer
