A tiny skill for large language models

Make your AI agent code like a lazy senior engineer

Ponytail Skill teaches coding agents to stop before they overbuild: use the standard library, native platform features, installed dependencies, and the smallest safe implementation.

-54% added LOC
-22% tokens
-20% cost
-27% run time
100% safety score
14 agent surfaces

Platform-native adapters, one stubborn rule: stop before you build.

Claude Code
Codex
OpenCode
Gemini CLI
Copilot
Cline
The ladder

Ponytail Skill is not “be brief.” It is a stop rule.

Before writing code, the agent climbs a boring decision ladder. Boring is the point: fewer dependencies, fewer abstractions, fewer clever mistakes.

1. Does this need to exist?

If the feature is speculative, skip it. If the option has one caller, inline it. If nobody asked for a framework, do not invent one.

2. Does the platform already do it?

Standard library first. Native browser controls first. Existing dependencies before new dependencies. Reach for boring tools before code.

3. Write the smallest safe thing

Only after the earlier rungs fail, implement the minimum that works. Validation, accessibility, tests, and security are not optional cuts.
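
The three rungs can be read as an ordered series of checks where the first match wins. The sketch below is only an illustration of that ordering; `Request`, `Rung`, and `climb` are hypothetical names invented here, not part of Ponytail Skill.

```python
from dataclasses import dataclass
from enum import Enum


class Rung(Enum):
    SKIP = "skip: it does not need to exist"
    PLATFORM = "use what the platform or an installed dependency already provides"
    MINIMAL = "write the smallest safe implementation"


@dataclass(frozen=True)
class Request:
    """Hypothetical description of a proposed change (not a Ponytail Skill type)."""

    explicitly_requested: bool  # did someone actually ask for this?
    platform_covers_it: bool    # stdlib, native control, or installed dependency


def climb(request: Request) -> Rung:
    """Return the first rung that resolves the request, checked in ladder order."""
    if not request.explicitly_requested:
        return Rung.SKIP
    if request.platform_covers_it:
        return Rung.PLATFORM
    return Rung.MINIMAL
```

Writing code is the fall-through case, which is the point: it is reached only when neither earlier rung applies.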

What changes

The problem is not bad code. It is too much code.

Ponytail Skill changes the agent’s default move from “build a small system” to “look for the boring answer first.”

Deletes overbuilding pressure

No wrapper component when an HTML element works. No adapter layer when there is one implementation. No config nobody sets.

Prefers stdlib and native features

The ladder checks the language, runtime, browser, and current dependency graph before letting the agent write new code.
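
As a small illustration of "stdlib first" that still keeps validation, here is a sketch of parsing a date string with Python's standard library rather than a hand-rolled regex or a new dependency. The function name `parse_due_date` and the error message are assumptions made for this example.

```python
from datetime import date


def parse_due_date(raw: str) -> date:
    """Parse an ISO 8601 date string with the standard library, keeping validation."""
    try:
        # fromisoformat rejects malformed input and impossible dates such as Feb 30.
        return date.fromisoformat(raw.strip())
    except ValueError as exc:
        raise ValueError(f"expected a date like 2024-01-31, got {raw!r}") from exc
```

The validation is not cut. It is delegated to code that already ships with the runtime.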

Keeps the safety rails

The rule is not code golf. Ponytail Skill explicitly protects validation, error handling, accessibility, security, and one useful smoke test.

Leaves an upgrade path

Deliberate shortcuts can be marked as ponytail debt, with the ceiling and trigger that tell future you when to revisit them.

Measures real agent work

Benchmarks count git diff added lines from real Claude Code sessions, not chat completions padded with prose.

Works where agents work

Claude Code, Codex-style agents, OpenCode, Gemini CLI, Copilot, Cline, and plain skill files can all carry the same discipline.

Modes

Four ways to keep an agent honest

The four modes form a workflow: code smaller, review complexity, audit cuttable code, and track deferred shortcuts.

Skip what does not need to exist

YAGNI is the first rung, not an afterthought.

Use boring primitives

Stdlib, native controls, and installed dependencies come before new code.

Protect the guardrails

Validation, accessibility, security, and smoke tests survive the cut.

Stop when lean enough

If there is nothing useful to cut, the right answer is: Ship.

Code

Persistent Ponytail Skill mode applies the ladder before every implementation.

Install Ponytail Skill
Benchmarks

Benchmarked against real agent work, not chat completions

The fair baseline is the same coding agent doing real edits with no skill. Ponytail Skill is compared against that, a terse-prose control, and a one-liner YAGNI prompt.

Metric                      No skill (baseline)   Caveman (terse prose)   Ponytail (decision ladder)
Added LOC vs baseline       100%                  -20%                    -54%
Token usage                 100%                  +7%                     -22%
Run cost                    100%                  +3%                     -20%
Elapsed time                100%                  +2%                     -27%
Adversarial safety score    100%                  100%                    100%
Principles

What Ponytail Skill optimizes for

Not vibes. Not cleverness. A short list of habits that keep diffs reviewable.

"Native beats clever. If the browser ships the control, the agent should have to justify building one."
The date picker rule

HTML before dependency

"A shortcut without a trigger is just debt pretending to be taste. Mark the ceiling, then move on."
The debt ledger rule

Deferred work with an upgrade path

"If a reviewer cannot tell what changed in one screen, the agent probably built more story than software."
The review rule

Small diffs over heroic abstractions

Method

A benchmark that tries not to flatter itself

The story is stronger because it admits what the first benchmark got wrong and fixes the measurement.

Old run
Released

Single-shot completions looked too good

The early benchmark counted whole answers, including prose and options, so a chatty baseline inflated the savings.

Single shot, Critique
Fix
Released

Measure real agent edits instead

Each arm runs as a fresh headless Claude Code session against a pinned FastAPI + React repo, and the score is the git diff it leaves behind.

Claude Code, Git diff
Control
Released

Compare against terse prose and YAGNI prompts

Caveman tests whether the effect is just shorter communication. A seven-word YAGNI prompt tests whether the skill is overkill.

Controls, Isolation
Safety
Released

Run adversarial guard checks

The safety tier checks whether smaller code drops validation and edge-case handling. Ponytail Skill kept the full safety score.

Validation, Safety
Next
In Progress

More agents, same public method

The harness can keep adding models and agent surfaces without changing the claim: count what lands in the repository.

Reproduce, Extend
Install

Use the same discipline wherever your agent lives

Ponytail Skill ships as skills, hooks, plugin adapters, and plain instruction files. Pick the native path for your tool instead of inventing another workflow.

Claude Code

Hooks and skills that persist the Ponytail mode inside Claude Code sessions.

Agents

Codex CLI

Drop the skill into agent instructions and keep diffs small from the terminal.

Agents

OpenCode

Plugin adapter for OpenCode command and system prompt integration.

Agents

Gemini CLI

Extension manifest plus shared Ponytail rules for Gemini-driven coding sessions.

Agents

GitHub Copilot

Instruction files and command copies for editor-native assistant workflows.

Editors

Cline

Portable rules for code assistants that read project-level instruction files.

Editors

Promptfoo

Reproduce the old single-shot benchmark and inspect the newer agentic harness.

Proof

Git diff

The main metric is the code the agent leaves behind, not how confident it sounded.

Proof
Writeups

Read the receipts

Good developer marketing shows the method, the caveats, and the failure modes. Start with the benchmark notes.

FAQ

The uncomfortable questions first

Ponytail Skill is intentionally small, so the claims should be precise.

Is Ponytail Skill just a prompt that says “write less code”?

No. Terse prose is a control arm in the benchmark and does not produce the same result. Ponytail Skill is a persistent decision ladder: skip, stdlib, native platform, existing dependency, one line, then minimum implementation.

Does it make agents less safe?

The published agentic benchmark includes adversarial safety scoring. Ponytail Skill kept a 100% safety score while cutting code, because it explicitly refuses to cut validation, error handling, accessibility, security, or a useful smoke test.

Where do the numbers come from?

From real headless Claude Code sessions editing a pinned FastAPI + React open-source repo. The metric is git diff added lines, plus token, cost, and time totals. The benchmark is written to be reproducible and to address critique of the older single-shot numbers.
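
For readers who want to try the same kind of measurement on their own repository, one way to count added lines is to sum the first column of `git diff --numstat`. This is a sketch under that assumption, not the benchmark's actual harness, and the function names are invented for the example.

```python
import subprocess


def added_lines_from_numstat(numstat: str) -> int:
    """Sum the 'added' column of `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)  # added, deleted, path
        if len(parts) < 3:
            continue
        added = parts[0]
        if added.isdigit():  # binary files report "-" instead of a count
            total += int(added)
    return total


def added_lines(repo: str, base: str = "HEAD") -> int:
    """Count lines added in the working tree of `repo` relative to `base`.

    Untracked files are not part of `git diff` output, so stage or commit
    new files first if they should be counted.
    """
    result = subprocess.run(
        ["git", "-C", repo, "diff", "--numstat", base],
        check=True,
        capture_output=True,
        text=True,
    )
    return added_lines_from_numstat(result.stdout)
```

Keeping the parser separate from the `git` call makes the counting rule easy to check without a repository on hand.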

Will it always reduce code by 54%?

No. That is the mean in the benchmark, not a promise for every repository. The cut is largest when an agent would overbuild and near zero when the existing answer is already lean.
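
To see how a mean can hide that spread, here is a sketch that averages per-task relative changes. The page does not say exactly how the benchmark aggregates its runs, so treat the formula as an assumption; the numbers are invented for illustration and are not benchmark data.

```python
from statistics import fmean


def relative_change(baseline_loc: int, skill_loc: int) -> float:
    """Change in added lines relative to the baseline run, e.g. -0.5 for half."""
    if baseline_loc <= 0:
        raise ValueError("baseline must add at least one line")
    return (skill_loc - baseline_loc) / baseline_loc


def mean_relative_change(pairs: list[tuple[int, int]]) -> float:
    """Average the per-task changes; each pair is (baseline_loc, skill_loc)."""
    return fmean(relative_change(b, s) for b, s in pairs)


# Invented numbers for illustration only, not benchmark data.
tasks = [(100, 20), (50, 50), (200, 60)]
```

In this made-up set, one task is already lean and shows no change, while the other two shrink substantially, so the average says little about any single repository.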

What happens when a shortcut needs to grow up?

Use a ponytail debt marker: name the ceiling and the trigger. The /ponytail-debt command harvests those markers into a ledger so deliberate shortcuts do not quietly rot.
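
The exact marker syntax is not shown on this page, so the sketch below invents one (`ponytail-debt: ceiling=...; trigger=...`) purely to illustrate what harvesting markers into a ledger could look like. It is not the `/ponytail-debt` command's implementation, and `MARKER`, `DebtEntry`, and `harvest` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Hypothetical marker syntax, invented for this sketch:
#   # ponytail-debt: ceiling=<limit>; trigger=<when to revisit>
MARKER = re.compile(
    r"ponytail-debt:\s*ceiling=(?P<ceiling>[^;]+);\s*trigger=(?P<trigger>.+)$"
)


@dataclass(frozen=True)
class DebtEntry:
    line_no: int
    ceiling: str
    trigger: str


def harvest(source: str) -> list[DebtEntry]:
    """Collect every debt marker in `source`, with its 1-based line number."""
    entries = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            entries.append(
                DebtEntry(line_no, match["ceiling"].strip(), match["trigger"].strip())
            )
    return entries
```

Requiring both fields in the pattern means a marker with no trigger is simply not recognized, which mirrors the rule that a shortcut needs a ceiling and a trigger to count as deliberate.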

Who is this for?

Developers who use coding agents and care about small diffs, boring dependencies, reviewable changes, and being able to explain why a feature did not become a subsystem.