The 5 Dimensions of AI Skill Quality: How to Score and Improve Your Prompts
Clarity, specificity, structure, completeness, actionability - a framework for evaluating any AI skill. Includes a self-assessment checklist and before/after examples.
Most AI skills score 1-2 out of 5. They're vague instructions wrapped in Markdown that the AI half-follows. The difference between a 2/5 skill and a 5/5 skill is measurable: a high-quality skill produces consistent output across sessions, requires fewer follow-up corrections, and saves 15-30 minutes per use. A low-quality skill is marginally better than having no skill at all.
This guide introduces a practical scoring framework with 5 dimensions, a self-assessment checklist, and a complete before/after transformation showing how to take a skill from score 1 to score 5, one dimension at a time.
The 5 Quality Dimensions
Each dimension is binary: 0 (missing/inadequate) or 1 (present/adequate). Total score ranges from 0 to 5. This isn't a subjective rubric - each dimension has concrete criteria you can check in 30 seconds.
Dimension 1: Clarity
Clarity measures whether someone who has never seen your skill can understand what it does within 10 seconds of reading it. This means the name is self-explanatory, the description is one sentence of plain English, and the instructions don't use unexplained jargon or assume context that isn't provided.
The test is simple: show the skill to a colleague who didn't write it. Can they tell you what it does without asking questions? If they need to read the whole document to understand the purpose, clarity is 0. If the name and first sentence tell the story, clarity is 1.
Clarity failures compound with team size. A solo developer can tolerate a cryptically-named skill because they wrote it and remember the context. On a team of 5, that same skill becomes a mystery that nobody uses because nobody knows what it does.
Score 0: Unclear
```markdown
---
name: cr-v2
description: does code stuff
---
Check things and report.
```
Name is an abbreviation. Description is meaningless. Instructions say nothing.
Score 1: Clear
```markdown
---
name: code-review
description: Reviews code for bugs, security issues, and maintainability problems
---
Perform a structured code review...
```
Name is descriptive. Description tells you what it does and what it checks for.
Common pitfall: Using abbreviations in skill names. “cr” could mean code review, change request, or carriage return. Always spell it out.
Dimension 2: Specificity
Specificity measures whether the instructions are concrete enough that two different AI models would produce the same output. Vague instructions like “review code carefully” are interpreted differently by every model in every session. Specific instructions like “check for null handling on every function parameter” produce consistent results.
The litmus test: can you replace your skill's instructions with “do a good job” and get meaningfully different output? If not, your skill isn't specific enough. Every instruction should constrain the AI's behavior in a way that “do a good job” doesn't.
Specificity is the dimension with the highest impact on output quality. A skill that scores 1 on specificity and 0 on everything else still outperforms a skill that scores 1 on all other dimensions but 0 on specificity. Concrete steps beat good formatting every time.
Score 0: Vague
```markdown
Review the code thoroughly and provide helpful feedback on quality and style.
```
Every AI already tries to do this. Zero constraining value.
Score 1: Specific
```markdown
1. Check every function for: null params, missing return types, unhandled exceptions
2. Flag functions over 30 lines
3. Format: [Severity] (Line N): Title
```
Concrete checklist. Measurable thresholds. Output format defined.
Common pitfall: Using adjectives instead of criteria. “Clean code” is subjective. “Functions under 30 lines with no more than 3 parameters” is measurable.
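This pitfall check can even be automated with a quick scan for subjective adjectives. A minimal sketch in TypeScript (the word list and function name are illustrative, not from any real tool):

```typescript
// Scan an instruction for subjective adjectives that should be replaced
// with measurable criteria. This word list is illustrative, not exhaustive.
const VAGUE = ["thorough", "careful", "comprehensive", "clean", "good", "proper", "robust"];

function findVagueWords(instruction: string): string[] {
  const words = instruction.toLowerCase().match(/[a-z]+/g) ?? [];
  // startsWith also catches adverb forms like "thoroughly" and "carefully".
  return VAGUE.filter((adj) => words.some((w) => w.startsWith(adj)));
}
```

An instruction like “Flag functions over 30 lines” comes back clean, while “Review the code thoroughly and be careful” gets flagged.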
Dimension 3: Structure
Structure measures whether the skill is organized with headings, lists, and clear sections that the AI can parse efficiently. A wall of text is harder for the AI to follow than a numbered list with bold headings. Structure isn't about aesthetics - it directly affects how well the AI extracts and follows instructions.
LLMs process structured text better than prose. When instructions are in a numbered list, the model can track “I've completed step 3, now I need step 4.” When instructions are in a paragraph, the model has to parse which sentence is an instruction and which is context. Numbered steps reduce instruction-skipping by roughly 40% in practice.
The ideal structure for a skill is: heading (“Steps”), numbered list of actions, heading (“Examples”), input/output pairs, heading (“Don't”), bullet list of prohibitions. This is the pattern that tools like Claude Code and Codex CLI are optimized for.
Score 0: Unstructured
```markdown
You should review the code and check for bugs. Also look for security issues. Make sure to provide examples. Don't forget to check error handling too. Format your output nicely.
```
Score 1: Structured
```markdown
## Steps
1. **Bugs**: null handling, off-by-one
2. **Security**: injection, auth bypass
3. **Errors**: unhandled exceptions

## Output format
[Severity] (Line N): Title - fix
```
Common pitfall: Over-structuring. A skill with 10 heading levels and 50 bullet points is as hard to follow as a wall of text. Keep it to 2-3 sections with 4-8 items each.
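The over-structuring bound can be spot-checked mechanically. A rough sketch, assuming skills use `##` headings with numbered or `-` list items (the function name and thresholds beyond the pitfall's own numbers are illustrative):

```typescript
// Flag over-structured skills: more than 3 sections, or any section
// with more than 8 list items, per the guideline above.
function isOverStructured(skill: string): boolean {
  // Split the file on level-2 headings; the chunk before the first heading is ignored.
  const sections = skill.split(/^## /m).slice(1);
  if (sections.length > 3) return true;
  return sections.some((body) => {
    const items = (body.match(/^\s*(?:\d+\.|-)\s/gm) ?? []).length;
    return items > 8;
  });
}
```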
Dimension 4: Completeness
Completeness measures whether the skill includes all components needed for reliable AI behavior: examples, edge cases, triggers, and negative examples (Don'ts). A skill with perfect instructions but no examples is like a recipe with ingredient quantities but no photos of the final dish - the cook has to guess what “done” looks like.
The minimum bar for completeness is: at least 2 input/output examples, a Don't section with 3+ items, and trigger phrases. The examples show the AI the expected format and depth. The Don't section prevents the most common unwanted behaviors. The triggers enable automatic activation.
Completeness is the dimension most developers skip because it feels like extra work. Writing 2 examples takes 10 minutes. But those 10 minutes prevent hours of correcting AI output that almost-but-not-quite matches what you wanted. Examples are the single highest-ROI component of any skill.
Score 0: Incomplete
```markdown
## Steps
1. Review the code
2. Report issues
```
(no examples, no triggers, no don'ts)
Score 1: Complete
```markdown
## Steps (4 numbered items)
## Examples (2 input/output pairs)
## Triggers (5 phrases)
## Don't (4 prohibitions)
```
Common pitfall: Examples that are too simple. Showing a trivial case doesn't teach the AI how to handle complexity. Include at least one example with an edge case or non-obvious scenario.
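The minimum bar above is easy to verify in code. A sketch, assuming the skill follows the `## Examples` / `## Triggers` / `## Don't` layout used throughout this guide (the function and its counting rules are illustrative):

```typescript
// Check the completeness bar: 2+ examples, 3+ Don't items, 1+ trigger phrases.
function isComplete(skill: string): boolean {
  // Extract one section's body: everything between "## <name>" and the next "## ".
  const section = (name: string): string => {
    const re = new RegExp(`## ${name}\\n([\\s\\S]*?)(?=\\n## |$)`);
    return re.exec(skill)?.[1] ?? "";
  };
  const examples = (section("Examples").match(/^### /gm) ?? []).length;
  const donts = (section("Don't").match(/^- /gm) ?? []).length;
  const triggers = (section("Triggers").match(/^- /gm) ?? []).length;
  return examples >= 2 && donts >= 3 && triggers >= 1;
}
```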
Dimension 5: Actionability
Actionability measures whether every instruction tells the AI what to do, not what to know. “You are an expert code reviewer” is knowledge (the AI ignores it - it's already trying to be helpful). “For each function, check: null handling, error propagation, return type correctness” is action (the AI does something specific).
The easiest test: read each sentence in your skill and ask “can the AI act on this right now?” If the answer is “it depends” or “it's background info,” the sentence isn't actionable. Cut it or convert it to an action step. Every sentence should either be an instruction the AI follows or an example it imitates.
Actionability is what separates skills that “kinda work” from skills that “just work.” A skill with high actionability produces the same output regardless of the model's mood, the conversation context, or whether it's 2 AM and the API is slow. The instructions are so concrete that interpretation is minimized.
Score 0: Knowledge-based
```markdown
You are an expert code reviewer with deep knowledge of security best practices and software architecture principles.
```
Score 1: Action-based
```markdown
1. Read the entire diff before commenting
2. For each function, check: null params, missing error handling, SQL injection
3. Format: **[Bug] (Line N):** description
```
Common pitfall: Starting with “You are...” or “Act as...” preambles. These waste tokens and don't change behavior. Skip the identity statement and go straight to instructions.
Self-Assessment: 15 Questions
Answer these yes/no questions for any skill. Count the “yes” answers in each group of 3 to get your dimension score (0 or 1 - score 1 if at least 2 of 3 are “yes”).
Clarity
- Can someone understand what this skill does from the name alone?
- Is the description one clear sentence with no jargon?
- Could a junior developer use this skill without asking you what it means?

Score 1 if at least 2 answers are “yes”.

Specificity
- Does every instruction include measurable criteria (numbers, thresholds, formats)?
- Would replacing your instructions with “do a good job” change the AI's output?
- Are there zero sentences that use vague adjectives (thorough, careful, comprehensive)?

Score 1 if at least 2 answers are “yes”.

Structure
- Are instructions in a numbered list (not a paragraph)?
- Does the skill have distinct sections with headings (Steps, Examples, Don't)?
- Can you find any specific instruction within 5 seconds of opening the file?

Score 1 if at least 2 answers are “yes”.

Completeness
- Are there at least 2 input/output examples?
- Is there a Don't section with 3+ items?
- Are there trigger phrases for automatic activation?

Score 1 if at least 2 answers are “yes”.

Actionability
- Does every sentence tell the AI what to DO (not what to KNOW or BE)?
- Are there zero “You are...” or “Act as...” preambles?
- Could the AI execute each instruction without asking for clarification?

Score 1 if at least 2 answers are “yes”.
Scoring math: Add up dimension scores. Total = Clarity + Specificity + Structure + Completeness + Actionability. Range: 0-5. Most skills start at 1-2. Target: 4-5.
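The scoring math above can be written out directly. A minimal sketch of the 2-of-3 rule and the 0-5 total (the function names are illustrative):

```typescript
type Answers = [boolean, boolean, boolean];

// One dimension = 3 yes/no answers; score 1 if at least 2 are "yes".
function dimensionScore(answers: Answers): 0 | 1 {
  return answers.filter(Boolean).length >= 2 ? 1 : 0;
}

// Total across the 5 dimensions, range 0-5. Target: 4-5.
function totalScore(dimensions: Answers[]): number {
  return dimensions.reduce((sum, d) => sum + dimensionScore(d), 0);
}
```

A skill with two solid dimensions and three weak ones scores 2/5, matching the observation that most skills start at 1-2.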
Full Transformation: Score 1 to Score 5
Let's take a real skill and improve it step by step through all 5 dimensions. Each step adds one dimension, and you can see exactly how the quality and output consistency improve.
Starting Point: Score 1/5
```markdown
---
name: test-gen
---
You are an expert test writer. Write comprehensive tests for the given code. Make sure to cover edge cases and use good testing practices. The tests should be thorough.
```
Score: Clarity 1 (name is okay), Specificity 0, Structure 0, Completeness 0, Actionability 0 = 1/5. The name gives a hint, but everything else is vague platitudes.
Step 1: Add Specificity (Score 2/5)
```markdown
---
name: test-gen
---
You are an expert test writer. When writing tests:
1. Write a happy path test for the most common input
2. Write edge case tests: empty input, null, boundary values, negative numbers, max length strings
3. Write error case tests: invalid types, network failures, permission denied
4. Use descriptive names: "should return empty array when user has no orders"
5. One assertion per test (or closely related assertions)
6. Use the project's test framework (Jest/Vitest)
```
Score: Clarity 1, Specificity 1, Structure 0, Completeness 0, Actionability 0 = 2/5. Now we have concrete steps, but they're mixed into the preamble and there are no examples.
Step 2: Add Structure (Score 3/5)
```markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.0.0
tags: [testing, quality]
---

# Test Writer

## Steps
1. **Happy path** - test the most common successful usage
2. **Edge cases** - empty input, null, boundary values, negative numbers
3. **Error cases** - invalid types, network failure, permission denied
4. **Naming** - "should [expected result] when [condition]"
5. **Assertions** - one per test, testing behavior not implementation
6. **Framework** - use Jest/Vitest with the project's conventions
```
Score: Clarity 1, Specificity 1, Structure 1, Completeness 0, Actionability 0 = 3/5. Now it's organized with headings and a clean numbered list. But still no examples or negative rules.
Step 3: Add Completeness (Score 4/5)
````markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.0.0
tags: [testing, quality]
---

# Test Writer

## Steps
1. **Happy path** - test the most common successful usage
2. **Edge cases** - empty, null, boundary, negative, max length
3. **Error cases** - invalid types, network failure, permission denied
4. **Naming** - "should [expected result] when [condition]"
5. **Assertions** - one per test, test behavior not implementation

## Examples

### Given: calculateDiscount(price, tier)
```typescript
it("should apply 20% for gold tier", () => {
  expect(calculateDiscount(100, "gold")).toBe(80);
});
it("should return 0 for price of 0", () => {
  expect(calculateDiscount(0, "gold")).toBe(0);
});
```

## Triggers
- "write tests for"
- "add test coverage"
- "test this function"

## Don't
- Don't write tests that only check toBeDefined()
- Don't mock everything - only external dependencies
- Don't test private methods or internal state
- Don't skip the edge cases for "simplicity"
````
Score: Clarity 1, Specificity 1, Structure 1, Completeness 1, Actionability 0 = 4/5. We have examples, triggers, and Don'ts. The only remaining gap is actionability: the “You are...” preamble is gone, but several steps still describe categories to cover rather than concrete actions to take.
Step 4: Add Actionability (Score 5/5)
````markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.1.0
tags: [testing, quality, jest, vitest]
---

# Test Writer

Generate thorough tests for the given code.

## Steps

1. **Analyze the function** - list its inputs, outputs, and side effects
2. **Write happy path test** - most common successful input/output
3. **Write 3+ edge case tests** - empty input, null, 0, boundary values, negative numbers, max-length strings
4. **Write 2+ error case tests** - invalid type, missing required param, network timeout, permission denied
5. **Name each test** - "should [verb] [expected] when [condition]"
6. **Assert specific values** - use toBe(80), not toBeTruthy()
7. **Add setup/teardown** if tests share state or need mocks

## Examples

### Given: calculateDiscount(price: number, tier: string): number
```typescript
describe("calculateDiscount", () => {
  it("should apply 20% discount for gold tier", () => {
    expect(calculateDiscount(100, "gold")).toBe(80);
  });
  it("should return 0 when price is 0", () => {
    expect(calculateDiscount(0, "gold")).toBe(0);
  });
  it("should handle negative prices by returning 0", () => {
    expect(calculateDiscount(-50, "gold")).toBe(0);
  });
  it("should throw on invalid tier", () => {
    expect(() => calculateDiscount(100, "platinum")).toThrow();
  });
});
```

## Triggers
- "write tests for"
- "add test coverage"
- "test this function"
- "create unit tests"

## Don't
- Don't write tests that only check toBeDefined() or toBeTruthy()
- Don't mock everything - only mock external dependencies (DB, API)
- Don't test implementation details (private methods, internal state)
- Don't skip edge cases to save time
````
Score: Clarity 1, Specificity 1, Structure 1, Completeness 1, Actionability 1 = 5/5. Every instruction is an action. “Write 3+ edge case tests” is something the AI can do immediately. “Analyze the function” with specific sub-tasks (list inputs, outputs, side effects) tells the AI exactly what analyzing means.
Impact of Each Score Level
To illustrate the practical difference, here's what you can expect at each score level when using a code review skill:
Score 1
Generic feedback. “This code could be improved.” No line references. Mix of useful and useless comments. You spend 5 minutes filtering signal from noise.
Score 2
Some specific findings but inconsistent format. Catches obvious bugs but misses subtle ones. Output varies between sessions.
Score 3
Organized findings with categories. Consistent format. Catches most bugs. But sometimes flags things you don't care about (style, formatting).
Score 4
Line-specific findings with severity levels. Consistent every time. Skips formatting issues. But occasionally misses edge cases you'd catch manually.
Score 5
Precise, line-referenced findings in your exact format. Catches null handling, security issues, and maintainability problems in priority order. Consistent across sessions and models. You trust the output.
Using the Score
Score every skill you write against this rubric before deploying it. A 3/5 is your minimum for production use - below that, the skill is too unreliable to depend on. Focus improvement effort on the missing dimensions, starting with Specificity (highest impact) and Completeness (second highest).
For teams, make the scoring part of your skill review process. When someone proposes a new skill, score it as a team. Reject anything below 3/5. It takes 30 minutes to improve a 2/5 to a 4/5 - and that 30 minutes saves hours of inconsistent AI output across the team.
The scoring framework isn't subjective. Two people scoring the same skill will arrive at the same number because each dimension has binary, verifiable criteria. This makes it useful for team standards: “All skills in our library must score at least 4/5” is an enforceable rule, not a vague aspiration.