The 5 Dimensions of AI Skill Quality: How to Score and Improve Your Prompts

Clarity, specificity, structure, completeness, actionability - a framework for evaluating any AI skill. Includes a self-assessment checklist and before/after examples.

8 min read · Tags: quality, AI review, framework

Most AI skills score 1-2 out of 5. They're vague instructions wrapped in Markdown that the AI half-follows. The difference between a 2/5 skill and a 5/5 skill is measurable: a high-quality skill produces consistent output across sessions, requires fewer follow-up corrections, and saves 15-30 minutes per use. A low-quality skill is marginally better than having no skill at all.

This guide introduces a practical scoring framework with 5 dimensions, a self-assessment checklist, and a complete before/after transformation showing how to take a skill from score 1 to score 5, one dimension at a time.

The 5 Quality Dimensions

Each dimension is binary: 0 (missing/inadequate) or 1 (present/adequate). Total score ranges from 0 to 5. This isn't a subjective rubric - each dimension has concrete criteria you can check in 30 seconds.
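The arithmetic is deliberately simple, which also makes it easy to automate. A minimal sketch in Python (the `SkillScore` class and field names are illustrative, not part of any tool):

```python
from dataclasses import dataclass, astuple

@dataclass
class SkillScore:
    # Each dimension is binary: True (present/adequate) or False (missing).
    clarity: bool
    specificity: bool
    structure: bool
    completeness: bool
    actionability: bool

    def total(self) -> int:
        # The total score is the count of dimensions that pass (0-5).
        return sum(astuple(self))
```

For example, `SkillScore(True, True, False, False, False).total()` returns 2 — the range where most skills start.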

Dimension 1: Clarity

Clarity measures whether someone who has never seen your skill can understand what it does within 10 seconds of reading it. This means the name is self-explanatory, the description is one sentence of plain English, and the instructions don't use unexplained jargon or assume context that isn't provided.

The test is simple: show the skill to a colleague who didn't write it. Can they tell you what it does without asking questions? If they need to read the whole document to understand the purpose, clarity is 0. If the name and first sentence tell the story, clarity is 1.

Clarity failures compound with team size. A solo developer can tolerate a cryptically named skill because they wrote it and remember the context. On a team of 5, that same skill becomes a mystery that nobody uses because nobody knows what it does.

Score 0: Unclear

SKILL.md

```markdown
---
name: cr-v2
description: does code stuff
---
Check things and report.
```

Name is an abbreviation. Description is meaningless. Instructions say nothing.

Score 1: Clear

SKILL.md

```markdown
---
name: code-review
description: Reviews code for bugs, security issues, and maintainability problems
---
Perform a structured code review...
```

Name is descriptive. Description tells you what it does and what it checks for.

Common pitfall: Using abbreviations in skill names. “cr” could mean code review, change request, or carriage return. Always spell it out.

Dimension 2: Specificity

Specificity measures whether the instructions are concrete enough that two different AI models would produce the same output. Vague instructions like “review code carefully” are interpreted differently by every model in every session. Specific instructions like “check for null handling on every function parameter” produce consistent results.

The litmus test: can you replace your skill's instructions with “do a good job” and get meaningfully different output? If not, your skill isn't specific enough. Every instruction should constrain the AI's behavior in a way that “do a good job” doesn't.

Specificity is the dimension with the highest impact on output quality. A skill that scores 1 on specificity and 0 on everything else still outperforms a skill that scores 1 on all other dimensions but 0 on specificity. Concrete steps beat good formatting every time.

Score 0: Vague

```markdown
Review the code thoroughly and provide helpful feedback on quality and style.
```

Every AI already tries to do this. Zero constraining value.

Score 1: Specific

```markdown
1. Check every function for: null params, missing return types, unhandled exceptions
2. Flag functions over 30 lines
3. Format: [Severity] (Line N): Title
```

Concrete checklist. Measurable thresholds. Output format defined.

Common pitfall: Using adjectives instead of criteria. “Clean code” is subjective. “Functions under 30 lines with no more than 3 parameters” is measurable.

Dimension 3: Structure

Structure measures whether the skill is organized with headings, lists, and clear sections that the AI can parse efficiently. A wall of text is harder for the AI to follow than a numbered list with bold headings. Structure isn't about aesthetics - it directly affects how well the AI extracts and follows instructions.

LLMs process structured text better than prose. When instructions are in a numbered list, the model can track “I've completed step 3, now I need step 4.” When instructions are in a paragraph, the model has to parse which sentence is an instruction and which is context. Numbered steps reduce instruction-skipping by roughly 40% in practice.

The ideal structure for a skill is: heading (“Steps”), numbered list of actions, heading (“Examples”), input/output pairs, heading (“Don't”), bullet list of prohibitions. This is the pattern that tools like Claude Code and Codex CLI are optimized for.

Score 0: Unstructured

```markdown
You should review the code and check for bugs. Also look for security issues. Make sure to provide examples. Don't forget to check error handling too. Format your output nicely.
```

Score 1: Structured

```markdown
## Steps
1. **Bugs**: null handling, off-by-one
2. **Security**: injection, auth bypass
3. **Errors**: unhandled exceptions

## Output format
[Severity] (Line N): Title - fix
```

Common pitfall: Over-structuring. A skill with 10 heading levels and 50 bullet points is as hard to follow as a wall of text. Keep it to 2-3 sections with 4-8 items each.

Dimension 4: Completeness

Completeness measures whether the skill includes all components needed for reliable AI behavior: examples, edge cases, triggers, and negative examples (Don'ts). A skill with perfect instructions but no examples is like a recipe with ingredient quantities but no photos of the final dish - the cook has to guess what “done” looks like.

The minimum bar for completeness is: at least 2 input/output examples, a Don't section with 3+ items, and trigger phrases. The examples show the AI the expected format and depth. The Don't section prevents the most common unwanted behaviors. The triggers enable automatic activation.

Completeness is the dimension most developers skip because it feels like extra work. Writing 2 examples takes 10 minutes. But those 10 minutes prevent hours of correcting AI output that almost-but-not-quite matches what you wanted. Examples are the single highest-ROI component of any skill.

Score 0: Incomplete

```markdown
## Steps
1. Review the code
2. Report issues
```

(no examples, no triggers, no don'ts)

Score 1: Complete

```markdown
## Steps (4 numbered items)
## Examples (2 input/output pairs)
## Triggers (5 phrases)
## Don't (4 prohibitions)
```

Common pitfall: Examples that are too simple. Showing a trivial case doesn't teach the AI how to handle complexity. Include at least one example with an edge case or non-obvious scenario.

Dimension 5: Actionability

Actionability measures whether every instruction tells the AI what to do, not what to know. “You are an expert code reviewer” is knowledge (the AI ignores it - it's already trying to be helpful). “For each function, check: null handling, error propagation, return type correctness” is action (the AI does something specific).

The easiest test: read each sentence in your skill and ask “can the AI act on this right now?” If the answer is “it depends” or “it's background info,” the sentence isn't actionable. Cut it or convert it to an action step. Every sentence should either be an instruction the AI follows or an example it imitates.

Actionability is what separates skills that “kinda work” from skills that “just work.” A skill with high actionability produces the same output regardless of the model's mood, the conversation context, or whether it's 2 AM and the API is slow. The instructions are so concrete that interpretation is minimized.

Score 0: Knowledge-based

```markdown
You are an expert code reviewer with deep knowledge of security best practices and software architecture principles.
```

Score 1: Action-based

```markdown
1. Read the entire diff before commenting
2. For each function, check: null params, missing error handling, SQL injection
3. Format: **[Bug] (Line N):** description
```

Common pitfall: Starting with “You are...” or “Act as...” preambles. These waste tokens and don't change behavior. Skip the identity statement and go straight to instructions.

Self-Assessment: 15 Questions

Answer these yes/no questions for any skill. Count the “yes” answers in each group of 3 to get your dimension score (0 or 1 - score 1 if at least 2 of 3 are “yes”).

Clarity

1. Can someone understand what this skill does from the name alone?
2. Is the description one clear sentence with no jargon?
3. Could a junior developer use this skill without asking you what it means?

Score 1 if at least 2 answers are “yes”.

Specificity

1. Does every instruction include measurable criteria (numbers, thresholds, formats)?
2. Would replacing your instructions with “do a good job” change the AI's output?
3. Are there zero sentences that use vague adjectives (thorough, careful, comprehensive)?

Score 1 if at least 2 answers are “yes”.

Structure

1. Are instructions in a numbered list (not a paragraph)?
2. Does the skill have distinct sections with headings (Steps, Examples, Don't)?
3. Can you find any specific instruction within 5 seconds of opening the file?

Score 1 if at least 2 answers are “yes”.

Completeness

1. Are there at least 2 input/output examples?
2. Is there a Don't section with 3+ items?
3. Are there trigger phrases for automatic activation?

Score 1 if at least 2 answers are “yes”.

Actionability

1. Does every sentence tell the AI what to DO (not what to KNOW or BE)?
2. Are there zero “You are...” or “Act as...” preambles?
3. Could the AI execute each instruction without asking for clarification?

Score 1 if at least 2 answers are “yes”.

Scoring math: Add up dimension scores. Total = Clarity + Specificity + Structure + Completeness + Actionability. Range: 0-5. Most skills start at 1-2. Target: 4-5.
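The scoring math translates directly into code; a minimal sketch (function names are illustrative):

```python
def dimension_score(answers: tuple[bool, bool, bool]) -> int:
    # Score 1 if at least 2 of the 3 yes/no answers are "yes".
    return 1 if sum(answers) >= 2 else 0

def total_score(dimensions: dict[str, tuple[bool, bool, bool]]) -> int:
    # Total = sum of the five dimension scores, range 0-5.
    return sum(dimension_score(a) for a in dimensions.values())
```

For example, a skill answering yes/yes/no on clarity still earns that dimension's point, while yes/no/no does not.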

Full Transformation: Score 1 to Score 5

Let's take a real skill and improve it step by step through all 5 dimensions. Each step adds one dimension, and you can see exactly how the quality and output consistency improve.

Starting Point: Score 1/5

SKILL.md

```markdown
---
name: test-gen
---
You are an expert test writer. Write comprehensive tests for
the given code. Make sure to cover edge cases and use good
testing practices. The tests should be thorough.
```

Score: Clarity 1 (name is okay), Specificity 0, Structure 0, Completeness 0, Actionability 0 = 1/5. The name gives a hint, but everything else is vague platitudes.

Step 1: Add Specificity (Score 2/5)

SKILL.md

```markdown
---
name: test-gen
---
You are an expert test writer. When writing tests:
1. Write a happy path test for the most common input
2. Write edge case tests: empty input, null, boundary values,
   negative numbers, max length strings
3. Write error case tests: invalid types, network failures,
   permission denied
4. Use descriptive names: "should return empty array when
   user has no orders"
5. One assertion per test (or closely related assertions)
6. Use the project's test framework (Jest/Vitest)
```

Score: Clarity 1, Specificity 1, Structure 0, Completeness 0, Actionability 0 = 2/5. Now we have concrete steps, but they're mixed into the preamble and there are no examples.

Step 2: Add Structure (Score 3/5)

SKILL.md

```markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.0.0
tags: [testing, quality]
---

# Test Writer

## Steps
1. **Happy path** - test the most common successful usage
2. **Edge cases** - empty input, null, boundary values, negative numbers
3. **Error cases** - invalid types, network failure, permission denied
4. **Naming** - "should [expected result] when [condition]"
5. **Assertions** - one per test, testing behavior not implementation
6. **Framework** - use Jest/Vitest with the project's conventions
```

Score: Clarity 1, Specificity 1, Structure 1, Completeness 0, Actionability 0 = 3/5. Now it's organized with headings and a clean numbered list. But still no examples or negative rules.

Step 3: Add Completeness (Score 4/5)

SKILL.md

````markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.0.0
tags: [testing, quality]
---

# Test Writer

## Steps
1. **Happy path** - test the most common successful usage
2. **Edge cases** - empty, null, boundary, negative, max length
3. **Error cases** - invalid types, network failure, permission denied
4. **Naming** - "should [expected result] when [condition]"
5. **Assertions** - one per test, test behavior not implementation

## Examples

### Given: calculateDiscount(price, tier)
```typescript
it("should apply 20% for gold tier", () => {
  expect(calculateDiscount(100, "gold")).toBe(80);
});
it("should return 0 for price of 0", () => {
  expect(calculateDiscount(0, "gold")).toBe(0);
});
```

## Triggers
- "write tests for"
- "add test coverage"
- "test this function"

## Don't
- Don't write tests that only check toBeDefined()
- Don't mock everything - only external dependencies
- Don't test private methods or internal state
- Don't skip the edge cases for "simplicity"
````

Score: Clarity 1, Specificity 1, Structure 1, Completeness 1, Actionability 0 = 4/5. We have examples, triggers, and Don'ts. The only gap is actionability - the “You are...” preamble is gone but let's make every instruction truly action-oriented.

Step 4: Add Actionability (Score 5/5)

SKILL.md

````markdown
---
name: test-writer
description: Generates tests with edge cases and meaningful assertions
version: 1.1.0
tags: [testing, quality, jest, vitest]
---

# Test Writer

Generate thorough tests for the given code.

## Steps

1. **Analyze the function** - list its inputs, outputs, and side effects
2. **Write happy path test** - most common successful input/output
3. **Write 3+ edge case tests** - empty input, null, 0, boundary
   values, negative numbers, max-length strings
4. **Write 2+ error case tests** - invalid type, missing required
   param, network timeout, permission denied
5. **Name each test** - "should [verb] [expected] when [condition]"
6. **Assert specific values** - use toBe(80), not toBeTruthy()
7. **Add setup/teardown** if tests share state or need mocks

## Examples

### Given: calculateDiscount(price: number, tier: string): number
```typescript
describe("calculateDiscount", () => {
  it("should apply 20% discount for gold tier", () => {
    expect(calculateDiscount(100, "gold")).toBe(80);
  });
  it("should return 0 when price is 0", () => {
    expect(calculateDiscount(0, "gold")).toBe(0);
  });
  it("should handle negative prices by returning 0", () => {
    expect(calculateDiscount(-50, "gold")).toBe(0);
  });
  it("should throw on invalid tier", () => {
    expect(() => calculateDiscount(100, "platinum")).toThrow();
  });
});
```

## Triggers
- "write tests for"
- "add test coverage"
- "test this function"
- "create unit tests"

## Don't
- Don't write tests that only check toBeDefined() or toBeTruthy()
- Don't mock everything - only mock external dependencies (DB, API)
- Don't test implementation details (private methods, internal state)
- Don't skip edge cases to save time
````

Score: Clarity 1, Specificity 1, Structure 1, Completeness 1, Actionability 1 = 5/5. Every instruction is an action. “Write 3+ edge case tests” is something the AI can do immediately. “Analyze the function” with specific sub-tasks (list inputs, outputs, side effects) tells the AI exactly what analyzing means.

Impact of Each Score Level

To illustrate the practical difference, here's what you can expect at each score level when using a code review skill:

Score 1

Generic feedback. 'This code could be improved.' No line references. Mix of useful and useless comments. You spend 5 minutes filtering signal from noise.

Score 2

Some specific findings but inconsistent format. Catches obvious bugs but misses subtle ones. Output varies between sessions.

Score 3

Organized findings with categories. Consistent format. Catches most bugs. But sometimes flags things you don't care about (style, formatting).

Score 4

Line-specific findings with severity levels. Consistent every time. Skips formatting issues. But occasionally misses edge cases you'd catch manually.

Score 5

Precise, line-referenced findings in your exact format. Catches null handling, security issues, and maintainability problems in priority order. Consistent across sessions and models. You trust the output.

Using the Score

Score every skill you write against this rubric before deploying it. A 3/5 is your minimum for production use - below that, the skill is too unreliable to depend on. Focus improvement effort on the missing dimensions, starting with Specificity (highest impact) and Completeness (second highest).

For teams, make the scoring part of your skill review process. When someone proposes a new skill, score it as a team. Reject anything below 3/5. It takes 30 minutes to improve a 2/5 to a 4/5 - and that 30 minutes saves hours of inconsistent AI output across the team.

The scoring framework isn't subjective. Two people scoring the same skill will arrive at the same number because each dimension has binary, verifiable criteria. This makes it useful for team standards: “All skills in our library must score at least 4/5” is an enforceable rule, not a vague aspiration.
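A minimum-score rule like that is straightforward to enforce in a review script or CI. A hypothetical gate over a map of skill names to scores (names and threshold are examples only):

```python
def review_gate(scores: dict[str, int], minimum: int = 4) -> list[str]:
    """Return the names of skills that fall below the library's
    minimum score and should be rejected or improved."""
    return sorted(name for name, score in scores.items() if score < minimum)
```

For instance, `review_gate({"code-review": 5, "test-writer": 4, "cr-v2": 1})` returns `["cr-v2"]`, flagging the one skill that needs work before it enters the shared library.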
