What Still Needs Work (POS Part 5)
Following our discussion on Cross-Tool Compatibility in Part 4, this final post takes an honest look at the system’s current limitations.
What Works Well
Before cataloging gaps, it’s worth noting what POS does well:
- Context persistence: Handoff records genuinely solve the “where was I?” problem across sessions
- Multi-tool support: The file-based architecture works with every AI tool tested so far
- Config-driven generation: Adding contexts, teams, and MCP integrations is a one-file edit
- Skill consistency: All 30 skills use the template system with shared partials and drift detection
- Low maintenance: Shell scripts, YAML, Markdown. No dependencies to update, no services to maintain
These are solved problems. They work reliably and don’t need fundamental rethinking.
The Gaps
Skill Coverage Is Uneven
Some skills are detailed and battle-tested (debugging, code-review, plan-generation). Others are thin wrappers that add little value over freeform prompting. The self-rating feedback loop is a start, but LLM-as-judge evaluation would systematically identify weak skills. Send each skill to a model, ask it to rate clarity and completeness, and flag anything scoring below 7.
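The loop described above can be sketched in shell. This is a hedged sketch, not POS's actual tooling: `judge` is a stub so the script runs standalone; a real version would send the skill file plus a rubric prompt to a model CLI and parse the numeric reply.

```shell
#!/bin/sh
# Hedged sketch of LLM-as-judge skill evaluation.
judge() {
  # Stub: always returns a middling score for the file passed in "$1".
  # Real version: send the file plus a rating rubric to a model and
  # parse the number it replies with.
  echo 6
}

rate_skills() {
  for skill in "$1"/*.md; do
    [ -f "$skill" ] || continue
    score=$(judge "$skill")
    if [ "$score" -lt 7 ]; then
      echo "WEAK: $skill (score $score)"
    fi
  done
}

# Demo on a throwaway fixture directory so the sketch is self-contained.
dir=$(mktemp -d)
printf 'thin wrapper skill\n' > "$dir/example.md"
rate_skills "$dir"
```

Running it against a real skills directory is just `rate_skills skills`, with the stub swapped for a real model call.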
No Automated Skill Testing
Skills are validated for format (frontmatter, placeholders, tool names) but not for behavior. There’s no test that verifies a skill actually produces good output when invoked. End-to-end testing would spawn a real session, invoke the skill with a fixture input, and verify the output. The challenge is cost: a full suite across all 30 skills could cost $50-100 per run.
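The harness itself is simple; the expense is in the session. A hedged sketch, where `run_skill` is a stub standing in for the real AI session invocation (the costly part), and the fixture path and expected marker are illustrative:

```shell
#!/bin/sh
# Hedged sketch of an end-to-end skill test harness.
run_skill() {
  # Stub: a real version would spawn a session, invoke skill "$1"
  # with fixture input "$2", and print the session's output.
  echo "review complete: no blocking issues"
}

e2e_test() {
  skill=$1; fixture=$2; expected=$3
  output=$(run_skill "$skill" "$fixture")
  case "$output" in
    *"$expected"*) echo "PASS: $skill" ;;
    *)             echo "FAIL: $skill"; return 1 ;;
  esac
}

e2e_test code-review fixtures/diff.patch "review complete"  # → PASS: code-review
```

Substring matching on output is crude; a real suite would likely use an LLM judge here too, which is exactly where the cost comes from.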
Manual Session Management for Non-Claude Tools
Claude Code gets automatic session registration via the SessionStart hook. Every other tool requires manual script execution or YAML writing. Tool-specific hooks or plugins would reduce this friction, but every tool’s extension system is different. Some have mature task runners. Others have limited or no hook mechanisms.
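The manual step could at least be reduced to one script that any tool's startup hook calls, mirroring what the SessionStart hook does for Claude Code. A sketch under stated assumptions: the `.state/sessions/` path and YAML field names here are illustrative, not POS's actual schema.

```shell
#!/bin/sh
# Hedged sketch: a registration script any tool's startup hook could call.
register_session() {
  tool=$1
  context=$2
  mkdir -p .state/sessions
  file=".state/sessions/$(date +%Y%m%d-%H%M%S)-$tool.yaml"
  cat > "$file" <<EOF
tool: $tool
context: $context
started: $(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF
  echo "$file"
}

# A Cursor or Aider startup hook, say, would just run:
register_session cursor consulting
```

Tools with hook mechanisms call this automatically; tools without them still get a one-liner instead of hand-written YAML.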
No Visual Dashboard
All system state is accessible through YAML files and scripts, but there’s no visual overview. A lightweight web dashboard that reads .state/snapshot.yaml and renders it as HTML would improve oversight without changing the core system. Static generation would be sufficient. No server needed.
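Static generation really can be this small. A hedged sketch: the `.state/snapshot.yaml` path is from the system, but the markup and output path are illustrative, and a real dashboard would render the YAML as structured HTML rather than a `<pre>` dump.

```shell
#!/bin/sh
# Hedged sketch of static dashboard generation: render the state
# snapshot as a single HTML page. No server needed.
build_dashboard() {
  snapshot=$1
  out=$2
  {
    echo '<!doctype html><meta charset="utf-8"><title>POS Dashboard</title>'
    echo '<h1>POS Dashboard</h1>'
    echo '<pre>'
    sed 's/&/\&amp;/g; s/</\&lt;/g' "$snapshot"  # escape YAML for HTML
    echo '</pre>'
  } > "$out"
}

# Demo with a tiny fixture snapshot; in real use this would be
# build_dashboard .state/snapshot.yaml dashboard.html
snap=$(mktemp)
printf 'contexts:\n  - name: consulting\n    sprint: 12\n' > "$snap"
build_dashboard "$snap" dashboard.html
```

A cron entry or a git post-commit hook could regenerate the page whenever state changes.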
Artifact Discovery Is Passive
Cross-skill artifacts exist, but consumer skills don’t automatically load them. An artifact manifest auto-loaded with context would solve this. When a context activates, generate a short summary of available artifacts and include it in the session metadata.
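The manifest generation step might look like the following sketch. The `artifacts/` layout and manifest field names are assumptions about a POS-like setup, not the system's actual schema.

```shell
#!/bin/sh
# Hedged sketch: generate an artifact manifest at context activation,
# so consumer skills see what exists without searching.
write_manifest() {
  context=$1
  manifest="artifacts/$context/manifest.yaml"
  mkdir -p "artifacts/$context"
  {
    echo "context: $context"
    echo "artifacts:"
    for f in "artifacts/$context"/*; do
      [ "$f" = "$manifest" ] && continue   # don't list the manifest itself
      [ -f "$f" ] && echo "  - path: $f"
    done
  } > "$manifest"
  echo "$manifest"
}

# Demo with one artifact; the resulting manifest would be injected
# into session metadata when the context activates.
mkdir -p artifacts/demo
printf 'review findings\n' > artifacts/demo/review-notes.md
write_manifest demo
```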
Grounding and Verification
The preamble partial now instructs skills to read actual state before working, and the verification partial requires checking output before claiming success. But these are instructions, not enforcement. A skill can still skip verification if the AI decides it’s unnecessary. Programmatic verification, where a script runs tests automatically after skill completion, would close this gap.
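The enforcement wrapper is small compared to the instructions it backs up. A hedged sketch, assuming a skill-completion hook exists to call it from; the `make test` default and log format are placeholders, not POS's actual conventions.

```shell
#!/bin/sh
# Hedged sketch of programmatic post-skill verification: run the
# project's checks after a skill finishes and record pass/fail,
# instead of trusting the AI's own claim of success.
verify_after_skill() {
  skill=$1
  check=${2:-"make test"}   # placeholder; substitute your real test command
  if sh -c "$check"; then
    status=pass
  else
    status=fail
  fi
  mkdir -p .state
  echo "skill: $skill status: $status" >> .state/verification.log
  [ "$status" = pass ]
}

verify_after_skill debugging "true"   # demo with a trivially passing check
```

Because the script runs the checks itself, a skipped verification step in the session no longer goes unnoticed.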
Roadmap Priorities
In order of impact:
- Artifact manifest in context loading: Low effort, high value for cross-skill coordination
- LLM-as-judge skill evaluation: Identifies which skills to improve
- Dashboard: Improves human oversight without changing the core system
- E2E skill testing: Quality assurance for skill behavior
- Per-tool session hooks: Reduces friction for non-Claude tools
- Programmatic post-skill verification: Auto-run tests after skill completion
What I Learned Building This
The value isn’t in the files. It’s in the discipline. YAML files and shell scripts are implementation details. The real value is the habit of ending every session with a handoff, starting every task with a plan, and reviewing every skill for improvement. The files enforce the discipline, but the discipline produces the results.
AI tools improve faster than your workflow can track. Every quarter, a major tool ships a feature that makes part of the system redundant or enables something new. The file-based architecture absorbs these changes well because it doesn’t depend on any tool’s specific capabilities. But staying current requires active attention.
The hardest part isn’t building the system. It’s maintaining the habit of using it. On a busy day with five context switches and three urgent bugs, it’s tempting to skip the handoff, skip the plan, skip the self-rating. Every shortcut makes the next session slightly worse. The system degrades gracefully, but it degrades. The scripts and templates lower the activation energy for doing things properly, but they can’t eliminate it entirely.
Start with the pain, not the architecture. POS didn’t begin as a grand design. It began with a single status.yaml file because I kept forgetting which sprint each project was on. Then handoffs, because I kept re-explaining context. Then skills, because I kept writing the same prompts. Every component exists because a specific pain point demanded it. The components that I built speculatively before feeling the pain are the ones that needed the most revision.
Shared conventions beat per-tool optimization. I spent weeks tuning Claude Code-specific configurations before realizing that the 80% solution (a markdown file any tool can read) was more valuable than the 100% solution for one tool. The portable skill system exists because I learned this lesson the hard way.
If You Build Your Own
If this series has you thinking about building something similar, here is practical advice from the mistakes I’ve already made:
Start with three skills, not thirty. Pick the three workflows you repeat most often. Likely code review, planning, and debugging. Write those skills, use them for two weeks, and iterate. The patterns that emerge from real usage are better than anything you can design upfront.
Handoffs are the highest-ROI feature. If you build only one thing from this series, build the handoff protocol: a simple YAML file written at the end of each session, recording what was done, what’s left, and where to resume. It eliminates the single biggest source of wasted time in AI-assisted workflows.
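As a concrete sketch, a minimal handoff file might look like this; the field names are illustrative, not a prescribed schema:

```yaml
# handoffs/2025-06-14-consulting.yaml -- illustrative example
session: 2025-06-14-consulting
done:
  - Fixed the auth token refresh bug
  - Added a regression test for the expiry edge case
remaining:
  - Deploy to staging and verify the fix
resume_at: Run the staging deploy, then re-check the login flow manually
```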
Don’t over-engineer the template system early. I built the full partial and template system after 15 skills existed and I had already experienced the pain of updating the same section across all of them. If you have three skills, you don’t need templates. Copy-paste is fine until it isn’t.
Keep the config file honest. pos.yaml should reflect what you actually manage, not what you aspire to manage. Listing twelve contexts when you actively work on four creates noise that dilutes the system’s usefulness. Add contexts when the work demands it, not before.
Version everything. Every file in POS lives in git. This means every change to every skill, every handoff, every status update has a history. When something breaks, git log tells you what changed. When a skill regresses, git diff shows exactly what was modified. This is an audit trail that costs nothing to maintain.
Expect to rebuild twice. The first version of POS was a mess of scattered markdown files with no conventions. The second version introduced structure but was over-engineered. The current version is the third. It finds a balance between flexibility and consistency. Your first version will also be wrong, and that’s fine. Build it, use it, and rebuild it when the pain points become clear.
Closing Thoughts
POS isn’t finished. It’s a working system that solves real problems. Context persistence, multi-tool coordination, multi-project management. But it has clear areas where it could be better.
The architecture is sound. Files, scripts, templates, and conventions. The improvement mechanisms are in place. Validation catches breakage, self-rating surfaces friction, artifacts connect skills. What remains is execution on the roadmap and continued iteration driven by real usage.
The system started as a practical solution to context chaos. It grew into a framework for coordinating AI tools across multiple professional contexts. Every piece exists because a real workflow problem demanded it. The gaps exist because those problems haven’t been painful enough to prioritize yet.
That’s the nature of a personal operating system. It evolves with the person using it.
POS is open source. Clone the template and build your own: github.com/abuango/pos-ai
This is part 5 of a 5-part series on Building a Personal Operating System for AI-Assisted Development.