Standards-Driven Glob Categorization for Cursor and GitHub Formatters¶
Issue: #165 Date: 2026-03-25 Status: Approved
Problem¶
The Cursor formatter's extractGlobs() and GitHub formatter's extractInstructions() methods use hardcoded string matching to categorize @guards.globs patterns into buckets:
.ts/.tsx→typescriptbuckettest/spec/__tests__→testingbucket- everything else →
filesbucket (Cursor only; GitHub has nofilesbucket)
This causes several issues:
- Non-TS languages ignored:
**/*.pyends up in a genericfilesbucket with no content - Misclassification:
**/*.test.tsxis classified astesting(nottypescript) even though it's both - No extensibility: Only 2 hardcoded categories exist
- Hardcoded content: The
testingbucket outputs"Follow project testing conventions and patterns."instead of pulling from@standards.testing
Both formatters have the same bug (though GitHub lacks the files fallback bucket). Named entries (@guards.security: { applyTo: [...] }) already bypass the heuristic and work correctly — this fix only affects the @guards.globs flat array path.
Solution: Standards-Category-Driven Auto-Split¶
Replace the string-matching heuristic with @standards-category-driven assignment. Instead of guessing categories from glob patterns, look at what @standards categories actually exist in the AST and assign globs to matching categories.
Category-to-Pattern Mapping¶
A shared mapping table defines which glob substrings are associated with which @standards category. Only categories that exist in the user's @standards block are considered during matching. CATEGORY_GLOB_HINTS is an exported module-level constant in glob-categorizer.ts (exported for testability).
export const CATEGORY_GLOB_HINTS: Record<string, string[]> = {
// === Languages ===
typescript: ['.ts', '.tsx', '.mts', '.cts'],
javascript: ['.js', '.jsx', '.mjs', '.cjs'],
python: ['.py', '.pyi', '.pyw'],
java: ['.java'],
c: ['.c'],
cpp: ['.cpp', '.cxx', '.cc', '.hpp'],
csharp: ['.cs', '.csx'],
go: ['.go'],
php: ['.php', '.phtml'],
sql: ['.sql'],
r: ['.r', '.R', '.Rmd'],
swift: ['.swift'],
rust: ['.rs'],
kotlin: ['.kt', '.kts'],
ruby: ['.rb', '.erb', '.rake'],
perl: ['.pl', '.pm'],
vb: ['.vb', '.vbs'],
delphi: ['.pas', '.dpr'],
fortran: ['.f90', '.f95', '.f03', '.f08'],
dart: ['.dart'],
matlab: ['.mlx'],
scala: ['.scala', '.sc'],
objectivec: ['.m', '.mm'],
shell: ['.sh', '.bash', '.zsh', '.fish'],
powershell: ['.ps1', '.psm1', '.psd1'],
lua: ['.lua'],
haskell: ['.hs', '.lhs'],
julia: ['.jl'],
groovy: ['.groovy', '.gvy'],
elixir: ['.ex', '.exs'],
clojure: ['.clj', '.cljs', '.cljc'],
fsharp: ['.fs', '.fsi', '.fsx'],
erlang: ['.erl', '.hrl'],
cobol: ['.cob', '.cbl'],
ada: ['.adb', '.ads'],
lisp: ['.lisp', '.lsp', '.cl'],
scheme: ['.scm', '.ss'],
assembly: ['.asm'],
solidity: ['.sol'],
zig: ['.zig'],
nim: ['.nim', '.nims'],
crystal: ['.cr'],
elm: ['.elm'],
ocaml: ['.ml', '.mli'],
abap: ['.abap'],
racket: ['.rkt'],
d: ['.d'],
hack: ['.hack'],
coffeescript: ['.coffee'],
gleam: ['.gleam'],
mojo: ['.mojo'],
cairo: ['.cairo'],
move: ['.move'],
pony: ['.pony'],
ballerina: ['.bal'],
reason: ['.re', '.rei', '.res', '.resi'],
// === Web / UI frameworks (unique extensions only) ===
vue: ['.vue'],
svelte: ['.svelte'],
astro: ['.astro'],
angular: ['.component.ts', '.module.ts', '.service.ts', '.directive.ts', '.pipe.ts'],
blade: ['.blade.php'],
heex: ['.heex'],
razor: ['.razor', '.cshtml'],
haml: ['.haml'],
// === Styling ===
css: ['.css'],
scss: ['.scss', '.sass'],
less: ['.less'],
stylus: ['.styl'],
// === Markup / Data ===
html: ['.html', '.htm'],
markdown: ['.md', '.mdx'],
xml: ['.xml', '.xsl', '.xsd'],
// === Infrastructure / DevOps ===
terraform: ['.tf', '.tfvars', '.hcl'],
puppet: ['.pp'],
saltstack: ['.sls'],
// === Schema / API ===
graphql: ['.graphql', '.gql'],
protobuf: ['.proto'],
prisma: ['.prisma'],
thrift: ['.thrift'],
avro: ['.avsc'],
// === Data Science ===
jupyter: ['.ipynb'],
// === Hardware description ===
verilog: ['.sv', '.svh'],
vhdl: ['.vhd', '.vhdl'],
// === Testing (pattern-based) ===
testing: ['test', 'spec', '__tests__'],
// === Cucumber / BDD ===
gherkin: ['.feature'],
// === Storybook ===
storybook: ['.stories.ts', '.stories.tsx', '.stories.js', '.stories.jsx', '.stories.mdx'],
};
Extension table notes:
- Objective-C includes
.m(most common) and.mm. The.mextension conflicts with MATLAB source files. Since only categories present in@standardsare checked, this is only an issue if both@standards.objectivecand@standards.matlabexist. In that case,**/*.mmatchesobjectivec(because MATLAB has no.mhint — only.mlx). Users needing both should use named entries for disambiguation. - MATLAB uses
.mlxonly (not.matwhich is a binary data format, nor.mwhich is more commonly associated with Objective-C in coding contexts). Users who want.mfiles matched to MATLAB should use named entries. - Fortran omits bare
.fto avoid false matches; uses.f90+ variants only. - Hint matching is case-sensitive. The R language includes both
.rand.Rvariants in its hints. Categories for case-sensitive extensions must list all common casing variants explicitly.
Matching Algorithm¶
The source of truth for "category exists" is the StandardsExtractor: a category is considered to exist only if standardsExtractor.extract() returns it in codeStandards with non-empty items. An empty @standards.testing: {} does not count as an existing category.
1. Run standardsExtractor.extract(standardsContent) to get codeStandards map
2. For each glob pattern in the array:
a. Collect all (category, hint) pairs where:
- category exists in codeStandards (has non-empty items)
- category has an entry in CATEGORY_GLOB_HINTS
- hint appears in the glob string
- hint passes boundary detection (see below)
b. From all matching pairs, pick the winner using this priority:
1. LONGEST hint wins (e.g., ".component.ts" (13 chars) beats ".ts" (3 chars))
2. If tied on length: pattern-based hints (no leading dot) beat
extension-based hints (leading dot), so "test" (4 chars) beats
".tsx" (4 chars)
3. If still tied: first category in CATEGORY_GLOB_HINTS iteration order wins
c. Assign glob to the winning category's bucket
d. If no category matches -> discard (no "files" bucket in output)
3. For each non-empty bucket:
- Format codeStandards items as a bullet list ("- item1\n- item2\n...")
- Return as CategorizedGlobs entry
4. Entries with empty content are excluded from the result array
Tie-breaking examples:
**/*.component.tswith both@standards.angularand@standards.typescript→angularwins (.component.tsis 13 chars vs.tsat 3 chars — longest hint)**/*.test.tsxwith both@standards.testingand@standards.typescript→testingwins (testat 4 chars ties.tsxat 4 chars, buttestis pattern-based (no dot) which beats extension-based.tsx)**/*.spec.jswith both@standards.testingand@standards.javascript→testingwins (specat 4 chars vs.jsat 3 chars — longest hint)
Single category assignment: Each glob is assigned to exactly one category. A **/*.test.tsx glob will not appear in both testing and typescript buckets. This is acceptable because:
- Named entries (
@guards) provide full control for users who need multi-category assignment - Duplicating globs across categories would mean the same file triggers multiple rule sets, which may be undesirable
Boundary Detection¶
A hint matches a glob if the glob contains the hint and both surrounding characters pass boundary checks:
- Character before the hint (if any): must NOT be a letter (
[a-zA-Z]) - Character after the hint (if any): must NOT be a letter (
[a-zA-Z])
This prevents false matches in both directions:
| Glob | Hint | Before | After | Match? |
|---|---|---|---|---|
**/*.c | .c | * | end-of-string | Yes |
**/*.cpp | .c | * | p (letter) | No |
**/*.cs | .c | * | s (letter) | No |
**/*.cs | .cs | * | end-of-string | Yes |
**/*.component.ts | .component.ts | * | end-of-string | Yes |
**/*.test.ts | test | . | . | Yes |
**/attest.ts | test | t (letter) | . | No |
**/__tests__/** | __tests__ | / | / | Yes |
**/contest/** | test | n (letter) | / | No |
For extension-based hints (starting with .), the "before" check naturally passes because the dot itself acts as a non-letter boundary.
For pattern-based hints (testing category: test, spec, __tests__), the before-check prevents matching substrings like attest, contest, inspect.
Content Generation¶
Replace all hardcoded content with StandardsExtractor lookups:
For each category bucket:
1. Look up the category key in extracted.codeStandards map
2. Format StandardsEntry.items as bullet list ("- item1\n- item2\n...")
3. This is the content field in CategorizedGlobs
The GlobCategorizer is responsible for formatting the StandardsEntry.items array into a content string (bullet list). The caller receives ready-to-use content. Entries where extraction yields no items are excluded from the result array entirely (the categorize() method never returns entries with empty content).
This removes:
- Cursor's hardcoded
if (config.name === 'typescript')/if (config.name === 'testing')ingenerateGlobFile() - GitHub's
getTypeScriptInstructionContent()andgetTestingInstructionContent()methods
Shared Implementation¶
Both formatters share the same heuristic bug, so the fix is extracted into a shared utility.
New file: packages/formatters/src/extractors/glob-categorizer.ts¶
export interface CategorizedGlobs {
/** Category key matching an @standards entry (e.g., "typescript", "testing") */
name: string;
/** Glob patterns assigned to this category */
patterns: string[];
/** Formatted content from @standards.<category> as bullet list; always non-empty */
content: string;
/** Human-readable description (e.g., "TypeScript-specific rules") */
description: string;
}
export class GlobCategorizer {
constructor(private standardsExtractor: StandardsExtractor) {}
/**
* Categorize glob patterns by matching them against @standards categories.
*
* @param globs - flat array of glob pattern strings from @guards.globs
* @param standardsContent - the BlockContent from the @standards block, or null if no @standards exists
* @returns array of CategorizedGlobs, one per matched category.
* Excludes unmatched globs and categories with empty content.
* Returns empty array when standardsContent is null.
*/
categorize(globs: string[], standardsContent: BlockContent | null): CategorizedGlobs[];
}
The description field is generated as "${title}-specific rules" where title comes from the getSectionTitle() utility in extractors/types.ts. For categories not in DEFAULT_SECTION_TITLES, the title is derived by capitalizing the key (e.g., python → Python, objectivec → Objectivec). The DEFAULT_SECTION_TITLES map should be extended to include common entries from CATEGORY_GLOB_HINTS that need custom titles (e.g., objectivec → Objective-C, csharp → C#, cpp → C++, fsharp → F#).
Formatter changes¶
| Formatter | Remove | Add |
|---|---|---|
Cursor extractGlobs() | Hardcoded ts/test/other pattern filtering from globs array | GlobCategorizer.categorize() call, convert results to GlobConfig[] |
Cursor generateGlobFile() | Hardcoded content branches for typescript and testing names | Use config.content uniformly (already works for named entries) |
GitHub extractInstructions() | Hardcoded ts/test pattern filtering from globs array | GlobCategorizer.categorize() call, convert results to InstructionConfig[] |
| GitHub | getTypeScriptInstructionContent(), getTestingInstructionContent() methods | Removed (dead code) |
Named entries path (@guards.key: { applyTo: [...] }) stays untouched in both formatters.
Backward Compatibility¶
This is a behavioral change for users relying on the old heuristic:
Before: @guards.globs: ["**/*.ts", "**/*.spec.ts"] with @standards.typescript but no @standards.testing produces both typescript.mdc and testing.mdc (with hardcoded content).
After: Same input produces only typescript.mdc (with .spec.ts included). No testing.mdc because there's no @standards.testing key.
This is correct and desirable — the old behavior guessed wrong. Users who want a testing file should either:
- Add
@standards.testing: { ... }(and.spec.tsglobs will matchtestingfirst sincetestinghints includespec) - Use named entries in
@guardsfor full control
Note: The files bucket behavior is unchanged — it already produced no output file in the current implementation (Cursor's generateGlobFile() returns null when sections.length <= 1). GitHub never had a files bucket.
No deprecation warning needed. The globs array still works; it just produces smarter output.
Test Strategy¶
Unit tests: packages/formatters/src/extractors/__tests__/glob-categorizer.spec.ts¶
- Glob with
.tsextension →typescriptcategory (when@standards.typescriptexists) - Glob with
.pyextension →pythoncategory (when@standards.pythonexists) - Glob with
specin path →testingcategory (when@standards.testingexists) - Glob with
.pybut no@standards.python→ not in any category (unmatched) - Boundary:
**/*.cdoes NOT matchcppcategory - Boundary:
**/*.csmatchescsharp, notcorcss - Boundary:
**/attest.tsdoes NOT matchtesting(letter beforetest) - Boundary:
**/contest/**does NOT matchtesting(letter beforetest) - Longest match:
**/*.component.ts→angularovertypescript - Pattern-based tiebreaker:
**/*.test.tsx→testingovertypescript - Multiple globs split across different categories
- Empty globs array → empty result
- Null
standardsContent(no@standardsblock) → empty result - Empty
@standards.testing: {}→ not treated as existing category - Content is formatted as bullet list from StandardsEntry items
Integration tests: Cursor formatter (cursor.spec.ts)¶
- Multifile with
@standards.typescript+ matching globs →typescript.mdcwith real content - Multifile with
@standards.testing→testing.mdcwith real content (not hardcoded) - Multifile with
@standards.python+**/*.py→python.mdc - Named entries still work unchanged
- No
@standards→ no glob-specific files generated
Integration tests: GitHub formatter (github.spec.ts)¶
- Same patterns as Cursor but generating
.instructions.mdfiles - Content comes from
StandardsExtractor, not removed helper methods