typos/docs/design.md

# Design

## Requirements

Spell checks source code:
- Requires special word-splitting logic to handle situations like hex (`0xDEADBEEF`), `c\nescapes`, `snake_case`, `CamelCase`, `SCREAMING_CASE`, and maybe `arrow-case`.
- Each programming language has its own quirks, like abbreviations, lack of a word separator (`copysign`), etc.
- Backwards compatibility might require keeping misspelled words.
- Case for proper nouns is irrelevant.

Checking for errors in a CI:
- No false-positives.
- On spelling errors, sets the exit code to fail the CI.
- Machine-independent, repo-specific configuration
  - As opposed to config layered from the user's system or the command line

Quick feedback and resolution for developers:
- Fix errors for the user.
- Integration into other programs, like editors:
  - `fork`: easy to call into and provides a stable API, including output format
  - linking: usable either directly from the language of choice or through bindings to that language.

## Trade Offs

### Corrections vs Dictionaries

Corrections: Known misspellings that map to their corresponding dictionary word
- Ignores unknown typos
- Ignores typos that follow c-escapes if they aren't handled correctly
- Good for unassisted automated correcting
- Fast, can quickly run across large code bases
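
The corrections model can be sketched as a plain lookup table. This is only an illustration: the `CORRECTIONS` entries and the `correct` helper are hypothetical names, not the tool's actual data or API.

```python
from typing import Optional

# Hypothetical corrections table: known misspelling -> intended dictionary word.
CORRECTIONS = {
    "recieve": "receive",
    "seperate": "separate",
    "teh": "the",
}


def correct(word: str) -> Optional[str]:
    """Return the known fix for `word`, or None when it is not a known misspelling."""
    # A single hash lookup per word is what keeps this model fast.
    return CORRECTIONS.get(word.lower())


print(correct("recieve"))   # a known misspelling maps to its fix
print(correct("deadbeef"))  # unknown tokens are ignored rather than flagged
```

Because only listed misspellings ever match, unknown typos pass silently, but a hex fragment or an unusual abbreviation is also never reported by accident.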

Dictionary: A confidence rating is given for how close a word is to one in a dictionary
- Sensitive to false positives due to hex numbers and c-escapes
- Used in word processors and other traditional spell checking applications
- Good when there is a UI to let the user know and override any decisions
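
For contrast, a dictionary model might look roughly like the sketch below. It uses Python's standard `difflib` as a stand-in for a similarity score; the word list, the `suggestions` helper, and the `0.7` cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import List, Tuple

# Hypothetical, tiny dictionary of correctly spelled words.
DICTIONARY = ["beef", "dead", "default", "receive", "separate"]


def suggestions(word: str, cutoff: float = 0.7) -> List[Tuple[str, float]]:
    """Return (candidate, similarity) pairs for a word that is not in the dictionary."""
    lowered = word.lower()
    if lowered in DICTIONARY:
        return []
    matches = difflib.get_close_matches(lowered, DICTIONARY, n=3, cutoff=cutoff)
    # Attach the similarity ratio so a UI could show how confident the match is.
    return [(m, difflib.SequenceMatcher(None, lowered, m).ratio()) for m in matches]


print(suggestions("recieve"))  # a real typo gets a close candidate
print(suggestions("beaf"))     # a fragment of a hex literal gets one too
```

The second call shows the false-positive risk: a fragment of a hex literal such as `0xBEAF` is close enough to a real word to be flagged, which is tolerable when a person reviews each suggestion but not in unattended CI.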

## Identifiers and Words

With a focus on spell checking source code, most text will be in the form of
identifiers that are made up of words conjoined via `snake_case`, `CamelCase`,
etc.  A typo at the word level might not be a typo as part of
an identifier, so identifiers get checked and, if not in a dictionary, will
then be split into words to be checked.
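
The two-stage check described above might be sketched as follows. The identifier `O_WRONLY`, the table contents, and the `check_identifier` helper are hypothetical, and splitting on `_` alone is a simplification of the real word-splitting rules.

```python
from typing import Dict

# Hypothetical: identifiers that are accepted as a whole.
IDENTIFIER_DICTIONARY = {"O_WRONLY"}

# Hypothetical word-level corrections.
WORD_CORRECTIONS = {"wronly": "wrongly", "recieve": "receive"}


def check_identifier(identifier: str) -> Dict[str, str]:
    """Return {misspelled word: fix} for one identifier, checking it whole first."""
    if identifier in IDENTIFIER_DICTIONARY:
        # The identifier is known, so its words are never examined.
        return {}
    # Simplification: only split on underscores in this sketch.
    words = [w for w in identifier.lower().split("_") if w]
    return {w: WORD_CORRECTIONS[w] for w in words if w in WORD_CORRECTIONS}


print(check_identifier("O_WRONLY"))        # accepted as a whole identifier
print(check_identifier("is_wronly_set"))   # falls through to the word check
print(check_identifier("recieve_buffer"))
```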

Identifiers are defined using
[Unicode's `XID_Continue`](https://www.unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers)
which includes `[a-zA-Z0-9_]`.
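
As a rough illustration, the sketch below extracts identifiers using only the ASCII subset `[a-zA-Z0-9_]` mentioned above. The full `XID_Continue` class covers many more characters, and the `identifiers` helper is a hypothetical name.

```python
import re
from typing import List

# ASCII-only approximation of XID_Continue; the real class is much larger.
IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def identifiers(text: str) -> List[str]:
    """Return every maximal run of identifier characters in `text`."""
    return IDENTIFIER.findall(text)


print(identifiers("let first_name = getUserName(0xDEADBEEF);"))
# ['let', 'first_name', 'getUserName', '0xDEADBEEF']
```

Note that the hex literal comes out as a single identifier, which is why it needs special handling before any word-level check.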

Words are split from identifiers on case changes as well as breaks in
`[a-zA-Z]` with a special case to handle acronyms.  For example,
`First10HTMLTokens` would be split as `first`, `html`, `tokens`.
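
One way to approximate this splitting rule is a single regular expression, shown below. This is an assumption-laden sketch rather than the tool's implementation: the pattern and the `split_words` helper are illustrative only.

```python
import re
from typing import List

# Either a run of capitals that is not followed by a lowercase letter (an acronym),
# or an optional capital followed by lowercase letters (an ordinary word).
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_words(identifier: str) -> List[str]:
    """Split an identifier into lowercase words on case changes and non-letters."""
    return [w.lower() for w in WORD.findall(identifier)]


print(split_words("First10HTMLTokens"))  # ['first', 'html', 'tokens']
print(split_words("SCREAMING_CASE"))     # ['screaming', 'case']
print(split_words("copysign"))           # ['copysign']
```

The lookahead is what implements the acronym special case: in `HTMLTokens` the run of capitals stops before the `T` that starts `Tokens`. The last call shows the limitation noted in the requirements: without a separator or case change, `copysign` stays one word.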

To see this in action, run `typos --identifiers` or `typos --words`.