2021-04-30 21:41:32 -04:00
# Design
## Requirements
Spell checks source code:
- Requires special word-splitting logic to handle situations like hex (`0xDEADBEEF`), `c\nescapes`, `snake_case`, `CamelCase`, `SCREAMING_CASE`, and maybe `arrow-case`.
- Each programming language has its own quirks, like abbreviations, missing word separators (`copysign`), etc.
- Backwards compatibility might require keeping misspelled words.
- Case for proper nouns is irrelevant.
Checking for errors in CI:
- No false-positives.
- On spelling errors, sets the exit code to fail the CI.
- Machine-independent, repo-specific configuration
  - As compared to layered config with the user's system or the command line
Quick feedback and resolution for developers:
- Fix errors for the user.
- Integration into other programs, like editors:
  - `fork`: easy to call into and provides a stable API, including output format
  - linking: either in the language of choice, or bindings can be made to the language of choice.
## Trade Offs
### Corrections vs Dictionaries
Corrections: Known misspellings that map to their corresponding dictionary word
- Ignores unknown typos
- Ignores typos that follow c-escapes if they aren't handled correctly
- Good for unassisted automated correcting
- Fast, can quickly run across large code bases
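
The corrections approach can be sketched as a plain lookup table. This is a minimal illustration, not how `typos` is implemented: the `CORRECTIONS` table and the `correct` helper are hypothetical names, and the word pattern is deliberately naive so that the c-escape problem noted above is visible.

```python
import re

# Hypothetical corrections table: known misspelling -> intended word.
CORRECTIONS = {
    "teh": "the",
    "recieve": "receive",
    "seperate": "separate",
}

# Naive word pattern; it knows nothing about c-escapes or hex literals.
WORD = re.compile(r"[A-Za-z]+")


def correct(text):
    """Replace known misspellings and leave everything else untouched."""

    def fix(match):
        word = match.group(0)
        fixed = CORRECTIONS.get(word.lower())
        if fixed is None:
            return word  # unknown typos are ignored, not flagged
        # Keep a leading capital so "Teh" becomes "The".
        return fixed.capitalize() if word[0].isupper() else fixed

    return WORD.sub(fix, text)
```

With this sketch, `correct("recieve teh data")` yields `"receive the data"`, an unknown typo such as `valeu` passes through unchanged, and `correct(r"\nteh")` also passes through unchanged because the escape's `n` is read as part of the word `nteh`.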
Dictionary: A confidence rating is given for how close a word is to one in a dictionary
- Sensitive to false positives due to hex numbers and c-escapes
- Used in word processors and other traditional spell checking applications
- Good when there is a UI to let the user know and override any decisions
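
For contrast, a dictionary-based check might look like the sketch below. It assumes a tiny hypothetical `DICTIONARY` and uses the similarity ratio from Python's `difflib.SequenceMatcher` as a stand-in for the confidence rating; real spell checkers typically use other distance measures.

```python
import difflib

# Hypothetical dictionary; a real one would hold many thousands of words.
DICTIONARY = ["receive", "separate", "token", "default"]


def suggest(word, cutoff=0.8):
    """Return (candidate, score) pairs, best first, for an unknown word.

    An empty list means the word is in the dictionary or nothing is
    close enough to it.
    """
    lowered = word.lower()
    if lowered in DICTIONARY:
        return []
    scored = [
        (candidate, difflib.SequenceMatcher(None, lowered, candidate).ratio())
        for candidate in DICTIONARY
    ]
    close = [(candidate, score) for candidate, score in scored if score >= cutoff]
    return sorted(close, key=lambda pair: pair[1], reverse=True)
```

Every token that is not in the dictionary gets scored, including fragments of hex numbers and c-escapes, which is where the false positives come from. Lowering `cutoff` makes the check more eager to suggest a replacement, so a UI that lets the user confirm or reject each suggestion becomes more important.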
## Identifiers and Words
With a focus on spell checking source code, most text will be in the form of
identifiers that are made up of words conjoined via `snake_case`, `CamelCase`,
etc. A typo at the word level might not be a typo as part of
an identifier, so identifiers get checked and, if not in a dictionary, will
then be split into words to be checked.
Identifiers are defined using
[Unicode's `XID_Continue`](https://www.unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers)
which includes `[a-zA-Z0-9_]`.
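
As a rough illustration of identifier extraction, the sketch below uses only the ASCII subset `[a-zA-Z0-9_]` mentioned above, so non-ASCII identifiers covered by `XID_Continue` are out of scope. The `identifiers` helper is a hypothetical name for this example.

```python
import re

# ASCII-only approximation of XID_Continue (an assumption of this sketch).
IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def identifiers(line):
    """Return every maximal run of identifier characters in a line."""
    return IDENTIFIER.findall(line)
```

For the line `let max_len = copysign(x, 0xDEADBEEF);` this returns `let`, `max_len`, `copysign`, `x`, and `0xDEADBEEF`. The hex literal comes out looking like any other identifier, which is why it needs the special handling listed in the requirements.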
Words are split from identifiers on case changes as well as breaks in
`[a-zA-Z]` with a special case to handle acronyms. For example,
`First10HTMLTokens` would be split as `first`, `html`, `tokens`.
To see this in action, run `typos --identifiers` or `typos --words`.
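
The splitting rule can be approximated with a single regular expression. This is a sketch under the assumption that only ASCII letters matter, and `split_words` is a hypothetical helper, not the tool's actual implementation. A run of capitals counts as an acronym unless its last capital starts a following lowercase word.

```python
import re

# Either an acronym (capitals not followed by a lowercase letter)
# or an optionally capitalized lowercase word. Digits and underscores
# match neither alternative, so they act as separators.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_words(identifier):
    """Split an identifier into lowercase words."""
    return [word.lower() for word in WORD.findall(identifier)]
```

Here `split_words("First10HTMLTokens")` gives `first`, `html`, `tokens`, matching the example above, while `copysign` stays a single word because it has no case change or separator to split on.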