mirror of https://github.com/crate-ci/typos.git synced 2024-11-22 09:01:04 -05:00

Ed Page 59c4713e8b docs(ref): Further clarify identifiers and words

This supersedes #648

2023-01-03 07:06:08 -06:00

2.2 KiB

Design

Requirements

Spell checks source code:

Requires special word-splitting logic to handle cases like hex literals (0xDEADBEEF), C escapes (c\nescapes), snake_case, CamelCase, SCREAMING_CASE, and maybe arrow-case.
Each programming language has its own quirks, like abbreviations, lack of a word separator (copysign), etc.
Backwards compatibility might require keeping misspelled words.
Case for proper nouns is irrelevant.
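
As an illustration of the hex requirement above, a minimal sketch of a check that lets a tokenizer skip hex literals before any spell checking happens. The function name `is_hex_literal` is hypothetical and this is not how typos itself implements the rule:

```rust
/// Returns true when `token` looks like a `0x`-prefixed hex literal.
/// Hypothetical helper for illustration only.
fn is_hex_literal(token: &str) -> bool {
    // Accept either `0x` or `0X` as the prefix.
    match token.strip_prefix("0x").or_else(|| token.strip_prefix("0X")) {
        // Require at least one digit, and only hex digits after the prefix.
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_hexdigit()),
        None => false,
    }
}
```

With this check, `0xDEADBEEF` is skipped instead of being reported as a misspelling, while a bare `deadbeef` still reaches the spell checker.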

Checking for errors in CI:

No false positives.
On spelling errors, sets the exit code to fail the CI.

Quick feedback and resolution for developer:

Fix errors for the user.
Integration into other programs, like editors:
- fork: easy to call as a subprocess, with a stable API that includes the output format
- linking: either used directly from the language of choice, or through bindings made for that language.

Trade Offs

Corrections vs Dictionaries

Corrections: Known misspellings that map to their corresponding dictionary word

Ignores unknown typos
Ignores typos that follow c-escapes if they aren't handled correctly
Good for unassisted automated correcting
Fast, can quickly run across large code bases
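
A minimal sketch of the corrections approach, assuming a tiny hand-written table. The entries and the names `corrections` and `correct` are made up for illustration and are not the word list or API that typos ships:

```rust
use std::collections::HashMap;

/// Illustrative table of known misspellings and their corrections.
fn corrections() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("teh", "the"),
        ("recieve", "receive"),
        ("seperate", "separate"),
    ])
}

/// Looks up `word` case-insensitively. Anything not in the table,
/// including unknown typos, yields `None` and is left alone.
fn correct(word: &str, table: &HashMap<&'static str, &'static str>) -> Option<&'static str> {
    table.get(word.to_ascii_lowercase().as_str()).copied()
}
```

Each word costs a single hash lookup, which is why this approach stays fast on large code bases, and a miss is never reported, which is why unknown typos go unnoticed.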

Dictionary: A confidence rating is given for how close a word is to one in a dictionary

Sensitive to false positives due to hex numbers and c-escapes
Used in word processors and other traditional spell checking applications
Good when there is a UI to let the user know and override any decisions
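
A minimal sketch of the dictionary approach, assuming Levenshtein edit distance as the closeness measure and a confidence of `1 - distance / longest_length`. Both choices, and the names `edit_distance` and `closest`, are assumptions made for illustration. Real spell checkers use more elaborate scoring:

```rust
/// Levenshtein edit distance between two strings, counted in chars.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut row = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let insertion = row[j] + 1;
            let deletion = prev[j + 1] + 1;
            row.push(substitution.min(insertion).min(deletion));
        }
        prev = row;
    }
    prev[b.len()]
}

/// Returns the closest dictionary entry with a confidence in [0, 1],
/// or `None` for an empty dictionary.
fn closest<'d>(word: &str, dictionary: &[&'d str]) -> Option<(&'d str, f64)> {
    dictionary
        .iter()
        .map(|&entry| {
            let longest = word.chars().count().max(entry.chars().count()).max(1);
            let confidence = 1.0 - edit_distance(word, entry) as f64 / longest as f64;
            (entry, confidence)
        })
        .max_by(|x, y| x.1.total_cmp(&y.1))
}
```

Under these assumptions, the hex-looking token `deadbeef` scores 0.75 against the dictionary word `deadbeat`, which shows how this approach produces false positives unless a user can review and override the suggestion.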

Identifiers and Words

With a focus on spell checking source code, most text will be in the form of identifiers that are made up of words conjoined via snake_case, CamelCase, etc. A typo at the word level might not be a typo as part of an identifier, so each identifier is checked first and, only if it is not in a dictionary, split into words that are then checked individually.

Identifiers are defined using Unicode's XID_Continue, which includes [a-zA-Z0-9_].
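
A minimal sketch of identifier extraction using only the standard library. Note the assumption: `char::is_alphanumeric` plus `_` is used here as an approximation of XID_Continue, not the exact Unicode property, and the function names are hypothetical:

```rust
/// Approximates XID_Continue with std-only checks (an assumption for
/// this sketch; the real Unicode property differs in some characters).
fn is_identifier_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Splits `text` into maximal runs of identifier characters.
fn identifiers(text: &str) -> Vec<&str> {
    text.split(|c: char| !is_identifier_char(c))
        .filter(|run| !run.is_empty())
        .collect()
}
```

For the input `let foo_bar = Baz9::new();` this yields `let`, `foo_bar`, `Baz9`, and `new`.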

Words are split from identifiers on case changes as well as breaks in [a-zA-Z] with a special case to handle acronyms. For example, First10HTMLTokens would be split as first, html, tokens.
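
The splitting rule can be sketched as follows. This is a simplified illustration restricted to ASCII letters, with a hypothetical function name `split_words`. It is not the tokenizer that typos actually uses:

```rust
/// Splits an identifier into lowercase words on case changes and on
/// breaks in [a-zA-Z], keeping acronyms such as `HTML` together.
fn split_words(identifier: &str) -> Vec<String> {
    let chars: Vec<char> = identifier.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();

    for (i, &c) in chars.iter().enumerate() {
        if !c.is_ascii_alphabetic() {
            // Digits, underscores, etc. end the current word and are dropped.
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            continue;
        }
        let prev = if i > 0 { Some(chars[i - 1]) } else { None };
        let next = chars.get(i + 1).copied();
        let boundary = match prev {
            // lower -> upper: `camelCase` splits before `C`.
            Some(p) if p.is_ascii_lowercase() && c.is_ascii_uppercase() => true,
            // Acronym followed by a word: `HTMLTokens` splits before `T`.
            Some(p) if p.is_ascii_uppercase() && c.is_ascii_uppercase() => {
                next.map_or(false, |n| n.is_ascii_lowercase())
            }
            _ => false,
        };
        if boundary && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        current.push(c.to_ascii_lowercase());
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

Calling `split_words("First10HTMLTokens")` returns `first`, `html`, `tokens`, matching the example above. An identifier with no separator, such as `copysign`, comes back as a single word.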

To see this in action, run typos --identifiers or typos --words.
