Previous method misaligns highlights when there are double width asian characters
```
39 | 한글 eglish
| ^^^^^^
```
This commit fixes the highlight to have correct alignment.
```
39 | 한글 eglish
| ^^^^^^
```
`unicode-rs` crate is used by the Rust compiler [1].
[1]: 34a6c9f26e/compiler/rustc_errors/src/emitter.rs (L861)
`go.mod` seems to be a specification file which we tend to lump in with
the language itself since a weirdly spell dependency will likely show up
in code.
`go.sum` seems to be like a lock file which we quarantine into its own
file type.
Fixes#458
First, this centralizes the concept of lock files, focusing on intent,
rather than syntax. We are assuming `requirements.txt` for Python is
being used like a regular lock file and not as a dependency
specification.
Second, we then ignore the content. Though a lock file will generally
contain things that could show up in a dependency specification, the
large dependency trees make that harder to manage. We still have the
dependency specification file which will match with the users code.
Fixes#445
For `rg`, keeping the file types strict makes sense, For spell
checking, `Cargo.toml` is a lot more closely related in handling to
`*.rs` than it is to `pyproject.toml` due to ecosystem package names.
Part of #362
This cuts varcon lookup times in half but I still suspect slower than
phf. Like with bsearch and unlike, the cost is consistent between hits
and misses.
At least this doesn't have the compile hit of PHF + unicase. Maybe I
should experiment with integrating a non-const-fn variant of unicase
with PHF and give up on all of this extra complexity.
Before, only some dicts did we guarentee were pre-sorted. Now, all are
for-sure pre-sorted.
This also gives each dict the size-check to avoid lookup.
But this is really about refactoring in prep for playing with other
lookup options, like tries.
This is prep for other items to be ignored
BREAKING CHANGE: `TokenizerBuilder` no longer takes config for ignoring
tokens. Related, we now ignore token-ignore config flags.
This is mostly to give implementation flexibility for changing out how
we store the data to reduce compilation memory usage.
This does have performance impact, jumping from ~220ns to ~320ns for a
dict lookup, according to our micro benchmarks.
When rendering typos, we look up what visual column the typoe starts on
but I mixed a raw byte offset with the offset into a lossy string. This
caused panics when dealing with non-ascii content.
Fixes#258
I didn't count on how `buffer.lines().count()` would handle different
corner cases, like slices with no `\n`, slices with trailing `\n`, etc.
Fixes#259
We want both CLI and config ignores. The question then is what we make
them relative to. I decided to favor CLI with `.`. We'll see how this
works out
Fixes#134
Variant support slows us down by 10-50$. I assume most people will run
with `en` and so most of this overhead is to waste. So instead of
merging vars with dict, let's instead get a quick win by just skipping
vars when we don't need to. If the assumptions behind this change over
time or if there is need for speeding up a specific locale, we can
re-address this.
Before:
```
check_file/Typos/code time: [35.860 us 36.021 us 36.187 us]
thrpt: [8.0117 MiB/s 8.0486 MiB/s 8.0846 MiB/s]
check_file/Typos/corpus time: [26.966 ms 27.215 ms 27.521 ms]
thrpt: [21.127 MiB/s 21.365 MiB/s 21.562 MiB/s]
```
After:
```
check_file/Typos/code time: [33.837 us 33.928 us 34.031 us]
thrpt: [8.5191 MiB/s 8.5452 MiB/s 8.5680 MiB/s]
check_file/Typos/corpus time: [17.521 ms 17.620 ms 17.730 ms]
thrpt: [32.794 MiB/s 32.999 MiB/s 33.184 MiB/s]
```
This puts us inline with `--no-default-features --features dict`
Fixes#253
These were found while running `typos` on Linux and inspecting a
sampling of the results. #249 represents additional changes to make.
There were some identifiers, that looked like hardware registers, that
I'm unsure of what can be done for them.