Before, when two file types matched the same glob, the file type that
won was non-deterministic.
Now, "the more specific" file type wins. What this means is that we
break up the file by its extensions and prioritize the more literal glob:
- If it's just `*`, then it's the lowest priority
- If it contains `*` and other logic, then it's next
- If it doesn't contain a `*`, then it's the highest priority
This leaves out other glob syntax like `{one,two}`, as those are
closed-ended and so still considered specific.
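
The priority rules above can be sketched roughly as follows. This is only
an illustration: `Specificity`, `specificity`, and `pick` are hypothetical
names, and the sketch ranks whole globs rather than breaking the file name
up by its extensions.

```rust
/// Hypothetical ranking of a glob by how literal it is (higher wins).
/// The derived `Ord` follows declaration order: Wildcard < Partial < Literal.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Specificity {
    /// The glob is just `*`.
    Wildcard,
    /// The glob mixes `*` with other text, e.g. `*.toml`.
    Partial,
    /// The glob has no `*`, e.g. `Cargo.toml` or `{one,two}`.
    Literal,
}

fn specificity(glob: &str) -> Specificity {
    if glob == "*" {
        Specificity::Wildcard
    } else if glob.contains('*') {
        Specificity::Partial
    } else {
        Specificity::Literal
    }
}

/// Given `(file type name, matching glob)` pairs, return the name whose
/// glob is most specific. On ties, `max_by_key` keeps the last candidate,
/// so the result is deterministic for a given input order.
fn pick<'a>(candidates: &[(&'a str, &'a str)]) -> Option<&'a str> {
    candidates
        .iter()
        .max_by_key(|(_, glob)| specificity(glob))
        .map(|(name, _)| *name)
}
```

With this sketch, a file matched by both `*.toml` and `Cargo.toml` would
resolve to the file type that owns `Cargo.toml`.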
Fixes #487
The previous method misaligned highlights when a line contained
double-width Asian characters.
```
39 | 한글 eglish
   |    ^^^^^^
```
This commit fixes the highlight to have correct alignment.
```
39 | 한글 eglish
   |      ^^^^^^
```
The `unicode-width` crate from unicode-rs is used by the Rust compiler [1].
[1]: 34a6c9f26e/compiler/rustc_errors/src/emitter.rs (L861)
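
As a rough illustration of why display width matters here, the sketch
below pads the caret line by columns rather than by character count.
`char_width` is a hypothetical, deliberately incomplete stand-in for a
real width table (the `unicode-width` crate exposes this as
`UnicodeWidthStr::width`); the function names and ranges are assumptions
made for this example only.

```rust
/// Hypothetical approximation: Hangul Jamo, kana, CJK ideographs, and
/// Hangul syllables take two columns; everything else takes one.
/// A real implementation should use a complete width table instead.
fn char_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115F | 0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7A3 => 2,
        _ => 1,
    }
}

/// Number of terminal columns `s` occupies under the approximation above.
fn display_width(s: &str) -> usize {
    s.chars().map(char_width).sum()
}

/// Build the caret line for a typo at byte range `start..end` of `line`.
/// Both offsets must fall on character boundaries, or slicing panics.
fn caret_line(line: &str, start: usize, end: usize) -> String {
    let pad = display_width(&line[..start]);
    let len = display_width(&line[start..end]);
    format!("{}{}", " ".repeat(pad), "^".repeat(len))
}
```

For `한글 eglish`, the typo starts at byte 7 but at column 5, so the
carets are padded with five spaces instead of the three a plain
character count would give.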
`go.mod` seems to be a specification file which we tend to lump in with
the language itself since a weirdly spelled dependency will likely show up
in code.
`go.sum` seems to be like a lock file which we quarantine into its own
file type.
Fixes #458
First, this centralizes the concept of lock files, focusing on intent,
rather than syntax. We are assuming `requirements.txt` for Python is
being used like a regular lock file and not as a dependency
specification.
Second, we then ignore the content. Though a lock file will generally
contain things that could show up in a dependency specification, the
large dependency trees make that harder to manage. We still have the
dependency specification file which will match with the user's code.
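
A minimal sketch of the idea, assuming a hypothetical `Handling` enum and
an illustrative `LOCK_FILES` list (only `go.sum` and `requirements.txt`
come from these commit messages; `Cargo.lock` is added as an assumed
example and is not the project's actual table):

```rust
/// Hypothetical outcome of classifying a file by its name.
#[derive(Debug, PartialEq, Eq)]
enum Handling {
    /// Spell check the content as usual.
    Check,
    /// Treated as a lock file: the content is ignored.
    SkipLock,
}

/// Illustrative list only. Grouping is by intent (lock file), not syntax.
const LOCK_FILES: &[&str] = &["Cargo.lock", "go.sum", "requirements.txt"];

fn handling(file_name: &str) -> Handling {
    if LOCK_FILES.contains(&file_name) {
        Handling::SkipLock
    } else {
        // Dependency specifications such as `go.mod` stay checked so that
        // their names line up with what appears in the user's code.
        Handling::Check
    }
}
```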
Fixes #445
For `rg`, keeping the file types strict makes sense. For spell
checking, `Cargo.toml` is a lot more closely related in handling to
`*.rs` than it is to `pyproject.toml` due to ecosystem package names.
Part of #362
This cuts varcon lookup times in half, but I still suspect it is slower
than phf. As with bsearch, the cost is consistent between hits and
misses.
At least this doesn't have the compile hit of PHF + unicase. Maybe I
should experiment with integrating a non-const-fn variant of unicase
with PHF and give up on all of this extra complexity.
Before, we only guaranteed that some dicts were pre-sorted. Now, all are
guaranteed to be pre-sorted.
This also gives each dict the size check that lets us avoid a lookup.
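
A minimal sketch of what a pre-sorted dict with a size check might look
like. `SortedDict`, its fields, and the sample `DICT` entries are
hypothetical and only illustrate the technique (binary search over a
sorted slice, guarded by a key-length range), not the actual generated
code.

```rust
/// Hypothetical dictionary whose entries are sorted by key at build time.
struct SortedDict {
    entries: &'static [(&'static str, &'static str)],
    /// Shortest and longest key length in bytes.
    min_len: usize,
    max_len: usize,
}

impl SortedDict {
    fn find(&self, word: &str) -> Option<&'static str> {
        // Size check: a word outside the key-length range cannot match,
        // so the search is skipped entirely.
        if word.len() < self.min_len || word.len() > self.max_len {
            return None;
        }
        // Binary search is only valid because `entries` is pre-sorted.
        self.entries
            .binary_search_by(|(key, _)| (*key).cmp(word))
            .ok()
            .map(|i| self.entries[i].1)
    }
}

/// Sample data for illustration, sorted by key.
static DICT: SortedDict = SortedDict {
    entries: &[
        ("eglish", "english"),
        ("guarentee", "guarantee"),
        ("typoe", "typo"),
    ],
    min_len: 5,
    max_len: 9,
};
```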
But this is really about refactoring in prep for playing with other
lookup options, like tries.
This is prep for other items to be ignored.
BREAKING CHANGE: `TokenizerBuilder` no longer takes config for ignoring
tokens. Related, we now ignore token-ignore config flags.
This is mostly to give implementation flexibility for changing out how
we store the data to reduce compilation memory usage.
This does have a performance impact, jumping from ~220ns to ~320ns for a
dict lookup, according to our micro benchmarks.
When rendering typos, we look up what visual column the typo starts on,
but I mixed a raw byte offset with the offset into a lossy string. This
caused panics when dealing with non-ASCII content.
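
To illustrate the class of bug: a lossy conversion replaces each invalid
sequence with U+FFFD, which is three bytes long, so offsets into the raw
buffer no longer line up with offsets into the lossy string and can land
inside a character. One way to keep the two apart, sketched with a
hypothetical `column_from_raw_offset` helper, is to slice the raw bytes
first and only then decode. Counting characters here is a simple stand-in
for a column.

```rust
/// Column (counted in characters) of a typo that starts at `byte_offset`
/// in the raw bytes of `line`. The raw offset is applied to the raw
/// buffer only and never used to index the lossy string.
fn column_from_raw_offset(line: &[u8], byte_offset: usize) -> usize {
    String::from_utf8_lossy(&line[..byte_offset])
        .chars()
        .count()
}
```

For the input `b"\xFFtypoe"`, the typo starts at raw offset 1, while the
same position in the lossy string is byte 3, so indexing the lossy string
with the raw offset would split U+FFFD and panic.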
Fixes #258