Commit graph

93 commits

Author SHA1 Message Date
Ed Page
c8d1058a71 refactor(dict): Change typos-dict to trie
This is +/- 15%, depending on the benchmark.
2021-07-01 10:41:56 -05:00
Ed Page
bbbf985777 perf(dict): Switch varcon to a burst-trie
This cuts varcon lookup times in half, but I suspect it is still slower than
phf.  Like with bsearch and unlike, the cost is consistent between hits
and misses.

At least this doesn't have the compile hit of PHF + unicase.  Maybe I
should experiment with integrating a non-const-fn variant of unicase
with PHF and give up on all of this extra complexity.
2021-06-30 21:03:57 -05:00
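
As a rough illustration of the lookup shape described in bbbf985777, the sketch below walks a trie byte by byte and finishes with a binary search in a small sorted bucket. That is the general idea behind a burst-trie, not the structure the code generator actually emits. `Node`, `find`, `sample_root`, and the sample words are all hypothetical names chosen for this example.

```rust
/// Hypothetical node type, for illustration only.
pub enum Node {
    /// Inner node: children sorted by their leading byte, plus an optional
    /// correction for a word that ends exactly at this node.
    Branch {
        value: Option<&'static str>,
        children: Vec<(u8, Node)>,
    },
    /// Leaf bucket: the remaining suffixes, sorted, each with its correction.
    Bucket(Vec<(&'static str, &'static str)>),
}

/// Consume one byte per `Branch`, then binary-search the `Bucket`.
pub fn find(mut node: &Node, word: &str) -> Option<&'static str> {
    let mut rest = word.as_bytes();
    loop {
        match node {
            Node::Branch { value, children } => match rest.split_first() {
                // The word ended at this node.
                None => return *value,
                Some((first, tail)) => {
                    let idx = children
                        .binary_search_by_key(first, |(byte, _)| *byte)
                        .ok()?;
                    node = &children[idx].1;
                    rest = tail;
                }
            },
            Node::Bucket(entries) => {
                let idx = entries
                    .binary_search_by(|(suffix, _)| suffix.as_bytes().cmp(rest))
                    .ok()?;
                return Some(entries[idx].1);
            }
        }
    }
}

/// Tiny hand-built example; a real table would be generated.
pub fn sample_root() -> Node {
    Node::Branch {
        value: None,
        children: vec![
            (b'a', Node::Bucket(vec![("bandonned", "abandoned")])),
            (b't', Node::Bucket(vec![("eh", "the"), ("her", "there")])),
        ],
    }
}
```

In this sketch a hit and a miss do the same work: one branch step and one bucket search.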
Ed Page
908f9d44eb refactor(dict): Be more cache conscious 2021-06-30 19:56:03 -05:00
Ed Page
f176055834 refactor(dict): Make room for trie logic 2021-06-30 19:56:03 -05:00
Ed Page
a1e95bc7c0 refactor(dict): Pull out table-lookup logic
Before, we only guaranteed that some dicts were pre-sorted.  Now, all are
guaranteed to be pre-sorted.

This also gives each dict the size-check to avoid lookup.

But this is really about refactoring in prep for playing with other
lookup options, like tries.
2021-06-30 10:12:17 -05:00
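
A minimal sketch of the table lookup described in a1e95bc7c0, assuming a pre-sorted slice of `(typo, correction)` pairs and a stored range of key lengths. `Table`, its fields, and `find` are hypothetical names, not the crate's API.

```rust
use std::ops::RangeInclusive;

/// Hypothetical pre-sorted lookup table.
pub struct Table {
    /// Must be sorted by key so binary search is valid.
    pub entries: &'static [(&'static str, &'static str)],
    /// Shortest and longest key length present in `entries`.
    pub len_range: RangeInclusive<usize>,
}

impl Table {
    pub fn find(&self, word: &str) -> Option<&'static str> {
        // Size check: a word shorter or longer than every key cannot match,
        // so the search is skipped entirely.
        if !self.len_range.contains(&word.len()) {
            return None;
        }
        self.entries
            .binary_search_by(|(key, _)| key.cmp(&word))
            .ok()
            .map(|idx| self.entries[idx].1)
    }
}
```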
Ed Page
bfa7888f82 chore: Skip more releases 2021-06-29 15:39:28 -05:00
Ed Page
9149c4765d chore: Release 2021-06-29 15:05:18 -05:00
Ed Page
c83f655109 feat(parser): Ignore URLs
Fixes #288
2021-06-29 14:14:58 -05:00
Ed Page
b673b81146 fix(parser): Ensure we get full base64
We greedily matched separators, including ones that might be part of
base64.  This impacts the length calculation, so we want as much as
possible.
2021-06-29 13:55:46 -05:00
Ed Page
6915d85c0b feat(parser): Ignore emails
This skips a lot of validation for being "good enough" (comment
open/closes matching, etc).

This has a chance of incorrectly matching in languages with `@` as an
operator, like Python, but Python encourages spaces around operators,
so hopefully this won't be a problem.
2021-06-29 13:42:27 -05:00
Ed Page
2a1e6ca0f6 feat(parser): Ignore base64
For now, we hardcoded a min length of 90 bytes to avoid
ambiguity with math operations on variables (generally people use
whitespace anyways).

Fixes #287
2021-06-29 13:25:10 -05:00
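
The minimum-length rule from 2a1e6ca0f6 can be sketched as below. Only the 90-byte threshold comes from the commit message. The alphabet check, the padding handling, and the names `MIN_BASE64_LEN` and `looks_like_base64` are assumptions made for this example, not the parser's actual logic.

```rust
/// Threshold taken from the commit message.
const MIN_BASE64_LEN: usize = 90;

fn is_base64_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'+' || b == b'/'
}

/// Hypothetical helper: true when `token` is long enough and consists only
/// of base64 alphabet bytes, followed by at most two `=` padding bytes.
pub fn looks_like_base64(token: &str) -> bool {
    if token.len() < MIN_BASE64_LEN {
        // Short runs such as `a+b/c` are more likely math on variables.
        return false;
    }
    let body = token.trim_end_matches('=');
    let padding = token.len() - body.len();
    padding <= 2 && body.bytes().all(is_base64_byte)
}
```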
Ed Page
23b6ad5796 feat(parser): Ignore SHA-1+
Fixes #270
2021-06-29 12:20:08 -05:00
Ed Page
8566b31f7b fix(parser): Go ahead and do lower UUIDs
I need this for hash support anyways
2021-06-29 12:13:21 -05:00
Ed Page
85082cdbb1 feat(parser): Ignore UUIDs
We might be able to make this bail out earlier and not accidentally
detect the wrong thing by checking if the hex values are lowercase.  RFC
4122 says that UUIDs must be generated lowercase, while input accepts
any case.  The main issues are risk on the "input" part and the extra
annoyance of writing a custom `is_hex_digit` function.
2021-06-29 12:11:50 -05:00
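
The lowercase-only check discussed in 85082cdbb1 could look roughly like this, assuming the canonical 8-4-4-4-12 textual layout. `is_lower_hex_digit` and `is_lower_uuid` are hypothetical names, and this is a sketch rather than the parser's implementation.

```rust
/// Custom hex check that rejects uppercase `A`-`F`.
fn is_lower_hex_digit(b: u8) -> bool {
    matches!(b, b'0'..=b'9' | b'a'..=b'f')
}

/// True for a 36-byte string with dashes at the canonical positions and
/// lowercase hex digits everywhere else.
pub fn is_lower_uuid(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, &b)| match i {
        8 | 13 | 18 | 23 => b == b'-',
        _ => is_lower_hex_digit(b),
    })
}
```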
Ed Page
32f5e6c682 refactor(typos)!: Bake ignores into parser
This is prep for other items to be ignored

BREAKING CHANGE: `TokenizerBuilder` no longer takes config for ignoring
tokens.  Related, we now ignore token-ignore config flags.
2021-06-29 11:41:25 -05:00
Ed Page
ded90f2387 perf(parser): Auto-detect unicode
For smaller, ascii-only content, this seems to be taking ~30% less time
for parsing.
2021-06-29 05:28:17 -05:00
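
One way to picture the auto-detection in ded90f2387 is to check the buffer once and then pick the cheaper ASCII-only path when possible. The sketch below assumes a simple identifier splitter. `Mode`, `detect_mode`, and `split_identifiers` are hypothetical names, and the real tokenizer is more involved.

```rust
#[derive(Debug, PartialEq)]
pub enum Mode {
    Ascii,
    Unicode,
}

/// One pass over the buffer decides which code path to use.
pub fn detect_mode(buffer: &[u8]) -> Mode {
    if buffer.is_ascii() {
        Mode::Ascii
    } else {
        Mode::Unicode
    }
}

/// Split into identifier-like runs, using ASCII-only classification when
/// the whole buffer is ASCII.
pub fn split_identifiers(buffer: &str) -> Vec<&str> {
    match detect_mode(buffer.as_bytes()) {
        Mode::Ascii => buffer
            .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
            .filter(|s| !s.is_empty())
            .collect(),
        Mode::Unicode => buffer
            .split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .filter(|s| !s.is_empty())
            .collect(),
    }
}
```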
Ed Page
95417f3a41 refactor(parser): Consolidate utf8/ascii logic 2021-06-29 05:10:02 -05:00
Ed Page
83b2804623 fix(ci): Don't fail codegen checks 2021-06-28 14:06:47 -05:00
Ed Page
4066d21790 style: Address clippy 2021-06-28 13:51:06 -05:00
Ed Page
3a4d039c4f chore: Reduce code-gen memory usage
More `const fn` removals to reduce compilation memory use
2021-06-07 08:58:34 -05:00
Ed Page
04f5d40e57 chore: Release 2021-06-05 14:39:37 -05:00
Ed Page
2b1f565eaa refactor(varcon): Remove reliance on const-fn
This dropped RSS (memory usage) from 4GB to 1.5GB when compiling.

The extra `match` could impact performance but not too concerned since
the default is to not look within vars.
2021-06-04 15:01:08 -05:00
Ed Page
b1cf03c7eb refactor(varcon): Move away from PHF
This is mostly to give implementation flexibility for changing out how
we store the data to reduce compilation memory usage.

This does have performance impact, jumping from ~220ns to ~320ns for a
dict lookup, according to our micro benchmarks.
2021-06-04 14:59:46 -05:00
Ed Page
1cb9b37120 chore: Update codespell dict
Based on 2ed354c at https://github.com/codespell-project/codespell
2021-05-22 21:44:56 -05:00
Ed Page
3e66a99674 chore: Release 2021-05-21 20:41:02 -05:00
Ed Page
3995745362 chore: Release 2021-05-21 20:39:12 -05:00
Ed Page
b99f32dea8 perf(dict): Bypass vars when possible
Variant support slows us down by 10-50%.  I assume most people will run
with `en` and so most of this overhead goes to waste.  So instead of
merging vars with dict, let's instead get a quick win by just skipping
vars when we don't need to.  If the assumptions behind this change over
time or if there is need for speeding up a specific locale, we can
re-address this.

Before:
```
check_file/Typos/code   time:   [35.860 us 36.021 us 36.187 us]
                        thrpt:  [8.0117 MiB/s 8.0486 MiB/s 8.0846 MiB/s]
check_file/Typos/corpus time:   [26.966 ms 27.215 ms 27.521 ms]
                        thrpt:  [21.127 MiB/s 21.365 MiB/s 21.562 MiB/s]
```
After:
```
check_file/Typos/code   time:   [33.837 us 33.928 us 34.031 us]
                        thrpt:  [8.5191 MiB/s 8.5452 MiB/s 8.5680 MiB/s]
check_file/Typos/corpus time:   [17.521 ms 17.620 ms 17.730 ms]
                        thrpt:  [32.794 MiB/s 32.999 MiB/s 33.184 MiB/s]
```

This puts us in line with `--no-default-features --features dict`

Fixes #253
2021-05-19 13:55:41 -05:00
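
The bypass described in b99f32dea8 amounts to skipping the variant table whenever the locale does not ask for a specific variant. A minimal sketch follows, with a hypothetical `Locale` enum and toy lookup functions standing in for the real dictionaries.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Locale {
    /// Plain `en`: any English variant is accepted.
    En,
    EnUs,
    EnGb,
}

/// Toy stand-in for the variant (varcon) table.
fn variant_lookup(locale: Locale, word: &str) -> Option<&'static str> {
    match (locale, word) {
        (Locale::EnUs, "colour") => Some("color"),
        (Locale::EnGb, "color") => Some("colour"),
        _ => None,
    }
}

/// Toy stand-in for the main typo dictionary.
fn dict_lookup(word: &str) -> Option<&'static str> {
    match word {
        "teh" => Some("the"),
        _ => None,
    }
}

pub fn correct(locale: Locale, word: &str) -> Option<&'static str> {
    // Only pay for the variant lookup when a specific variant is requested.
    if locale != Locale::En {
        if let Some(fix) = variant_lookup(locale, word) {
            return Some(fix);
        }
    }
    dict_lookup(word)
}
```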
Ed Page
639e65b88a fix(dict): Handle cases from Linux
These were found while running `typos` on Linux and inspecting a
sampling of the results.  #249 represents additional changes to make.
There were some identifiers, that looked like hardware registers, that
I'm unsure of what can be done for them.
2021-05-18 12:02:03 -05:00
Ed Page
fb0dac4297 refactor(dict): Allow 0..n corrections in BuiltIn
The main use case is taking `ther` -> `there` and adding `the` and
`their`.
2021-05-18 12:02:03 -05:00
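
To illustrate what 0..n corrections means in fb0dac4297, the sketch below maps each typo to a slice of candidates, which may be empty. The `ther` entry mirrors the commit message. `TABLE`, `corrections`, and the other entries are made up for this example.

```rust
/// Hypothetical table: each typo maps to zero or more corrections.
const TABLE: &[(&str, &[&str])] = &[
    // Known typo with no good suggestion (illustrative entry).
    ("hardwares", &[]),
    ("teh", &["the"]),
    ("ther", &["there", "the", "their"]),
];

/// `None` means the word is not in the table. `Some(&[])` means it is a
/// known typo without a suggested fix.
pub fn corrections(word: &str) -> Option<&'static [&'static str]> {
    TABLE
        .iter()
        .find(|(typo, _)| *typo == word)
        .map(|(_, fixes)| *fixes)
}
```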
Ed Page
77cfccb392 refactor(varcon): Clarify check's meanings 2021-05-15 19:29:27 -05:00
Ed Page
b830872ad0 chore: Update enumflags2 2021-05-13 10:20:15 -05:00
Ed Page
7c803681c4 chore: Release 2021-05-13 09:58:09 -05:00
Ed Page
3b9061dece
Merge pull request #240 from crate-ci/dependabot/cargo/codegenrs-1.0.0
chore(deps): Bump codegenrs from 0.1.5 to 1.0.0
2021-05-01 09:04:51 -05:00
dependabot[bot]
d72fa7acba
chore(deps): Bump codegenrs from 0.1.5 to 1.0.0
Bumps [codegenrs](https://github.com/crate-ci/codegenrs) from 0.1.5 to 1.0.0.
- [Release notes](https://github.com/crate-ci/codegenrs/releases)
- [Changelog](https://github.com/crate-ci/codegenrs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crate-ci/codegenrs/compare/v0.1.5...v1.0.0)

Signed-off-by: dependabot[bot] <support@github.com>
2021-05-01 07:01:59 +00:00
Ed Page
6216fa0837 fix(dict)!: Clarify word sizes with Ranges
The code was generated with separate min / max, rather than using a
Range and ensuring the API is used correctly.
2021-04-30 21:33:33 -05:00
Ed Page
f40ed5a328 style: Address clippy 2021-04-30 11:37:16 -05:00
Ed Page
517da7ecd2 perf(parser): Allow people to bypass unicode cost 2021-04-29 21:07:59 -05:00
Ed Page
09d2124d0f perf(parser): Limit inner-loop asserts 2021-04-29 18:31:05 -05:00
Ed Page
287c4cbfe9 refactor(parser): Give more impl flexibility 2021-04-29 18:31:05 -05:00
Ed Page
9cbc7410a4 fix(parser)!: Defer to Unicode XID for identifiers
This saves us from having to have configuration for every detail.  If
people need more control, we can offer it later.

Fixes #225
2021-04-29 18:30:57 -05:00
Ed Page
f15cc58f71 fix(parser): Flip leading digits to work correctly 2021-04-29 18:30:14 -05:00
Ed Page
4b94352b7a perf(parser): Try hand-rolled number parsing 2021-04-29 18:30:14 -05:00
Ed Page
6b92e345cc perf(parser): Speed up UTF-8 validation 2021-04-27 21:17:46 -05:00
Ed Page
819702c82f refactor(parser): Unify str/bytes code paths
The main goal is to support replacing the parser with `nom` where I need
access to `str` only functionality.

With crates like simdutf8, this might also offer up performance gains
since they see the biggest benefit when doing large blocks of
validation.
2021-04-27 21:17:43 -05:00
Ed Page
fce11d6c35 refactor(parser)!: Allow short-circuiting word splitting
This is prep for experiments with getting this information ahead of
time.

See #224
2021-04-27 21:17:38 -05:00
Ed Page
9bfb506c6d fix(typos)!: Clarify Case::Uppers name
`Scream` was referring to `SCREAMING_CASE` but outside of that context, I
think `Upper` is more accurate.
2021-04-21 20:36:35 -05:00
Ed Page
1f4c587692 chore({{crate_name}}): Release {{version}} 2021-04-14 19:13:25 -05:00
Ed Page
b4459bef33 chore: Fix readme paths in Cargo.toml 2021-04-13 21:36:47 -05:00
Ed Page
d7978658d4 test(cli): Ensure we apply corrections 2021-04-10 19:13:48 -05:00
Ed Page
b5f606f201 refactor(typos): Simplify the top-level API 2021-03-01 11:50:23 -06:00