Commit graph

226 commits

Author SHA1 Message Date
Ed Page
bd5048def5 fix(parser): Allow backslashes after ignore items
To allow `\\` to start a token, we couldn't let it end a token.  By
switching the terminator to a peek, we can now make it end a token
**and** start a token, allowing us to work better with Windows paths.

Fixes #481
2022-05-10 14:02:54 -05:00
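
The peek idea in this commit can be sketched roughly as follows. This is a minimal illustration in plain Rust, not the crate's actual parser: `is_terminator` and `take_hex_item` are hypothetical names, and the terminator set is an assumption. The point is that the terminator is inspected but left in the remaining input, so a backslash can end one item and still start the next token.

```rust
/// Hypothetical terminator set (an assumption for illustration only).
fn is_terminator(c: char) -> bool {
    c.is_whitespace() || matches!(c, '\\' | '"' | '\'' | ',' | ';' | ')')
}

/// Hypothetical helper: split a leading run of hex digits off `input`,
/// but only if it is followed by a terminator or the end of input.
/// The terminator is peeked, not consumed, so it stays in the remainder.
fn take_hex_item(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_hexdigit())
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    let (item, rest) = input.split_at(end);
    match rest.chars().next() {
        None => Some((item, rest)),
        Some(c) if is_terminator(c) => Some((item, rest)),
        Some(_) => None,
    }
}
```

With this shape, the remainder after `deadbeef` in `deadbeef\nfoo` still begins with the backslash, so a later rule can treat it as the start of an escape sequence.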
Ed Page
1720e7d65e fix(parser): Ignore items at end of input 2022-05-10 13:38:03 -05:00
Ed Page
7e15afe81f test(parser): Add reproduction of #481 2022-05-10 12:58:19 -05:00
Ed Page
4869764f7b test(parser): Remove unclear test case
Unsure why this case is here and it causes difficulties
2022-05-10 12:58:13 -05:00
Ed Page
ad89736832 refactor(parser): Clarify precedence levels 2022-05-10 12:58:08 -05:00
Ed Page
9f623c618b chore: Release 2022-04-28 09:39:14 -05:00
Denis Kasak
29508a689b feat(dict): Add typo identitiy -> identity 2022-04-28 16:24:18 +02:00
Ed Page
dcc3c0b11e chore: Release 2022-04-25 11:49:02 -05:00
Jonas Platte
5f5ef1468d feat(dict): Add 'signign' typo to words.csv 2022-04-25 11:26:08 -05:00
Jonas Platte
bbd71ab434 feat(dict): Add 'unencyrpted' typo to words.csv 2022-04-25 11:25:48 -05:00
SeongChan Lee
4e4f136ec6 Fix tokenizer for uppercase UUID
Microsoft toolchains usually emit UUID/GUID in UPPERCASE
2022-04-25 11:12:25 +09:00
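
A rough sketch of what accepting uppercase UUIDs could look like. The function names are hypothetical, and requiring a single case throughout the whole UUID is an assumption made here for illustration rather than something this commit message states.

```rust
fn is_lower_hex_digit(b: u8) -> bool {
    matches!(b, b'0'..=b'9' | b'a'..=b'f')
}

fn is_upper_hex_digit(b: u8) -> bool {
    matches!(b, b'0'..=b'9' | b'A'..=b'F')
}

/// Check the 8-4-4-4-12 UUID layout using one hex-digit predicate.
fn is_uuid_with(s: &str, is_hex: fn(u8) -> bool) -> bool {
    const GROUPS: [usize; 5] = [8, 4, 4, 4, 12];
    let mut parts = s.split('-');
    for &len in GROUPS.iter() {
        match parts.next() {
            Some(p) if p.len() == len && p.bytes().all(is_hex) => {}
            _ => return false,
        }
    }
    // No extra groups may follow the fifth one.
    parts.next().is_none()
}

/// Accept UUIDs that are entirely lowercase or entirely uppercase.
fn is_uuid(s: &str) -> bool {
    is_uuid_with(s, is_lower_hex_digit) || is_uuid_with(s, is_upper_hex_digit)
}
```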
Ed Page
7d3e9bb070 chore: Release 2022-04-18 09:39:53 -05:00
Ed Page
e63659c208 fix: Ignore CSS colors
Fixes #462
2022-04-18 09:19:44 -05:00
Ed Page
9c273c6cfb
Merge pull request #451 from crate-ci/dependabot/cargo/nom-7.1.1
chore(deps): Bump nom from 7.1.0 to 7.1.1
2022-04-01 09:34:31 -05:00
dependabot[bot]
0281c7023e
chore(deps): Bump nom from 7.1.0 to 7.1.1
Bumps [nom](https://github.com/Geal/nom) from 7.1.0 to 7.1.1.
- [Release notes](https://github.com/Geal/nom/releases)
- [Changelog](https://github.com/Geal/nom/blob/main/CHANGELOG.md)
- [Commits](https://github.com/Geal/nom/compare/7.1.0...7.1.1)

---
updated-dependencies:
- dependency-name: nom
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-04-01 07:02:37 +00:00
dependabot[bot]
40080cb01e
chore(deps): Bump once_cell from 1.9.0 to 1.10.0
Bumps [once_cell](https://github.com/matklad/once_cell) from 1.9.0 to 1.10.0.
- [Release notes](https://github.com/matklad/once_cell/releases)
- [Changelog](https://github.com/matklad/once_cell/blob/master/CHANGELOG.md)
- [Commits](https://github.com/matklad/once_cell/compare/v1.9.0...v1.10.0)

---
updated-dependencies:
- dependency-name: once_cell
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-04-01 07:02:26 +00:00
Ed Page
86c54fffbf style: Update clippy 2022-03-29 15:07:19 -05:00
Ed Page
1d16086495 chore: Release 2022-03-09 08:59:49 -06:00
Ed Page
ab61b33572
Merge pull request #443 from crate-ci/dependabot/cargo/unicode-segmentation-1.9.0
chore(deps): Bump unicode-segmentation from 1.8.0 to 1.9.0
2022-03-01 08:30:25 -06:00
dependabot[bot]
a58b735e5e
chore(deps): Bump unicode-segmentation from 1.8.0 to 1.9.0
Bumps [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) from 1.8.0 to 1.9.0.
- [Release notes](https://github.com/unicode-rs/unicode-segmentation/releases)
- [Commits](https://github.com/unicode-rs/unicode-segmentation/commits)

---
updated-dependencies:
- dependency-name: unicode-segmentation
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-01 07:03:33 +00:00
dependabot[bot]
f3107c4794
chore(deps): Bump clap from 3.0.13 to 3.1.3
Bumps [clap](https://github.com/clap-rs/clap) from 3.0.13 to 3.1.3.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/v3.0.13...v3.1.3)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-03-01 07:03:22 +00:00
Ed Page
b686760935 chore: Release 2022-02-14 09:05:09 -06:00
Ed Page
c3bb4adfa1 fix(parser): Allow commas in urls
Got us closer to https://www.ietf.org/rfc/rfc3986.txt

Fixes #433
2022-02-14 08:49:55 -06:00
Ed Page
09203fd592 fix(parser): Recognize URLs with passwords 2022-02-14 08:21:56 -06:00
Ed Page
05773fe815 chore: Release 2022-02-08 07:12:19 -06:00
Sebastian Neubauer
fa5a724cec feat(dict): Add more typos 2022-02-08 13:41:44 +01:00
Ed Page
8ddb09eff3 chore: Update dependencies 2022-02-01 10:34:12 -06:00
dependabot[bot]
a3f39efdc8
chore(deps): Bump clap from 3.0.0 to 3.0.13
Bumps [clap](https://github.com/clap-rs/clap) from 3.0.0 to 3.0.13.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/clap_complete-v3.0.0...v3.0.13)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-02-01 07:02:33 +00:00
Ed Page
5b7fe620ec chore: Release 2022-01-26 14:32:31 -06:00
Ed Page
a39074fc7f fix(parser): Detect shorter base64 values
This is part of the way to #413.  In that case, they aren't providing
padding though.
2022-01-26 14:18:01 -06:00
Ed Page
2c5f2ecedd chore: Release 2022-01-26 10:01:15 -06:00
Ed Page
3c78d65462 fix(parser): Don't stop on almost-printfs
When we added support for printf interpolation, we had to adjust our
separator matching to not eat the start of printf interpolation.

When doing so, I overlooked the need to still eat it in the catch-all.
If we don't, we then try to read `%` as part of the identifier and bail
out early.

Fixes #411
2022-01-26 09:39:23 -06:00
Ed Page
4b2e66487c chore: Release 2022-01-24 20:35:08 -06:00
Ed Page
0c49c3ea2b fix(parser): Allow markdown formatting around ordinals
Fixes #409
2022-01-24 20:01:06 -06:00
Ed Page
f7fd7c0e42 chore: Release 2022-01-21 10:39:27 -06:00
Ed Page
5598b5b3e9 fix(dict): Workes should also correct to workers
Fixes #402
2022-01-21 10:10:56 -06:00
Ed Page
71b53cb23e chore: Release 2021-12-18 17:52:11 -06:00
Ed Page
5c83dec07b style: Remove unused variable 2021-12-14 15:41:52 -06:00
Ed Page
469a9aedc2 chore: Release 2021-12-14 12:58:03 -06:00
Frank Steffahn
2748d6a148
fix(dict): Typo in Typos (#387) 2021-12-14 12:54:48 -06:00
Ed Page
f99eb040de chore: Update dependencies 2021-12-01 08:05:54 -06:00
Ed Page
3b3a944c93 fix: Detect descrepancy
Found this in the clap code base.
2021-11-24 15:09:01 -06:00
Ed Page
c0e8a2c932 chore: Release 2021-11-16 07:46:33 -06:00
Ed Page
8e29e94060 chore: Update cargo-release 2021-11-16 07:44:08 -06:00
Ed Page
3ca0aed0a7
Merge pull request #374 from Flakebi/fix-escape
Fix multiple escape sequences
2021-11-15 08:18:41 -06:00
Neubauer, Sebastian
3fc6089660 fix: Fix multiple escape sequences
If escape sequences follow straight after each other, there is no
delimiter in-between.
In such a case, parsing previously stopped and did not find any
typos further in the file.
2021-11-15 11:31:53 +01:00
Neubauer, Sebastian
76ec666970 feat(dict): Add more corrections
I encountered these when going through a codebase with another tool.
2021-11-12 23:02:08 +01:00
Ed Page
4f17586d08 chore: Update MSRV 2021-11-08 11:56:01 -06:00
Ed Page
a8ae8a5c26 chore: Update boilerplate 2021-11-08 10:11:02 -06:00
Ed Page
153f570ec9 chore: Release 2021-11-03 11:48:12 -05:00
Ed Page
fcac819478 fix: Address false positives
Hard to say how to handle `doen't` since we don't handle contractions.
For now, I've gone ahead and added corrections to the part of the
contraction.  Hopefully that doesn't confuse people

Part of #362
2021-10-23 08:21:53 -05:00
Ed Page
efae838e5c perf: Remove some function overhead
Unfortunately, almost all of this is for corrections.
2021-09-14 21:09:30 -05:00
Ed Page
3cd24f5cca chore: Release 2021-09-14 10:03:34 -05:00
Ed Page
e20879dae1 fix: Reduce false positives from ordinals
Just ignoring them since our focus is on programmer typos and these
can't be identifiers.  This is simpler and is less work at runtime.

Fixes #331
2021-09-14 08:53:31 -05:00
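
As an illustration of ignoring ordinals outright, a check along these lines would be enough to skip tokens such as `2nd` before any dictionary lookup. `is_ordinal` is a hypothetical name, and the sketch deliberately does not validate that the suffix agrees with the number.

```rust
/// Hypothetical check: one or more ASCII digits followed by an
/// English ordinal suffix (st, nd, rd, th), in any letter case.
fn is_ordinal(token: &str) -> bool {
    let digits_end = token
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(token.len());
    let (digits, suffix) = token.split_at(digits_end);
    !digits.is_empty()
        && matches!(
            suffix.to_ascii_lowercase().as_str(),
            "st" | "nd" | "rd" | "th"
        )
}
```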
Ed Page
92e46848a3 chore: Update dependencies 2021-09-01 06:38:52 -05:00
Ed Page
dbea7ab1e0 chore: Release 2021-08-30 09:16:40 -05:00
Ville Skyttä
4fcd7ba16f feat(dict): Suggest surrounded for surrouned too 2021-08-29 21:22:24 +03:00
Nick Mathewson
739d1a2f7c Ignore hexadecimal "hashes" of length 32 or greater.
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text.  By treating such strings as ignored, we can resist a larger
category of false positives.

Closes #326.
2021-08-20 12:34:59 -04:00
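
The rule described in this commit (same-case hexadecimal, 32 characters or longer) can be sketched as below. `is_ignored_hex_hash` is a hypothetical name and this is not the crate's actual parser code.

```rust
/// Hypothetical check: treat a token as a hash to ignore when it is at
/// least 32 characters long and consists only of same-case hex digits.
fn is_ignored_hex_hash(token: &str) -> bool {
    const MIN_LEN: usize = 32;
    if token.len() < MIN_LEN {
        return false;
    }
    let all_lower = token
        .bytes()
        .all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    let all_upper = token
        .bytes()
        .all(|b| matches!(b, b'0'..=b'9' | b'A'..=b'F'));
    all_lower || all_upper
}
```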
Ed Page
613a0cba4b chore: Iterate on release process 2021-08-16 11:23:25 -05:00
mendess
5747aba05d Add instantialed as a typo for instantiated 2021-08-06 14:33:50 +01:00
Ed Page
2dce866937 chore: Release 2021-08-02 09:55:25 -05:00
Ed Page
a5f0dd8ee9 fix(token): Continue parsing on c-escape 2021-08-02 09:29:10 -05:00
Ed Page
3e5d2e0620
Merge pull request #324 from epage/escape
fix(token): Continue parsing on c-escape
2021-08-02 09:23:42 -05:00
Ed Page
fdeba0e71b fix(token): Continue parsing on c-escape 2021-08-02 09:11:54 -05:00
dependabot[bot]
febcee3332
chore(deps): Bump env_logger from 0.8.4 to 0.9.0
Bumps [env_logger](https://github.com/env-logger-rs/env_logger) from 0.8.4 to 0.9.0.
- [Release notes](https://github.com/env-logger-rs/env_logger/releases)
- [Changelog](https://github.com/env-logger-rs/env_logger/blob/main/CHANGELOG.md)
- [Commits](https://github.com/env-logger-rs/env_logger/compare/v0.8.4...v0.9.0)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-08-01 07:05:08 +00:00
Ed Page
2304fc6735 chore: Release 2021-07-30 12:12:07 -05:00
Ed Page
9a8d41fcb2 chore: Release 2021-07-30 12:09:59 -05:00
Ed Page
2202b7f661 fix(parser): Handle c-escape/printf
Since our goal is 100% confidence in the results, it's better to not
check words than to correct the wrong words.

With that in mind, we'll ignore words after what might be c-escape
sequences (`\nfoo`) or printf substitutions (`%dfoo`).

Fixes #3
2021-07-30 11:30:05 -05:00
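
A minimal sketch of the "skip words after a possible escape" policy, assuming a simplified notion of a word (alphanumerics and `_`). `checkable_words` is a hypothetical helper, not the crate's tokenizer.

```rust
/// Hypothetical helper: return the words worth spell-checking, skipping
/// any word that directly follows `\` or `%`, since it might be part of a
/// c-escape (`\nfoo`) or a printf substitution (`%dfoo`).
fn checkable_words(input: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut prev: Option<char> = None;
    let mut start: Option<usize> = None;
    let mut skip = false;
    for (i, c) in input.char_indices() {
        if c.is_alphanumeric() || c == '_' {
            if start.is_none() {
                start = Some(i);
                // Decide once, at the first character of the word.
                skip = matches!(prev, Some('\\') | Some('%'));
            }
        } else if let Some(s) = start.take() {
            if !skip {
                words.push(&input[s..i]);
            }
        }
        prev = Some(c);
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        if !skip {
            words.push(&input[s..]);
        }
    }
    words
}
```

The trade-off is the one the commit message names: some real words are never checked, in exchange for not correcting text that only looks like a word.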
Ed Page
3049852bfd fix(dict): Avoid contraction false positive
Fixes #317
2021-07-30 10:42:57 -05:00
Ed Page
f60e798a2a chore: Release 2021-07-27 15:31:01 -05:00
Ed Page
3486c23bdb chore: Release 2021-07-27 15:29:18 -05:00
Ed Page
49459cede7 feat(dict): Add more corrections 2021-07-27 14:53:13 -05:00
Ed Page
6037eebfdc style: Clippy 2021-07-27 14:28:16 -05:00
Ed Page
70fbd63b00 fix: Update dictionary 2021-07-27 14:21:00 -05:00
Ed Page
960471ae23 fix: Prevent old typos from coming back 2021-07-27 14:16:13 -05:00
Ed Page
4e99217896 test: Ensure words are stored lowercase 2021-07-27 14:16:12 -05:00
Ed Page
0008713395 test: Ensure words.csv stays sorted 2021-07-27 14:16:12 -05:00
Ed Page
41048d15b3 test: Prevent correcting corrections 2021-07-27 13:58:57 -05:00
Ed Page
fc4ec0e4a1 fix: Correcting to typos 2021-07-27 13:58:57 -05:00
Ed Page
5b29113ec8 refactor(typos): Remove unused calculations
In #293, we moved where we were filtering out results but never
switched from `filter_map` to `map`, so this does that.
2021-07-06 11:08:05 -05:00
Ed Page
7a2a5042a1 refactor(dict): Remove useless entries 2021-07-02 10:24:59 -05:00
Ed Page
4c2f2c434a feat(dict): Shared PHF support 2021-07-01 11:14:30 -05:00
Ed Page
3b43272724 refactor(dict): Separate dictgen concerns 2021-07-01 11:00:33 -05:00
Ed Page
c8d1058a71 refactor(dict): Change typos-dict to trie
This is +/- 15%, depending on the benchmark.
2021-07-01 10:41:56 -05:00
Ed Page
bbbf985777 perf(dict): Switch varcon to a burst-trie
This cuts varcon lookup times in half but I still suspect slower than
phf.  Like with bsearch and unlike, the cost is consistent between hits
and misses.

At least this doesn't have the compile hit of PHF + unicase.  Maybe I
should experiment with integrating a non-const-fn variant of unicase
with PHF and give up on all of this extra complexity.
2021-06-30 21:03:57 -05:00
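
For readers unfamiliar with the data structure, here is a sketch of a plain byte trie. It is not a burst-trie and not the generated code this commit introduces; `TrieNode` and its methods are hypothetical. It only illustrates why lookup cost depends on key length in the same way for hits and misses.

```rust
use std::collections::BTreeMap;

/// Hypothetical plain trie node keyed by bytes.
#[derive(Default)]
struct TrieNode {
    children: BTreeMap<u8, TrieNode>,
    value: Option<&'static str>,
}

impl TrieNode {
    /// Insert `key`, creating intermediate nodes as needed.
    fn insert(&mut self, key: &str, value: &'static str) {
        let mut node = self;
        for b in key.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.value = Some(value);
    }

    /// Walk one node per byte of `key`; stop early if a byte has no child.
    fn find(&self, key: &str) -> Option<&'static str> {
        let mut node = self;
        for b in key.bytes() {
            node = node.children.get(&b)?;
        }
        node.value
    }
}
```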
Ed Page
908f9d44eb refactor(dict): Be more cache conscious 2021-06-30 19:56:03 -05:00
Ed Page
f176055834 refactor(dict): Make room for trie logic 2021-06-30 19:56:03 -05:00
Ed Page
a1e95bc7c0 refactor(dict): Pull out table-lookup logic
Before, we only guaranteed that some dicts were pre-sorted.  Now, all
are for-sure pre-sorted.

This also gives each dict the size-check to avoid lookup.

But this is really about refactoring in prep for playing with other
lookup options, like tries.
2021-06-30 10:12:17 -05:00
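
The combination described here, a pre-sorted table plus a size check that avoids the lookup entirely, might look something like the sketch below. `SortedTable` and its fields are hypothetical names and do not reflect the crate's actual types.

```rust
/// Hypothetical pre-sorted lookup table of (typo, correction) pairs.
struct SortedTable {
    entries: &'static [(&'static str, &'static str)],
    min_len: usize,
    max_len: usize,
}

impl SortedTable {
    /// `entries` must already be sorted by key (checked in debug builds).
    fn new(entries: &'static [(&'static str, &'static str)]) -> Self {
        debug_assert!(entries.windows(2).all(|w| w[0].0 < w[1].0));
        let min_len = entries.iter().map(|(k, _)| k.len()).min().unwrap_or(0);
        let max_len = entries.iter().map(|(k, _)| k.len()).max().unwrap_or(0);
        SortedTable { entries, min_len, max_len }
    }

    fn find(&self, word: &str) -> Option<&'static str> {
        // Size check: skip the search when no key can have this length.
        if word.len() < self.min_len || word.len() > self.max_len {
            return None;
        }
        self.entries
            .binary_search_by(|(key, _)| (*key).cmp(word))
            .ok()
            .map(|i| self.entries[i].1)
    }
}
```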
Ed Page
bfa7888f82 chore: Skip more releases 2021-06-29 15:39:28 -05:00
Ed Page
9149c4765d chore: Release 2021-06-29 15:05:18 -05:00
Ed Page
c83f655109 feat(parser): Ignore URLs
Fixes #288
2021-06-29 14:14:58 -05:00
Ed Page
b673b81146 fix(parser): Ensure we get full base64
We greedily matched separators, including ones that might be part of
base64.  This impacts the length calculation, so we want as much as
possible.
2021-06-29 13:55:46 -05:00
Ed Page
6915d85c0b feat(parser): Ignore emails
This skips a lot of validation for being "good enough" (comment
open/closes matching, etc).

This has a chance of incorrectly matching in languages with `@` as an
operator, like Python, but Python encourages spaces around operators,
so hopefully this won't be a problem.
2021-06-29 13:42:27 -05:00
Ed Page
2a1e6ca0f6 feat(parser): Ignore base64
For now, we hardcoded a min length of 90 bytes to avoid
ambiguity with math operations on variables (generally people use
whitespace anyways).

Fixes #287
2021-06-29 13:25:10 -05:00
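
A rough sketch of a length-gated base64 check in the spirit of this commit. `is_ignored_base64` is a hypothetical name. The 90-byte minimum comes from the commit message, while the multiple-of-4 and padding checks are assumptions added here for illustration.

```rust
/// Hypothetical check: a long run of standard base64 characters with
/// at most two trailing `=` padding characters.
fn is_ignored_base64(token: &str) -> bool {
    const MIN_LEN: usize = 90; // minimum length named in the commit message
    let body = token.trim_end_matches('=');
    let padding = token.len() - body.len();
    token.len() >= MIN_LEN
        && token.len() % 4 == 0 // assumption: padded base64 only
        && padding <= 2
        && body
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```

The high minimum length is what keeps short expressions such as `a+b/c` from being mistaken for encoded data.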
Ed Page
23b6ad5796 feat(parser): Ignore SHA-1+
Fixes #270
2021-06-29 12:20:08 -05:00
Ed Page
8566b31f7b fix(parser): Go ahead and do lower UUIDs
I need this for hash support anyways
2021-06-29 12:13:21 -05:00
Ed Page
85082cdbb1 feat(parser): Ignore UUIDs
We might be able to make this bail out earlier and not accidentally
detect the wrong thing by checking if the hex values are lowercase.  RFC
4122 says that UUIDs must be generated lowercase, while input accepts
any case.  The main issues are risk on the "input" part and the extra
annoyance of writing a custom `is_hex_digit` function.
2021-06-29 12:11:50 -05:00
Ed Page
32f5e6c682 refactor(typos)!: Bake ignores into parser
This is prep for other items to be ignored

BREAKING CHANGE: `TokenizerBuilder` no longer takes config for ignoring
tokens.  Related, we now ignore token-ignore config flags.
2021-06-29 11:41:25 -05:00
Ed Page
ded90f2387 perf(parser): Auto-detect unicode
For smaller, ascii-only content, this seems to be taking ~30% less time
for parsing.
2021-06-29 05:28:17 -05:00
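
The auto-detection idea can be sketched as a single up-front `is_ascii` check that picks between a cheaper ASCII-only predicate and a Unicode-aware one. `split_words` is a hypothetical helper, the word definition is simplified, and the sketch makes no claim to reproduce the performance figure quoted above.

```rust
/// Hypothetical helper: split input into words, choosing the character
/// classification based on whether the whole input is ASCII.
fn split_words(input: &str) -> Vec<&str> {
    if input.is_ascii() {
        // ASCII-only path: simple range checks per character.
        input
            .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
            .filter(|w| !w.is_empty())
            .collect()
    } else {
        // Unicode path: uses the full Unicode alphanumeric classification.
        input
            .split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .filter(|w| !w.is_empty())
            .collect()
    }
}
```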
Ed Page
95417f3a41 refactor(parser): Consolidate utf8/ascii logic 2021-06-29 05:10:02 -05:00