If escape sequences follow straight after each other, there is no
delimiter in-between.
In such a case, parsing previously stopped and did not find any
typos further in the file.
By experimentation (see ticket), it seems that same-case hexadecimal
strings of 32 characters or longer are almost never intended to hold
text. By treating such strings as ignored, we can resist a larger
category of false positives.
Closes#326.
Since our goal is 100% confidence in the results, its better to not
check words than to correct the wrong words.
With that in mind, we'll ignore words after what might be c-escape
sequences (`\nfoo`) or printf substitutions (`%dfoo`).
Fixes#3
This skips a lot of validation for being "good enough" (comment
open/closes matching, etc).
This has a chance of incorrectly matching in languages with `@` as an
operator, like Python, but Python encourages spaces arround operators,
so hopefully this won't be a problem.
For now, we hardcoded a min length of 90 bytes to ensure to avoid
ambiguity with math operations on variables (generally people use
whitespace anyways).
Fixes#287
We might be able to make this bail our earlier and not accidentally
detect the wrong thing by checking if the hex values are lowercase. RFC
4122 says that UUIDs must be generated lowecase, while input accepts
any case. The main issues are risk on the "input" part and the extra
annoyance of writing a custm `is_hex_digit` function.
This is prep for other items to be ignored
BREAKING CHANGE: `TokenizerBuilder` no longer takes config for ignoring
tokens. Related, we now ignore token-ignore config flags.
The main goal is to support replacing the parser with `nom` where I need
access to `str` only functionality.
With crates like simdutf8, this might also offer up performance gains
since they see the biggest benefit when doing large blocks of
validation.