Some of the test data for rinja checks parsing corner cases.
Unfortunately for us, it also hit a performance corner case.
The entire file was a valid email username but without an `@`.
This meant that for every byte, we checked whether every byte after it
was a valid username, then backtracked at the end because no `@` was
found, repeating this until the whole file was read and making the scan
quadratic in the file size.
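For illustration, a minimal sketch of the pathological pattern, not the
actual typos parser (the `is_username_byte` set here is an assumption):
every starting offset re-scans the rest of the input looking for an `@`
that never comes.

```rust
/// Minimal sketch of the pathological scan; not the actual typos parser.
fn count_emails_quadratic(input: &[u8]) -> usize {
    // Assumed set of bytes allowed in an email username.
    fn is_username_byte(b: u8) -> bool {
        b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'%' | b'+' | b'-')
    }

    let mut found = 0;
    for start in 0..input.len() {
        // Consume what looks like a username...
        let mut i = start;
        while i < input.len() && is_username_byte(input[i]) {
            i += 1;
        }
        // ...then require an `@`; without one we backtrack and retry from
        // the next offset, having already scanned to the end of the input.
        if i > start && input.get(i) == Some(&b'@') {
            found += 1;
        }
    }
    found
}

fn main() {
    // A file that is nothing but username-valid bytes hits the worst case:
    // roughly n^2 / 2 byte checks for n bytes of input.
    let input = vec![b'a'; 10_000];
    println!("emails found: {}", count_emails_quadratic(&input));
}
```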
Fixes #1088
Typos primarily works off of identifiers and words. We have built-in
support to detect constructs that span identifiers that should not be
spell checked, like UUIDs, emails, domains, etc. This opens it up to
user-defined identifier-spanning constructs using regexes via
`extend-ignore-re`.
This works differently than any of the previous ways of ignoring things
because the regexes require extra parse passes. Under the assumptions
that (1) actual typos are rare and (2) files relying on
`extend-ignore-re` are rare, we only run these extra parse passes when a
typo is found, causing almost no performance hit in the expected case.
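As a rough sketch of that design, assuming the `regex` crate and
hypothetical `Span` / `find_candidate_typos` names rather than the real
typos internals, the user-supplied patterns only run once the cheap pass
has flagged something:

```rust
use regex::Regex;

/// Hypothetical candidate-typo span; not the real typos types.
struct Span {
    start: usize,
    end: usize,
}

/// Stand-in for the normal (cheap) identifier/word check.
fn find_candidate_typos(line: &str) -> Vec<Span> {
    // Pretend "teh" is the only typo we know about.
    line.match_indices("teh")
        .map(|(start, m)| Span { start, end: start + m.len() })
        .collect()
}

/// The extra parse passes for `extend-ignore-re` only run when the cheap
/// pass actually found something.
fn check_line(line: &str, extend_ignore_re: &[Regex]) -> Vec<Span> {
    let candidates = find_candidate_typos(line);
    if candidates.is_empty() {
        // Expected case: no typo, so the regexes are never evaluated.
        return candidates;
    }
    let ignored: Vec<(usize, usize)> = extend_ignore_re
        .iter()
        .flat_map(|re| re.find_iter(line).map(|m| (m.start(), m.end())))
        .collect();
    candidates
        .into_iter()
        .filter(|c| !ignored.iter().any(|&(s, e)| s <= c.start && c.end <= e))
        .collect()
}

fn main() {
    // e.g. a user-defined pattern for project-specific ticket ids.
    let ignores = vec![Regex::new(r"[A-Z]+-teh-[0-9]+").unwrap()];
    assert!(check_line("see ref ABC-teh-123", &ignores).is_empty());
    assert_eq!(check_line("teh quick fox", &ignores).len(), 1);
}
```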
While this could be used for more generic types of ignores, it isn't the
most maintainable option because the patterns are kept separate from the
source files in
question. Ideally, we'd implement document settings / directives for
these cases (#316).
Previously, we bailed out if the string was too short (<90) and there
were no non-alphabetic base64 bytes present. What that check overlooked
were the padding bytes.
We now also key off of padding bytes to detect that a string is in fact base64
encoded. Like the other cases, there can be false positives but those
strings should show up elsewhere or the compiler will fail.
This was called out in #485.
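A minimal sketch of the resulting heuristic, with an illustrative helper
name and simplified logic rather than the exact typos implementation;
only the <90 length cutoff is taken from the description above:

```rust
/// Illustrative heuristic, not the exact typos check: treat a string as
/// likely base64 if it is long enough, contains base64 bytes that are not
/// alphanumeric (`+`, `/`), or, as of this change, ends in `=` padding.
fn looks_like_base64(s: &str) -> bool {
    const MIN_LEN: usize = 90;
    let in_alphabet = s
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'='));
    if !in_alphabet {
        return false;
    }
    let has_symbols = s.bytes().any(|b| matches!(b, b'+' | b'/'));
    let has_padding = s.ends_with('=');
    s.len() >= MIN_LEN || has_symbols || has_padding
}

fn main() {
    // Short, but the trailing padding gives it away.
    assert!(looks_like_base64("c29tZSBzaG9ydCBzdHJpbmc="));
    // An ordinary identifier: no symbols, no padding, too short.
    assert!(!looks_like_base64("JustAnIdentifier"));
}
```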
To allow `\\` to start a token, we couldn't also let it end a token. By
switching the terminator to a peek, we can now make it both end a token
**and** start the next one, allowing us to handle Windows paths better.
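A minimal sketch of the peek-versus-consume distinction, using a
hand-rolled splitter rather than the real typos tokenizer: the `\` that
ends one token is left in the input so it can also start the next one.

```rust
/// Illustrative splitter, not the real typos tokenizer: a `\` may start a
/// token, and a later `\` ends the current token but is only peeked (left
/// in place) so that it can start the following token.
fn split_on_peeked_backslash(s: &str) -> Vec<&str> {
    let bytes = s.as_bytes();
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i != start {
            // Peeked terminator: end the current token here, but do not
            // consume the `\` so it becomes the start of the next token.
            tokens.push(&s[start..i]);
            start = i;
        }
        i += 1;
    }
    if start < s.len() {
        tokens.push(&s[start..]);
    }
    tokens
}

fn main() {
    // `C:\Users\alice` splits into `C:`, `\Users`, `\alice` instead of the
    // scan stalling after the first `\`.
    let tokens = split_on_peeked_backslash("C:\\Users\\alice");
    assert_eq!(tokens, vec!["C:", "\\Users", "\\alice"]);
}
```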
Fixes #481
When we added support for printf interpolation, we had to adjust our
separator matching so it would not eat the start of a printf
interpolation. In doing so, I overlooked the need to still eat it in the
catch-all.
If we don't, we then try to read `%` as part of the identifier and bail
out early.
Fixes #411
If escape sequences follow directly after one another, there is no
delimiter in between them.
In such a case, parsing previously stopped and did not find any
typos further in the file.
From experimentation (see the ticket), it seems that same-case
hexadecimal strings of 32 characters or longer are almost never intended
to hold text. By ignoring such strings, we avoid a larger category of
false positives.
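A sketch of that heuristic, with an illustrative helper name rather than
the exact typos check: a run of 32 or more hex digits whose letters are
all one case gets skipped.

```rust
/// Illustrative check, not the exact typos implementation: same-case hex
/// runs of 32+ characters (MD5/SHA digests, lockfile hashes, etc.) are
/// almost never prose, so we skip spell checking them.
fn is_ignorable_hex(s: &str) -> bool {
    const MIN_LEN: usize = 32;
    if s.len() < MIN_LEN {
        return false;
    }
    let all_hex = s.chars().all(|c| c.is_ascii_hexdigit());
    let letters_lower = s
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .all(|c| c.is_ascii_lowercase());
    let letters_upper = s
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .all(|c| c.is_ascii_uppercase());
    all_hex && (letters_lower || letters_upper)
}

fn main() {
    // A 32-character lower-case hex digest is ignored...
    assert!(is_ignorable_hex("d41d8cd98f00b204e9800998ecf8427e"));
    // ...while mixed-case hex of the same length is still checked.
    assert!(!is_ignorable_hex("D41d8cd98f00b204e9800998ecf8427e"));
}
```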
Closes #326.
Since our goal is 100% confidence in the results, it's better not to
check words than to correct the wrong words.
With that in mind, we'll ignore words after what might be C-escape
sequences (`\nfoo`) or printf substitutions (`%dfoo`).
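A minimal sketch of that policy, with an illustrative helper rather than
the real typos tokenizer: a word glued onto a possible escape or
substitution prefix is skipped rather than guessed at.

```rust
/// Illustrative check, not the real typos tokenizer: skip a token when it
/// is glued to a possible C escape (`\nfoo`) or printf substitution
/// (`%dfoo`), since we can't tell where the prefix ends and the word
/// begins.
fn should_skip(token: &str) -> bool {
    let mut chars = token.chars();
    matches!(
        (chars.next(), chars.next()),
        // `\n`, `\t`, `\0`, ... or `%d`, `%s`, `%u`, ... glued to text.
        (Some('\\' | '%'), Some(c)) if c.is_ascii_alphanumeric()
    )
}

fn main() {
    assert!(should_skip("\\nfoo")); // might be `\n` + `foo`, might not
    assert!(should_skip("%dfoo")); // might be `%d` + `foo`, might not
    assert!(!should_skip("foo")); // a plain word is still checked
}
```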
Fixes #3