Variant Conversion Info (VarCon)

Version 2019.10.06

Copyright 2000-2016 by Kevin Atkinson (kevina@gnu.org) and Benjamin
Titze (btitze@protonmail.ch).

This package contains information to convert between American,
British, Canadian, and Australian spellings and vocabulary as well as
other variant information.

The latest version can be found at http://wordlist.aspell.net/.

The main data file is varcon.txt. It contains information on the
preferred American, British, and Canadian spelling of a word as well
as other variant information.

Each line contains a mapping between the various spellings of a word.
Words are tagged to indicate where the spelling is used, and each
word/tag pair is separated with a " / ". For example, in the line:
  A Cv: acknowledgment / Av B C: acknowledgement
"acknowledgment" and "acknowledgement" are two spellings of the same
word and "A", "Cv", "B", etc. are the tags. Tags are separated by
spaces and the group of tags is separated from the word with a ": ".
Here, "acknowledgment" is the preferred American spelling (as
indicated by the "A") of the word, and "acknowledgement" is the
preferred Canadian and British spelling ("B" and "C"). However, the
American spelling is sometimes used in Canada (as indicated by "Cv",
where the lowercase "v" indicates a variant form) and the British
spelling is sometimes used in America (as indicated by the "Av").

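The line format just described can be sketched in Python as follows.
This is only an illustration and not the parsing code found in
varcon.pm. The function name parse_entries is made up for this example,
and the sketch assumes the line carries no " | " usage information and
no "#" note.

```python
def parse_entries(line):
    """Split one varcon.txt data line into (tags, word) pairs."""
    entries = []
    # Word/tag pairs are separated by " / ".
    for part in line.split(" / "):
        # The group of tags is separated from the word by ": ".
        tags, word = part.split(": ", 1)
        # Tags within a group are separated by spaces.
        entries.append((tags.split(), word.strip()))
    return entries


for tags, word in parse_entries("A Cv: acknowledgment / Av B C: acknowledgement"):
    print(word, tags)
```

Each pair keeps the tags as a list so that a later step can look at the
spelling category and variant indicator of every tag separately.
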
More generally, each tag consists of a spelling category (for example
"A"), possibly followed by a variant indicator. The spelling
categories are as follows:
  A: American
  B: British "ise" spelling
  Z: British "ize" spelling or OED preferred spelling
  C: Canadian
  D: Australian
  _: Other (variant info based on American dictionaries, never used
     with any of the above).
and the variant indicators are as follows:
  .: equal
  v: variant
  V: seldom used variant
  -: possible variant, should generally not be used
  x: improper variant (should not be used)

The "." or equal variant tags are reserved for special cases when
there is little agreement between dictionaries or when I think the
dictionary is wrong. The "v" indicator is used for most words marked
as variants in the dictionary. However, some variants will be demoted
to a "V", for example if the variant is marked as "also" by
Merriam-Webster, or if only some dictionaries acknowledge the
existence of the variant. "-" is used when the variant is generally not
listed in the dictionary but I could find some evidence of its use, or
when it is marked as an archaic spelling for the word. The "x"
is used when the spelling is almost universally considered a
misspelling, and is only included for completeness.

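As a rough illustration of the tag structure, the following Python
sketch splits a single tag into its spelling category and variant
indicator. The names CATEGORIES, INDICATORS and parse_tag are invented
for this example, and the sketch accepts only the categories and
indicators listed in this README.

```python
CATEGORIES = {
    "A": "American",
    "B": 'British "ise" spelling',
    "Z": 'British "ize" spelling or OED preferred spelling',
    "C": "Canadian",
    "D": "Australian",
    "_": "Other",
}
INDICATORS = {
    ".": "equal",
    "v": "variant",
    "V": "seldom used variant",
    "-": "possible variant",
    "x": "improper variant",
}


def parse_tag(tag):
    """Return (category, indicator); indicator is None for a preferred form."""
    category, indicator = tag[:1], tag[1:]
    if category not in CATEGORIES:
        raise ValueError(f"unknown spelling category in tag {tag!r}")
    if indicator and indicator not in INDICATORS:
        raise ValueError(f"unknown variant indicator in tag {tag!r}")
    return category, indicator or None


for tag in ("A", "Cv", "_."):
    print(tag, parse_tag(tag))
```

Rejecting unknown tags with an error is a choice made for the sketch;
a real tool may prefer to skip or report such lines instead.
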
For Australian English, "v" was used for variants that are widely used
but not preferred, and "V" for all "-or" (vs. "-our") variants and
variants considered "chiefly US".

If there are no tags with the 'Z' spelling category on the line, then
'B' implies 'Z'. Similarly, if there are no 'C' tags, then 'Z' implies
'C'. If there are no 'D' tags, then 'B' implies 'D'.

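The implication rules above could be applied along the following
lines. This Python sketch is an interpretation, not the logic of the
VarCon scripts: the function name expand_implied is made up, and it
assumes that an implied tag keeps the variant indicator of the tag it
is derived from and that the rules are applied in the order given, so
an implied 'Z' can in turn imply 'C'.

```python
def expand_implied(entries):
    """Add implied tags to a list of (tags, word) pairs for one line."""
    result = [(list(tags), word) for tags, word in entries]
    # (source, target): source implies target when the line has no target tag.
    for source, target in (("B", "Z"), ("Z", "C"), ("B", "D")):
        if any(tag[0] == target for tags, _ in result for tag in tags):
            continue  # the line already carries a tag of this category
        for tags, _ in result:
            # Assumption: the variant indicator is carried over unchanged.
            implied = [target + tag[1:] for tag in tags if tag[0] == source]
            tags.extend(implied)
    return result


line = [(["A", "C"], "prize"), (["B"], "prise")]
for tags, word in expand_implied(line):
    print(word, tags)
```

The input list is copied first, so the caller's data is left untouched.
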
For ease of reading and maintaining the data file, lines are
grouped into clusters of closely related words. Each cluster is
uniquely identified by a headword, which is generally the American
spelling of the word on the first line of the cluster. Each cluster
starts with a '#' followed by the headword and some
additional information. For example, the cluster for
acknowledgment is:
  # acknowledgment <verified> (level 35)
  A Cv: acknowledgment / Av B C: acknowledgement
  A Cv: acknowledgments / Av B C: acknowledgements
  A Cv: acknowledgment's / Av B C: acknowledgement's
The "<verified>" tag will be explained later, and "(level 35)"
indicates what level in SCOWL (see http://wordlist.sourceforge.net)
the headword is found in. The levels generally mean the following:
  <= 35: Very common word
  <= 70: Can be found in the dictionary
     80: Likely a valid word, can likely be found in an
         unabridged dictionary
   > 80: May not even be a legal word

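A cluster header could be read with something like the Python sketch
below. The names parse_header and describe_level are invented for this
example. The sketch assumes the headword is everything before the first
" <" or " (", and that levels between 70 and 80 can be treated like
level 80.

```python
import re


def parse_header(line):
    """Parse a header such as '# acknowledgment <verified> (level 35)'."""
    if not line.startswith("# "):
        return None  # not a cluster header ("##" lines are cluster notes)
    rest = line[2:].strip()
    level = re.search(r"\(level (\d+)\)", rest)
    return {
        # Assumption: the headword ends at the first " <" or " (".
        "headword": re.split(r" <| \(", rest, maxsplit=1)[0],
        "verified": "<verified>" in rest,
        "level": int(level.group(1)) if level else None,
    }


def describe_level(level):
    """Rough meaning of a SCOWL level, following the table above."""
    if level <= 35:
        return "very common word"
    if level <= 70:
        return "can be found in the dictionary"
    if level <= 80:
        return "likely a valid word"
    return "may not even be a legal word"


header = parse_header("# acknowledgment <verified> (level 35)")
print(header, describe_level(header["level"]))
```
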
Sometimes the spelling of a word depends on the usage. If so, the word
is listed more than once within a cluster, with any usage information
being indicated after a " | ". For example, here is part of the cluster
for prize:
  A B: prize | reward
  A B: prizes | reward
  A C: prize / B: prise | otherwise
  A C: prizes / B: prises | otherwise
which indicates that the preferred spelling of prize is always with a
"z" when meaning a reward, but otherwise it is spelled with an "s" in
British English. In the example above a brief definition of the word
is given, but often no such attempt is made, and the definition simply
consists of a number, for example:
  A B: sake | :1
  A C: sake / Av B Cv: saki | :2

Sometimes part-of-speech (POS) info is given to help distinguish which
form is used. For example:
  A B C: practice / AV Cv: practise | <N>
  A Cv: practice / AV B C: practise | <V>
POS info is always given in the form "<POS>" and if a definition
is also given the POS info is always first. The POS tags used are as
follows:
  <N>: Noun
  <V>: Verb
  <Adj>: Adjective
  <Adv>: Adverb

A "(-)" before the definition indicates a rarely used or archaic form
of a word, for example:
  A B: bark | :1
  A: bark / Av B: barque | (-) ship

A "--" indicates a note rather than a definition. This is generally
used to indicate that the spelling of the plural form does not depend on
the spelling of the root word, for example:
  _: cabby / _.: cabbie
  _: cabbies | -- plural

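The information that can follow " | " (POS tag, "(-)" marker, "--" note
and definition) might be picked apart as in the Python sketch below.
The function names and the dictionary keys are made up for this
example. The sketch assumes that the POS tag, when present, comes
first and that "(-)" comes directly before the definition, which is
how this README describes them.

```python
import re


def parse_annotation(text):
    """Parse the text that follows ' | ' on a data line."""
    info = {"pos": None, "uncommon": False, "note": None, "definition": None}
    text = text.strip()
    pos = re.match(r"<(\w+)>\s*", text)
    if pos:  # POS info such as <N> or <V> is always first
        info["pos"] = pos.group(1)
        text = text[pos.end():]
    if text.startswith("(-)"):  # rarely used or archaic form
        info["uncommon"] = True
        text = text[3:].strip()
    if text.startswith("--"):  # a note rather than a definition
        info["note"] = text[2:].strip()
    elif text:
        info["definition"] = text
    return info


def split_annotation(line):
    """Return (data, annotation); annotation is None without ' | '."""
    data, sep, annotation = line.partition(" | ")
    return data, parse_annotation(annotation) if sep else None


print(split_annotation("A: bark / Av B: barque | (-) ship"))
```
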
Misc. notes on a particular form of a word are given after a "#" on
the same line. Misc. notes for the cluster are given at the end of
the cluster and are prefixed with "##", for example:
  # coloration <verified> (level 50)
  A B C: coloration / B. Cv: colouration
  A B C: colorations / B. Cv: colourations
  A B C: coloration's / B. Cv: colouration's
  ## OED has coloration as the preferred spelling and discolouration as a
  ## variant for British English for some reason
In the notes, ODE (not to be confused with OED) stands for Oxford
Dictionary of English, "Ox" is used for any Oxford dictionary, and
"M-W" for Merriam-Webster.

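One way to tell headers, cluster notes and data lines apart is
sketched below in Python. The function name classify_line and the
three kind labels are invented for this example, and the note on the
sample entry line is made up, since this README shows no data line
with a trailing "#" note.

```python
def classify_line(line):
    """Return (kind, text, note) for one line of varcon.txt."""
    line = line.rstrip("\n")
    if line.startswith("##"):
        # Misc. notes for the whole cluster.
        return ("cluster-note", line[2:].strip(), None)
    if line.startswith("#"):
        # Start of a cluster: headword plus additional information.
        return ("header", line[1:].strip(), None)
    # Data line, with an optional note after "#".
    text, _, note = line.partition("#")
    return ("entry", text.strip(), note.strip() or None)


sample = [
    "# coloration <verified> (level 50)",
    "A B C: coloration / B. Cv: colouration # made-up note",
    "## OED has coloration as the preferred spelling",
]
for raw in sample:
    print(classify_line(raw))
```
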
Earlier versions of varcon contained numerous errors. With version
5.0 a massive effort has been made to correct many of these errors.
Clusters that have undergone some form of verification (and likely
correction) are marked with "<verified>". As of version 5.0, most
clusters with headwords in common usage (SCOWL level 35 and
below) should now be checked, as well as many others. No effort was
made to check clusters with headwords in SCOWL level 80 and above;
many of those entries are unlikely to be in the dictionary anyway.

The file variant-also.tab contains additional mappings between various
spellings of a word which are not yet in varcon.txt. No attempt is
made to distinguish the primary form of a word. The file
variant-infl.tab is like variant-also.tab except that it is created
automatically from the AGID inflection database. The file
variant-wroot.tab is like variant-infl.tab except that it also
includes the root form of the word.

The file voc.tab is similar to varcon.txt but converts between
vocabulary instead of spelling. Unlike varcon.txt it is a simple tab
separated file with the fields corresponding to the American, British,
and Canadian words. If more than one word is often used to describe
the same thing the words are separated with commas. The last column
contains additional notes on when the word is used. Unlike varcon.txt
it is generally not suitable for automatic conversion.

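Reading a line of voc.tab could look roughly like the Python sketch
below. The function name parse_voc_line is made up, the sample line is
invented for illustration and not copied from voc.tab, and the sketch
assumes exactly four columns in the order American, British, Canadian,
notes.

```python
def parse_voc_line(line):
    """Split one tab-separated voc.tab line into word lists and notes."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (4 - len(fields))  # pad lines with missing columns

    def words(field):
        # Several words for the same thing are separated with commas.
        return [w.strip() for w in field.split(",") if w.strip()]

    return {
        "american": words(fields[0]),
        "british": words(fields[1]),
        "canadian": words(fields[2]),
        "notes": fields[3].strip(),
    }


# Made-up sample line, for illustration only.
print(parse_voc_line("apartment\tflat, apartment\tapartment\tmade-up note"))
```
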
The "make-variant" Perl script will combine varcon.txt,
variant-also.tab, and variant-infl.tab into one huge mapping and will
output the result to "variant.tab". If the "no-infl" option is given
then variant-infl.tab will not be included.

The "split" script will split out the information in varcon.txt into
several word lists named as follows:
  <spelling>[-v<variant level>][-uncommon].lst
where <spelling> is one of: american, british, british_z, canadian,
common, or other. "common" is used for words which appear in
varcon.txt, yet are used in all versions of English, such as "prize",
and "other" is used for the "_" spelling category. The mapping from
the variant indicators in varcon.txt to the numeric variant level is
as follows:
  v => 0
  V => 1
  - => 2
"-uncommon" is used for forms marked with "(-)" as already described.

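The naming scheme for the word lists can be illustrated with a short
Python sketch. The function name list_name is invented and this is not
code from the "split" script. Only the three indicators listed above
are mapped, since this README does not say how "." or "x" are handled.

```python
SPELLINGS = {"american", "british", "british_z", "canadian", "common", "other"}
VARIANT_LEVEL = {"v": 0, "V": 1, "-": 2}


def list_name(spelling, indicator=None, uncommon=False):
    """Build '<spelling>[-v<variant level>][-uncommon].lst'."""
    if spelling not in SPELLINGS:
        raise ValueError(f"unknown spelling {spelling!r}")
    name = spelling
    if indicator is not None:
        # Raises KeyError for indicators this README gives no level for.
        name += f"-v{VARIANT_LEVEL[indicator]}"
    if uncommon:
        name += "-uncommon"
    return name + ".lst"


print(list_name("american"))
print(list_name("british", "V"))
print(list_name("canadian", "-", uncommon=True))
```
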
The "translate" Perl script will translate a text file from one
spelling to another. Its usage is:

  translate <options> [<translation array>] <from> <to>
    <options> is any of
      -?,-h,--help   this screen
      -m,--mark      mark words where the translation is questionable
      -i,--include   include words where the translation is questionable
    <translation array> is the file name of the translation array,
      defaults to "abbc.tab".
    <from> and <to> are one of: american, british, british_z, or canadian.
      british-ise and british-ize can also be used.

Text is read from standard input and written to standard output.
Words are marked with a '?' before and after the questionable word
when the "--mark" option is enabled.

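The marking behavior can be pictured with the toy Python sketch below.
It is not the translate script and makes several assumptions that this
README does not confirm: the function name and parameters are made up,
questionable words are left untranslated unless "include" is set, and
"mark" wraps the original (untranslated) word in '?'. The mapping and
the set of questionable words are supplied by hand here.

```python
import re


def translate(text, mapping, questionable=frozenset(), mark=False, include=False):
    """Replace words using mapping, handling questionable words separately."""

    def replace(match):
        word = match.group(0)
        if word not in mapping:
            return word
        if word in questionable:
            if mark:
                return f"?{word}?"  # '?' before and after the word
            if not include:
                return word  # assumption: skipped unless included
        return mapping[word]

    # Treat runs of letters and apostrophes as words.
    return re.sub(r"[A-Za-z']+", replace, text)


mapping = {"color": "colour", "prize": "prise"}
print(translate("the color of the prize", mapping, {"prize"}, mark=True))
```
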
The file varcon.pm contains some library routines for parsing
varcon.txt and is used by many of the scripts above.

If you discover any errors in these mappings or have suggestions for
additions please file a bug report at
https://github.com/kevina/wordlist/issues, or alternatively email me
directly at kevina@gnu.org, but I will likely tell you to file a bug
report so that I don't forget about it.

SOURCE:

These mappings were compiled from numerous sources.

The abc.tab was originally created from the American and British word
lists found in the Ispell distribution and the Canadian word list
created by Garst R. Reese <reese@isn.net>:

  What I have discovered is that Canadian is a modification of British.
  Canadians use ize ization, izing izable like Americans, and gram instead
  of gramme. The one exception I found was practise. It does not go to
  practize. Otherwise they use British spelling. So, what I am currently
  checking books with is an edited version of British, where I have
  changed all occurrences of ise to ize, isab to izab, isation to ization,
  ising to izing, and gramme to gram except I allow programme, which is
  sometimes proper unless you are talking about a computer program. I did
  bunches of greps to be sure these substitutions would work as expected.

Many other words have been added to abc.tab which were not in the
original Ispell word lists.

Many different web sources were consulted when creating the tables. They
include:

  The American-British British-American Dictionary
    http://www.peak.org/~jeremy/dictionary/dictionary.html
  American and British Spelling Differences
    http://www.peak.org/~jeremy/dictionary/spellcat.html
  Dave (VE7CNV)'s Truly Canadian Dictionary of Canadian Spelling
    http://www.luther.bc.ca/~dave7cnv/cdnspelling/cdnspelling.html
  Canadian Spelling Convention
    http://imej.wfu.edu/articles/1999/1/02/demo/tutorial/canas.html
  Cornerstone's Canadian English Page
    http://www.web.net/cornerstone/cdneng.htm
  Inter-Play Translation: British/Canadian/American Spelling
    http://www.inter-play.com/translation/spel-ukus.htm
  Inter-Play Translation: British/Canadian/American Vocabulary
    http://www.inter-play.com/translation/voc-ukus.htm

As well as several online dictionaries:

  Merriam-Webster: http://www.m-w.com/
  American Heritage: http://www.bartleby.com/61/
  Cambridge (ESL): http://dictionary.cambridge.org/

In version 5.0 a massive effort was made to correct the numerous errors
in VarCon. The primary sources used for verification were:

  Merriam-Webster: http://www.m-w.com/
  Free version of Oxford Dictionaries Online:
    http://www.oxforddictionaries.com/
  Oxford dictionaries available via Oxford Reference Online
  (subscription service, http://www.oxfordreference.com/):
    The New Oxford American Dictionary (2nd edition, 2006)
      and sometimes: The Oxford American Dictionary of Current English (2002)
    The Concise Oxford English Dictionary (11th edition revised, 2008)
      and sometimes: The Oxford Dictionary of English (2nd edition revised, 2005)
    The Canadian Oxford Dictionary (2004)

I also used the Tysto UK vs US spelling list available at:
  http://www.tysto.com/articles05/q1/20050324uk-us.shtml
to make sure I didn't leave out any information in VarCon. However, any
additions from his lists were verified using the dictionaries
mentioned above as his lists contained numerous errors (such as
including archaic spellings of words).

I also made indirect use of Luke's Canadian, British and American
Spelling page available at:
  http://www.lukemastin.com/testing/spelling/cgi-bin/database.cgi?database=spelling
but only to perform some initial verification; in the end I rechecked
his data using the dictionaries above. (However, his data is, by far,
more accurate than Tysto's.)

In Version 2016.11.20 Benjamin Titze added support for Australian English.
The primary sources for this addition were:

  The Macquarie Dictionary: https://www.macquariedictionary.com.au/
  Style Manual: For Authors, Editors and Printers, 6th Edition. DCITA.
  University of Technology Sydney Publications Style Guide:
    http://www.gsu.uts.edu.au/publications/styleguide/spelling.html
  Style Manual, Department of Treasury and Finance, Tasmania:
    http://conference.tasa.org.au/wp-content/uploads/2015/03/Style-Manual.pdf
  Editor Australia - Style Guide:
    http://www.editoraustralia.com/styleguide_spelling.html
  Webster in Australia (history of "our"/"or" spelling variants):
    http://blogs.usyd.edu.au/elac/2008/01/webster_in_australia.html


CHANGELOG:

From 2017.08.24 to 2018.10.06

- Added entries for: eukaryote, prokaryote, virtualization, volcanism

From 2016.11.20 to 2017.08.24

- Typo fixes thanks to Jakub Wilk

From 2016.06.26 to 2016.11.20

- New Australian spelling category thanks to the work of Benjamin
  Titze.

- Various other fixes.

From 2016.01.19 to 2016.06.26

- Fix plural of "bus".

From 2015.08.24 to 2016.01.19

- Undo the effects of PERL_UNICODE in the translate script.

- Other minor fixes and new entries.

From 2015.02.15 to 2015.08.24 (Aug 24, 2015)

- Added entry for Koran/Koranic.

- Tweaked "adviser" cluster.

- Fix formatting problems.

From 2015.01.28 to 2015.02.15 (February 15, 2015)

- Various new entries

From 2014.11.17 to 2015.01.28 (January 28, 2015)

- Minor adjustments to a few entries (analytic, amid)

- Added entry for shareable

- Remove a junk entry (ted/taed).

From 2014.08.11 to 2014.11.17 (November 17, 2014)

- Fix typos in README

- Enhancement to VarCon translate script. It will now, by default,
  filter clusters with a SCOWL level > 80. This behavior can be
  controlled with the new "--thresh" option.

- Remove a few junk entries.

From Revision 5.1 to Version 2014.08.11 (August 8, 2014)

- Various corrections. Most of them minor. Two notable exceptions:

  - Added an entry for furor as the correct British spelling is furore

  - Fixed racket entries as Canadians still use racquet even
    though it is a British English spelling (at least according to the
    Oxford dictionaries)

- Other minor changes.

From Revision 5.0 to Revision 5.1 (January 6, 2010)

- Corrected numerous errors after running various forms
  of verification on varcon.txt.

- Reordered the clusters in varcon.txt so that they are
  mostly in alphabetic order based on the headword.

From Revision 4.1 to Revision 5.0 (December 27, 2010)

- Completely new format for the main table which, in addition to
  providing the preferred spelling of a word for various forms of
  English, also records variant and other information. To reflect
  this change, the file was renamed from abbc.tab to
  varcon.txt.

- Massive effort to verify the variant information against
  authoritative sources (mainly Oxford dictionaries). Most entries
  for common words (SCOWL level 35 and below) have been checked
  against at least a British and Canadian dictionary.

- Added variant information for numerous other words, even when
  there is no difference between the various forms of English.

- Other changes corresponding to the new format.

From Revision 4 to Revision 4.1 (August 10, 2004)

- Fixed various errors in abbc.tab

- Removed clause 4 from the Ispell copyright with permission of Geoff
  Kuenning.

From Revision 3 to Revision 4 (August 7, 2004)

- Added a column to "abc.tab" for the British "ize" spelling and
  renamed the file to abbc.tab.
- Added verb forms of prize/prise to abbc.tab, removed from
  variant-also.tab

From Revision 2 to Revision 3 (January 2, 2003)

- Added an option for not including variant-infl.tab for the
  make-variant Perl script
- Added the file variant-wroot.tab
- Added a few entries given to me by Francis Bond and Edward Betts

From Revision 1 to Revision 2 (January 27, 2001)

- Removed all "B" markers because I could not find any evidence for
  them
- Corrected a few Canadian entries, especially those with the "B"
  markers
- Added some more entries by trying fixed changes (such as ize to
  ise) to words in SCOWL and hand-checking over the ones with semi-common
  words in them.
- Added variant-infl.tab

COPYRIGHT:

Copyright 2000-2018 by Kevin Atkinson

Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Kevin Atkinson makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.

Copyright 2016 by Benjamin Titze

Permission to use, copy, modify, distribute and sell this array, the
associated software, and its documentation for any purpose is hereby
granted without fee, provided that the above copyright notice appears
in all copies and that both that copyright notice and this permission
notice appear in supporting documentation. Benjamin Titze makes no
representations about the suitability of this array for any
purpose. It is provided "as is" without express or implied warranty.

Since the original word lists come from the Ispell distribution:

Copyright 1993, Geoff Kuenning, Granada Hills, CA
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. All modifications to the source code must be clearly marked as
   such. Binary redistributions based on modified source code
   must be clearly marked as modified versions in the documentation
   and/or other materials provided with the distribution.
(clause 4 removed with permission from Geoff Kuenning)
5. The name of Geoff Kuenning may not be used to endorse or promote
   products derived from this software without specific prior
   written permission.

THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.