3:1
soundex: Soundex Hashing
1 Introduction
The soundex package provides a Racket implementation of the Soundex indexing
hash function, as specified somewhat loosely by the US National Archives and
Records Administration (NARA) publication [Soundex] and verified
empirically against test cases from various sources. Both the current
NARA function and the older version, which handles ‘H’ and ‘W’
differently, are supported.
[GIL-55] US National Archives and Records Administration, “Using
the Census Soundex,” General Information Leaflet 55, 1995.
[Soundex] US National Archives and Records Administration, “The
Soundex Indexing System,” 2000-02-19.
Additionally, a nonstandard prefix-guessing function, an invention
of this package, permits additional Soundex keys to be generated from a
string, increasing recall.
2 Characters, Ordinals, and Codes
To facilitate possible future support of other input character sets, this
library employs an abstract character ordinal representation of the letters
used by Soundex. The ordinal value is an integer from 0 to 25, corresponding
to the 26 letters ‘A’ through ‘Z’ respectively, and can be used for fast
mapping via vectors. Most applications need not be aware of this.
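The idea of an ordinal that doubles as a vector index can be sketched as follows. This is a minimal illustration, not the package's internals: the names letter->ordinal, ordinal->letter-table, and ordinal->letter are hypothetical, and only ASCII letters are assumed to count as letters.

```racket
#lang racket/base

;; Hypothetical helper: map an ASCII letter to an ordinal 0..25,
;; or #f for any other character.
(define (letter->ordinal chr)
  (cond [(char<=? #\A chr #\Z) (- (char->integer chr) (char->integer #\A))]
        [(char<=? #\a chr #\z) (- (char->integer chr) (char->integer #\a))]
        [else #f]))

;; A 26-slot vector indexed by ordinal gives constant-time lookup.
;; This table simply maps each ordinal back to its upper-case letter.
(define ordinal->letter-table
  (build-vector 26 (lambda (i) (integer->char (+ i (char->integer #\A))))))

(define (ordinal->letter ord)
  (vector-ref ordinal->letter-table ord))
```

Any per-letter property can be stored in a vector of the same shape, which is what makes the ordinal convenient for fast mapping.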
(soundex-ordinal? x) → boolean?
  x : any/c
Predicate for whether or not x is a Soundex ordinal.
(soundex-ordinal chr) → (or/c soundex-ordinal? #f)
  chr : char?
Yields the Soundex ordinal value of character chr, or #f if the character is not considered a letter.
> (soundex-ordinal #\a)
0
> (soundex-ordinal #\A)
0
> (soundex-ordinal #\Z)
25
> (soundex-ordinal #\3)
#f
> (soundex-ordinal #\.)
#f
(soundex-ordinal->char ord) → char?
  ord : soundex-ordinal?
Yields the upper-case letter character that corresponds to the character
ordinal value ord. For example:
> (soundex-ordinal->char (soundex-ordinal #\a))
#\A
Note that a #f value as a result of applying soundex-ordinal is not an ordinal value, and is not mapped to a character by soundex-ordinal->char. For example:
> (soundex-ordinal->char (soundex-ordinal #\'))
soundex-ordinal->char: contract violation, expected: soundex-ordinal?, given: #f
(soundex-code? x) → boolean?
  x : any/c
Predicate for whether or not x is a Soundex code.
(soundex-ordinal->soundex-code ord) → soundex-code?
  ord : soundex-ordinal?
Yields a library-specific Soundex code for character ordinal ord.
> (soundex-ordinal->soundex-code (soundex-ordinal #\a))
aeiou
> (soundex-ordinal->soundex-code (soundex-ordinal #\c))
#\2
> (soundex-ordinal->soundex-code (soundex-ordinal #\N))
#\5
> (soundex-ordinal->soundex-code (soundex-ordinal #\w))
hw
> (soundex-ordinal->soundex-code (soundex-ordinal #\y))
y
(char->soundex-code chr) → soundex-code?
  chr : char?
Yields a library-specific Soundex code for character chr. This is equivalent to:
(soundex-ordinal->soundex-code (soundex-ordinal chr))
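A code lookup of this kind can be sketched as a 26-slot vector indexed by ordinal. The digit assignments below are the standard Soundex ones. Representing the non-digit codes as the symbols aeiou, hw, and y is an assumption based on the printed results above, and the names code-table, ordinal->code, and char->code are hypothetical. Unlike the composition shown above, this sketch returns #f for a non-letter instead of raising a contract violation.

```racket
#lang racket/base

;; Codes indexed by ordinal 0..25 (A..Z). Digits are characters; the
;; symbols aeiou, hw, and y (an assumed representation) mark letters
;; that receive no digit.
(define code-table
  (vector 'aeiou #\1 #\2 #\3 'aeiou #\1 #\2 'hw 'aeiou #\2 #\2 #\4 #\5
          #\5 'aeiou #\1 #\2 #\6 #\2 #\3 'aeiou #\1 'hw #\2 'y #\2))

(define (ordinal->code ord)
  (vector-ref code-table ord))

;; Hypothetical convenience: character straight to code, #f for non-letters.
(define (char->code chr)
  (define ord
    (cond [(char<=? #\A chr #\Z) (- (char->integer chr) (char->integer #\A))]
          [(char<=? #\a chr #\z) (- (char->integer chr) (char->integer #\a))]
          [else #f]))
  (and ord (ordinal->code ord)))
```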
3 Hashing
Soundex hashes of strings can be generated with soundex-nara, soundex-old, and soundex.
(soundex-nara str) → string?
  str : string?
(soundex-old str) → string?
  str : string?
(soundex str) → string?
  str : string?
Yields a Soundex hash key of string str, or #f if not even an initial letter could be found. soundex-nara generates NARA hashes, and soundex-old generates older-style hashes. soundex is an alias for soundex-nara.
> (soundex-nara "Ashcraft")
"A261"
> (soundex-old "Ashcraft")
"A226"
> (soundex "Ashcraft")
"A261"
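The difference between the two variants shows up in the "Ashcraft" example: under the NARA rules an intervening ‘H’ or ‘W’ does not separate two letters with the same code, while under the older rules it does, as a vowel would. The following is a self-contained sketch of that distinction, not the package's implementation. The name soundex-sketch and its #:hw-separates? keyword are hypothetical, only ASCII letters are considered, and edge cases may differ from the package.

```racket
#lang racket/base
(require racket/list) ; for take

(define (ascii-letter? c)
  (or (char<=? #\A c #\Z) (char<=? #\a c #\z)))

;; Standard Soundex digit for an upper-case letter, or #f if uncoded.
(define (letter->digit c)
  (case c
    [(#\B #\F #\P #\V) #\1]
    [(#\C #\G #\J #\K #\Q #\S #\X #\Z) #\2]
    [(#\D #\T) #\3]
    [(#\L) #\4]
    [(#\M #\N) #\5]
    [(#\R) #\6]
    [else #f]))

;; hw-separates? = #f sketches the NARA rule; #t sketches the older rule.
(define (soundex-sketch str #:hw-separates? [hw-separates? #f])
  (define letters
    (map char-upcase (filter ascii-letter? (string->list str))))
  (cond
    [(null? letters) #f]
    [else
     (let loop ([more (cdr letters)]
                [prev (letter->digit (car letters))] ; code of last coded letter
                [digits '()])                        ; collected in reverse
       (cond
         [(or (null? more) (= (length digits) 3))
          ;; Pad with zeros and keep exactly three digits.
          (list->string
           (cons (car letters)
                 (take (append (reverse digits) '(#\0 #\0 #\0)) 3)))]
         [else
          (define c (car more))
          (define d (letter->digit c))
          (cond
            [(not d)
             ;; Vowels and Y always reset prev; H and W reset it only
             ;; under the older rule.
             (if (and (memv c '(#\H #\W)) (not hw-separates?))
                 (loop (cdr more) prev digits)
                 (loop (cdr more) #f digits))]
            [(eqv? d prev) (loop (cdr more) prev digits)]
            [else (loop (cdr more) d (cons d digits))])]))]))
```

With the default setting, (soundex-sketch "Ashcraft") gives "A261", and with #:hw-separates? #t it gives "A226", matching the two results shown above.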
4 Prefixing
Multiple Soundex hashes from a single string can be generated by soundex-nara/prefixing, soundex-old/prefixing, and soundex/p, which consider the string with and without various common surname prefixes.
(soundex-prefix-starts str) → (listof exact-nonnegative-integer?)
  str : string?
Yields a list of Soundex start points in string str, as character index integers, for making hash keys with and without
prefixes. A prefix must be followed by at least two letters, although
they can be interspersed with non-letter characters. The exact behavior
of this function is subject to change in future versions of this library.
> (soundex-prefix-starts "Smith")
(0)
> (soundex-prefix-starts " Jones")
(2)
> (soundex-prefix-starts "vanderlinden")
(0 3 6)
> (soundex-prefix-starts "van der linden")
(0 3 7)
> (soundex-prefix-starts "")
()
> (soundex-prefix-starts "123")
()
> (soundex-prefix-starts "dea")
(0)
> (soundex-prefix-starts "dea ")
(0)
> (soundex-prefix-starts "dean")
(0)
> (soundex-prefix-starts "delasol")
(0 2 3 4)
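One plausible way to turn start points into hash keys is to hash the substring beginning at each start and then drop duplicates. This is an assumption about how the pieces fit together, not a description of the package's implementation. The name keys-from-starts is hypothetical, and hash-fn stands in for any string-to-key procedure such as soundex.

```racket
#lang racket/base
(require racket/list) ; for remove-duplicates

;; Hypothetical helper: hash the tail of str at each start index and keep
;; the first occurrence of each distinct key, in order.
(define (keys-from-starts str starts hash-fn)
  (remove-duplicates
   (for/list ([i (in-list starts)])
     (hash-fn (substring str i)))))
```

For example, with starts (0 2 3 4) for "delasol", the substrings hashed would be "delasol", "lasol", "asol", and "sol".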
(soundex-nara/prefixing str) → (listof string?)
  str : string?
(soundex-old/prefixing str) → (listof string?)
  str : string?
(soundex/p str) → (listof string?)
  str : string?
Yields a list of zero or more Soundex hash keys from string str based on the whole string and the string with various prefixes skipped.
All elements of the list are mutually unique. soundex-nara/prefixing generates NARA hashes, and soundex-old/prefixing generates older-style hashes. soundex/p is an alias for soundex-nara/prefixing.
> (soundex/p "Van Damme")
("V535" "D500")
> (soundex/p "vanvoom")
("V515" "V500")
> (soundex/p "vanvanvan")
("V515")
> (soundex/p "DeLaSol")
("D424" "L240" "A240" "S400")
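To show how multiple keys per name can increase recall, here is a minimal sketch of an index that files each name under every key it produces. The names build-index and lookup are hypothetical, and name->keys is a parameter standing in for a procedure such as soundex/p, so the sketch does not depend on the package being installed.

```racket
#lang racket/base
(require racket/list) ; for remove-duplicates

;; Build an immutable hash from key to the list of names filed under it.
(define (build-index names name->keys)
  (for*/fold ([index (hash)])
             ([name (in-list names)]
              [key (in-list (name->keys name))])
    (hash-update index key (lambda (found) (cons name found)) '())))

;; Return every indexed name that shares at least one key with the query.
(define (lookup index query name->keys)
  (remove-duplicates
   (for*/list ([key (in-list (name->keys query))]
               [name (in-list (hash-ref index key '()))])
     name)))
```

Since "Van Damme" yields both "V535" and "D500" in the example above, it would be filed under both keys, and a query that produces "D500" would find it even without the prefix.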
5 History
Version 3:1 — 2016-03-02
Version 3:0 — 2016-02-26
Version 2:0 — 2012-06-12
Converted to McFly and Overeasy. Added contracts. Changed
references from Scheme to Racket.
Version 0.6 — Version 1:3 — 2009-03-14
Documentation fix.
Version 0.5 — Version 1:2 — 2009-02-24
Ahem.
Version 0.4 — Version 1:1 — 2009-02-24
Removed internal-use-only procedures from documentation.
Version 0.3 — Version 1:0 — 2009-02-24
Licensed under LGPL 3. Converted to author’s new Scheme
administration system. Made test suite executable. Minor
documentation changes.
Version 0.2 — 2004-08-02
Minor documentation change. Version frozen for PLaneT
packaging.
Version 0.1 — 2004-05-10
First release.
6 Legal
Copyright 2004, 2009, 2012, 2016 Neil Van Dyke. This program is Free
Software; you can redistribute it and/or modify it under the terms of the GNU
Lesser General Public License as published by the Free Software Foundation;
either version 3 of the License, or (at your option) any later version. This
program is distributed in the hope that it will be useful, but without any
warranty; without even the implied warranty of merchantability or fitness for a
particular purpose. See http://www.gnu.org/licenses/ for details. For other
licenses and consulting, please contact the author.