Phonetic matching functions

The IBM® Netezza® SQL language supports two phonetic matching functions that you can use to encode names into phonetic representations by using the SoundEx NYSIIS or Double Metaphone algorithms.

By encoding names phonetically, you can match names that are based on their pronunciation and reduce misses that might result from spelling variations. The phonetic matching functions are not case-sensitive comparisons; the phonetic representations are the same for two strings that have the same spelling but different letter casing. These functions support VARCHAR and CHAR data types and specifically the ASCII subset of Latin9 encoding only. The functions ignore any characters outside the ASCII subset.

Note: If you use strings that call the phonetic functions with characters beyond the ASCII range, transliterate the strings to convert accented characters to their ASCII unaccented versions.

SoundEx NYSIIS

SQL syntax:
<varchar(6) value> = nysiis(<str_expr>)

Soundex is a well-known phonetic algorithm for indexing names by sound as pronounced in English. This function converts a string into its Soundex representation by using the New York State Identification and Intelligence System (NYSIIS) variation of Soundex. The return value is a string of up to six characters that identifies the pronunciation of the input string. For example, the function nysiis('Washington') returns the string ‘wasang’, while the function nysiis('brown') returns the value ‘bran’.

Double Metaphone

SQL syntax:
<int4 value> = dbl_mp(<str_expr>)

Double Metaphone is another phonetic algorithm for indexing strings by their pronunciation. Similar to Soundex, it uses a different set of rules for English and alternate pronunciation. The function returns two 4-character string encodings, a primary key and secondary (or alternate) key, for pronunciation of the input string. Similar sounding words share keys, though they might be of variable length and spelling. For example, the double metaphone primary and secondary keys for the name ‘washington’ are 'AXNK' and 'FXNK'.

For improved performance, the dbl_mp function maps the 4-character keys to 16-bit numbers and returns a composite 32-bit value (Netezza type int4) that holds both the 16-bit primary and secondary keys. So, the function dbl_mp('washington') returns the value 781598358.

There are three helper functions (pri_mp, sec_mp, and score_mp) that you can use to extract the primary and secondary keys as strings from the return value, and to run key comparisons for scoring relevance.

Primary Metaphone

SQL syntax:
<varchar(4) value> = pri_mp(<int4 dbl_mp return value>)

This helper function takes the value that is returned by a call to the dbl_mp function and returns the corresponding four-character primary metaphone string. For example, pri_mp(781598358) returns the primary key AXNK.

Secondary Metaphone

SQL syntax:
<varchar(4) value> = sec_mp(<int4 dbl_mp return value>)

This helper function takes the value that is returned by a call to the dbl_mp function and returns the corresponding four-character secondary metaphone string. For example, sec_mp(781598358) returns the secondary key FXNK.

Score Metaphones

SQL syntax:
<varchar(4) value> = score_mp(<int4 dbl_mp value 1>, <int4 dbl_mp 
value 1>, <int4 strong match value>, <int4 normal match value>, 
<int4 minor match value>, <int4 no match value>)
This helper function takes two values that are returned by the dbl_mp function and compares them to determine how closely they match. The last four arguments are the values to return for the four possible scoring or matching outcomes. With double-metaphone encoding, the two input values are said to be matches when one or more of their primary or secondary keys match; the strength of the match depends on which keys match. For example, the following list shows how to evaluate the strength of the match:
Strongest Match
Primary Key (1) = Primary Key (2)
Normal Match
Secondary Key (1) = Primary Key (2)

Primary Key (1) = Secondary Key (2)

Minimal Match
Secondary Key (1) = Secondary Key (2)

For the four match value arguments, you can specify values such as 1, 2, 3, and 4 (for strongest, normal, minimal, or no match). You can also use weighted values such as 100, 50, 25, and 0 to return more points for better match results.

For example, if you compare the double metaphone encodings for ‘washington’ (781598358) and ‘wachingten’ (7815963100), you can use the following score_mp function to determine how closely they match:
score_mp(781598358,781596310,1,2,3,4)

The function returns the value 1, which indicates a strongest match.

If you compare the encodings for ‘washington’ and ‘vachingten’ (1050031766):
score_mp(781598358,1050031766,100,50,25,0) 

The function returns the value 50, which indicates a normal match.