Text functions: Part 1
Overview
Text functions manipulate Text values and return other Text values.
Matching patterns
Matching patterns are used in certain text functions that match and/or extract portions of a string. The matching patterns are similar to SQL, and may contain the following specially-interpreted characters:
A matching pattern containing "_" will match any character in that position. For example, "_on" will match both "Jon" and "Don".
A matching pattern containing "%" will match any sequence of zero or more characters in that position. For example, "%cost%" will match any string containing the substring "cost".
A matching pattern containing "@" will cause a following "%" or "_" to be interpreted as a literal character instead of a matching pattern. Use this to match special characters. For example "%@%%" will match any string containing a "%" character.
The Substitute function does not use matching patterns; it replaces text on a character-by-character basis.
Asc
Returns an integer value that is a valid Unicode code point.
Syntax
Asc( text )
The required argument text may be longer than a single character, but only the value of the first character is returned, (or 0 if text is empty).
Example
Asc("€") returns 8364.
Asc("A") returns 65.
Asc("10") returns 49.
Asc("13") returns 49.
if (ASC(substring(X,1,1)) = 13) then "CR"
elseif (ASC(substring(X,1,1)) = 10) then "LF"
else substring(x,1,1)
endif
The sample above detects non-printing carriage return and linefeed characters.
AsText
Returns value as a text string.
Syntax
AsText( value )
The required argument value may be of any type except Binary. The length of the resulting string is dependent on the data type of value.
If value is of type Document, the returned text is a JSON Unicode representation of value.
Example
AsText(100)
returns "100" as a text string.AsText(CurrentDate())
returns (for example) "22 Mar 2007".
Chr (integer constant)
Returns a single-character string containing the character associated with the specified Unicode point value.
Syntax
Chr( integer_constant )
The required argument integer_constant must be a literal Integer representing a valid Unicode code point of 9, 10, 13, and anything within the range 32 – 1,114,111.
Remarks
This function will only create characters that are valid in XML. Characters outside this range will result in an error value.
Examples
Chr( 65 )
returns "A".Chr( 65 + 32 )
returns "a".
ChrEx
Returns a single-character string containing the character associated with the specified Unicode point value.
Syntax
Chr( integer_constant )
The required argument integer_constant must be a literal Integer representing a valid Unicode code point.
Remarks
The argument integer_constant may be any integer, but 0 will result in a null value and 1 will result in an error code. All other values will be interpreted as code points, which may not work correctly in a given context if not within the range of valid XML characters.
Example
Chr( 12 )
returns a form-feed character.
ASCII table
The following table shows decimal code values for the standard printable ASCII character set and commonly used control characters.
Decimal | Character | Decimal | Character | Decimal | Character |
---|---|---|---|---|---|
9 | horizontal tab | 60 | < | 94 | ^ |
10 | line feed, new line | 61 | = | 95 | _ |
11 | vertical tab | 62 | > | 96 | ` |
12 | form feed, new page | 63 | ? | 97 | a |
13 | carriage return | 64 | 96 | 98 | b |
27 | escape | 65 | A | 99 | c |
32 | space | 66 | B | 100 | d |
33 | ! | 67 | C | 101 | e |
34 |
| 68 | D | 102 | f |
35 | # | 69 | E | 103 | g |
36 | $ | 70 | F | 104 | h |
37 | % | 71 | G | 105 | i |
38 | & | 72 | H | 106 | j |
39 |
| 73 | I | 107 | k |
40 | ( | 74 | J | 108 | l |
41 | ) | 75 | K | 109 | m |
42 | * | 76 | L | 110 | n |
43 | + | 77 | M | 111 | o |
44 | , | 78 | N | 112 | p |
45 | - | 79 | O | 113 | q |
46 | . | 80 | P | 114 | r |
47 | / | 81 | Q | 115 | s |
48 | 0 | 82 | R | 116 | t |
49 | 1 | 83 | S | 117 | u |
50 | 2 | 84 | T | 118 | v |
51 | 3 | 85 | U | 119 | w |
52 | 4 | 86 | V | 120 | x |
53 | 5 | 87 | W | 121 | y |
54 | 6 | 88 | X | 122 | z |
55 | 7 | 89 | Y | 123 | { |
56 | 8 | 90 | Z | 124 | | |
57 | 9 | 91 | [ | 125 | } |
58 | : | 92 | \ | 126 | ~ |
59 | ; | 93 | ] | 127 | DEL |
ConcatenateRecord
Formats all fields of a record into a single text field, with input fields delimited by the specified character.
Syntax
ConcatenateRecord( text )
The required argument text must be a single character of type text.
Examples
ConcatenateRecord( "," )
might return "LESTER VANDERGRIFT,HERITAGE WOODS,642 MAIN ST,AGAWAM,MA,01001.".
CountSymbols
Counts the number of "symbols" in value, where symbols are defined as consecutive sequences of letters or numbers, or single punctuation marks. The result returned is an Integer.
Syntax
CountSymbols( value )
The required argument value may be of any type.
Examples
CountSymbols( "The quick brown fox!" )
returns 5.CountSymbols( "fox, cat, dog, cow" )
returns 7.
CountWords
Counts the number of "words" in value, where words are defined as consecutive sequences of letters or numbers. The result returned is an Integer.
Syntax
CountWords( value )
The required argument value may be of any type.
Example
CountWords( "The quick, brown fox." )
returns 4.
DoubleMetaphone1
Returns the first result of the "double metaphone" phonetic matching algorithm.
Syntax
DoubleMetaphone1( value )
The required argument value may be of any type.
Remarks
The Double Metaphone algorithm is used to generate match keys or lookup keys from input text. It is often used in searching and matching algorithms as a "fuzzy index" key. See http://en.wikipedia.org/wiki/Double_Metaphone for details.
Both DoubleMetaphone1 and DoubleMetaphone2 return long keys, while some metaphone implementations restrict the key to four characters. If you require a short key, use the SubString function: Substring(DoubleMetaphone1(VALUE), 1, 4)
Alternatively, assign the result to a length-4 field.
We do not recommend using Metaphone for record matching. Use our matching macros and tools instead.
Example
DoubleMetaphone1( "YANKELOVICH" )
returns ANKLFX.
DoubleMetaphone2
Returns the second result of the "double metaphone" phonetic matching algorithm.
Syntax
DoubleMetaphone2( value )
The required argument value may be of any type.
Remarks
The Double Metaphone algorithm is used to generate match keys or lookup keys from input text. It is often used in searching and matching algorithms as a "fuzzy index" key. See http://en.wikipedia.org/wiki/Double_Metaphone for details.
Both DoubleMetaphone1 and DoubleMetaphone2 return long keys, while some metaphone implementations restrict the key to four characters. If you require a short key, use the SubString function: Substring(DoubleMetaphone1(VALUE), 1, 4)
Alternatively, assign the result to a length-4 field.
We do not recommend using Metaphone for record matching. Use our matching macros and tools instead.
Example
DoubleMetaphone1( "YANKELOVICH" )
returns ANKLFK.
Metaphone limitations
We do not recommend the use of Metaphone (or similar phonetic codes) for record matching. Though some legacy systems use Metaphone or other match keys for record matching, we do not recommend this approach due to its poor quality. For example, consider matching the following names:
JON FRANK
JON RANK
J BOB FRANK
JONATHON "BUZZ LIGHTYEAR" FRANK
MR JON E FRANK JR ESQ, VP of SALES
FRANK JON
These names are likely to be the same entity. But phonetic match keys often fail when confronted with these kinds of data issues, which may include:
Misspellings
Initials vs. full names
Missing name components
Diminutive vs. formal names
Nicknames and other garbage present
Honorifics and titles present
Reversed name components
Data Management's (DMs) Name Parser and other cleanup logic can help, but it is very difficult to counter many of these issues. Nickname and diminutive name standardization helps, but also fails on easy cases like these:
SAM SMITH
(is this SAMANTHA or SAMUEL?)SAMUEL SMITH
SAMANTHA SMITH
PAT JONES
(is this PATRICIA or PATRICK?)PATRICIA JONES
PATRICK JONES
Similar issues occur with company names, for example:
INTERNATIONAL BUSINESS MACHINES
IBM
I.B.M.
I B M
INTL BUS MACH
INTL BUSINESS MACH, MEMPHIS SALES OFFICE
And similar problems are also found in addresses, especially those that fail to standardize for various reasons:
1200 RFK PARKWAY
1200 RFK PARKWAY #100
120 RFK PARKWAY
1200 ROBERT F KENNEDY PARKWAY
1200 STATE HIGHWAY 46
These issues include:
Presence/absence of unit designator
Typo on address number
Acronym vs. long form of street name
Street name alias
This is why Data Management uses record comparison within a one or more match segments, instead of match keys.
EditDistance
Computes a score a text value measuring the similarity of two text values.
Syntax
EditDistance( value1, value2 )
The required arguments value1 and value 2 may be of any type.
Remarks
Returns a score in the range 0 to 100 measuring the similarity of value1 and value2. A score of zero means the two values are unrelated. A score of 100 means the two values are identical. Case is ignored in the comparison.
This function first compares the two strings and counts the number of errors in the comparison. Errors include:
Substitutions ("John" vs. "Johx")
Transpositions ("John" vs. "Jonh")
Deletions ("John" vs. "Jon") which count as two errors
The final score is then calculated as follows:
100 – error count * 100 / length of longest value
Example
EditDistance( "John Smith", "John Smyth" )
returns an EditDistance of 90.