Invariant Character Handling
Note: This document was imported from the previous NCIP website hosted at the Colorado Department of Education.
Guideline: Implementers may assume that all characters in NCIP messages are invariant characters, except the character content of elements whose fixed "datatype" attribute is "string".
Definitions:
- Characters are defined by XML 1.0 (Second Edition).
- Invariant characters are those characters included in the "ISO/IEC 646 Basic Character Set", or ISO-IR 170 (see note 1).
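The definition of invariant characters can be sketched as a simple membership test. This is a minimal illustration, not part of the NCIP specification: the names INVARIANT_GRAPHIC and is_invariant are hypothetical, and the punctuation list is written out by hand from the 82 graphic characters that note 1 describes. The space character is accepted because the argument below treats it as an invariant character.

```python
import string

# Punctuation shared by all versions of ISO/IEC 646. The ASCII characters
# that vary between versions (such as # $ @ [ \ ] ^ ` { | } ~) are absent.
INVARIANT_PUNCTUATION = "!\"%&'()*+,-./:;<=>?_"

# 52 letters + 10 digits + 20 punctuation marks = 82 graphic characters.
INVARIANT_GRAPHIC = frozenset(
    string.ascii_letters + string.digits + INVARIANT_PUNCTUATION
)


def is_invariant(text):
    """Return True if every character is an invariant graphic character or a space."""
    return all(ch == " " or ch in INVARIANT_GRAPHIC for ch in text)
```

Under these assumptions, a dateTime value such as "2004-01-31T12:00:00Z" passes the test, while any text containing "#", a tab, or an accented letter does not.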
Here is the argument in support of this Guideline (all section citations are to Imp-1; items without section citations were determined by inspection of the NCIP DTD):
- An NCIP message must be valid XML that conforms to the NCIP DTD (section 5.1)
- The NCIP DTD defines the content of an NCIP message as a series of elements, each of either complex, simple or EMPTY type.
- Elements of EMPTY type have no content.
- Elements of complex type are composed of other elements, and so do not require further consideration.
- Elements of simple type all have a FIXED attribute of "datatype".
- Each of these datatypes is restricted in its lexical representation by section 5.3. Taking each in turn:
  - dateTime - uses the following (comma-delimited list of) characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -, +, :, T, Z.
  - integer - uses the following (comma-delimited list of) characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -, +.
  - nonNegativeInteger - uses the following (comma-delimited list of) characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +.
  - positiveInteger - uses the following (comma-delimited list of) characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +.
  - string - any character in UCS-2 (per the definition of "string" in 5.3 and with reference to 5.2).
- The union of all the non-string datatype characters is {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -, +, :, T, Z}. These are all members of the invariant character set. (This is determined by inspection of ISO-IR 170.)
- Entity references are permitted (section 5.2) as substitutes for the following characters: &, <, >, ', ". Since none of those appear in the union of non-string datatype characters, we needn't further consider them.
- Besides the space character (which is an invariant character), the white space characters permitted in an NCIP message (other than within elements of "string" datatype) are tab, carriage return and newline. None of these is an invariant character, but they may appear in an NCIP message only where the XML parser (and the NCIP application) may ignore them.
- The content of the markup (the "tags") in the NCIP DTD can be element names, attribute names, and the values of the attributes, plus white space, quote (') and double-quote (").
  - Element and attribute names in the NCIP DTD are all composed of invariant characters.
  - The values permitted for attributes in the NCIP DTD are all composed of invariant characters.
  - The other characters permitted in XML tags ("<", ">", quote and double-quote) are invariant characters.
- The only other part of the XML document is the XML Prolog, which is precisely defined (section 6.2) and (happily!) uses only invariant characters (except for white space characters).
- Conclusion: An implementation should expect both invariant and non-invariant characters in string elements; in all other elements an implementation should expect only invariant characters.
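The argument above can be sketched in code: build the alphabet of each non-string datatype, confirm that their union lies inside the invariant set, and then check element content against the alphabet for its datatype. This is only a sketch under stated assumptions. The table ELEMENT_DATATYPE and the element names in it are hypothetical placeholders; a real implementation would take the datatype of each element from the NCIP DTD, since a FIXED attribute normally does not appear in the message itself. Leading and trailing white space is stripped as a simplification.

```python
import string
import xml.etree.ElementTree as ET

# Invariant characters: 82 graphic characters plus the space.
INVARIANT = frozenset(
    string.ascii_letters + string.digits + " !\"%&'()*+,-./:;<=>?_"
)

# Alphabets of the non-string datatypes, as listed in the argument above.
DATATYPE_ALPHABET = {
    "dateTime": frozenset("0123456789-+:TZ"),
    "integer": frozenset("0123456789-+"),
    "nonNegativeInteger": frozenset("0123456789+"),
    "positiveInteger": frozenset("0123456789+"),
}

# The union of the non-string alphabets is a subset of the invariant set.
NON_STRING_UNION = frozenset().union(*DATATYPE_ALPHABET.values())
assert NON_STRING_UNION <= INVARIANT

# HYPOTHETICAL element-to-datatype table, for illustration only.
ELEMENT_DATATYPE = {
    "DateDue": "dateTime",
    "RenewalCount": "nonNegativeInteger",
    "Title": "string",
}


def unexpected_characters(xml_text):
    """Map each non-string element name to the characters outside its alphabet."""
    problems = {}
    for elem in ET.fromstring(xml_text).iter():
        datatype = ELEMENT_DATATYPE.get(elem.tag)
        if datatype is None or datatype == "string":
            continue  # string content may hold any character; skip it
        extra = set((elem.text or "").strip()) - DATATYPE_ALPHABET[datatype]
        if extra:
            problems.setdefault(elem.tag, set()).update(extra)
    return problems
```

With this table, a message whose non-string elements hold only the listed characters yields an empty result, whatever its string elements contain; a stray letter in a numeric element is reported under that element's name.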
1: The ISO standard for 7-bit character sets is ISO/IEC 646:1991, titled "Information technology - ISO 7-bit coded character set for information interchange" (third edition). It defines 7-bit character sets and how they are registered (see the ISO-IR registry). In the terms of ISO/IEC 646, 7-bit character sets are versions of ISO/IEC 646; for instance, what is commonly called the "ASCII" character set is a version of ISO/IEC 646: ISO-IR 6, the "International Reference Version of ISO/IEC 646:1991". Across all versions of ISO/IEC 646 there are 82 graphic characters in common, and these are known as the "invariant characters of ISO/IEC 646." The "invariant characters" are themselves a version of ISO/IEC 646: ISO-IR 170, the "ISO/IEC 646 Basic Character Set".
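The figure of 82 can be checked with a little arithmetic: a 7-bit set has 94 graphic code positions (0x21 through 0x7E), and 12 of them may be assigned differently from one version of ISO/IEC 646 to another. The sketch below names those 12 positions by their ASCII (ISO-IR 6) characters; the variable names are illustrative only.

```python
# Graphic code positions of a 7-bit set: 0x21 through 0x7E.
GRAPHIC_POSITIONS = range(0x21, 0x7F)

# Positions whose assignment may differ between versions of ISO/IEC 646,
# written as the characters that ASCII (ISO-IR 6) places there.
VARIANT_IN_ASCII = "#$@[\\]^`{|}~"

# What remains is common to every version.
invariant = [chr(cp) for cp in GRAPHIC_POSITIONS if chr(cp) not in VARIANT_IN_ASCII]

print(len(GRAPHIC_POSITIONS), len(VARIANT_IN_ASCII), len(invariant))  # prints "94 12 82"
```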
Contributors:
- John Bodfish
- Mark Wilson
- Stephen Gregory
- Julie Blume Nye
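References:
- The Guide to the use of character sets in Europe was very helpful in gaining a basic understanding of ISO/IEC 646.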
Editor: John Bodfish