(* $Id: netconversion.mli 1084 2007-02-20 12:36:17Z gerd $
* ----------------------------------------------------------------------
*)
(** Conversion between character encodings
*
* {b Contents}
* {ul
* {- {!Netconversion.preliminaries}
* {ul
* {- {!Netconversion.unicode}}
* {- {!Netconversion.subsets}}
* {- {!Netconversion.linking}}
* {- {!Netconversion.domain}}
* {- {!Netconversion.problems}}}}
* {- {!Netconversion.interface}
* {ul
* {- {!Netconversion.direct_conv}}
* {- {!Netconversion.cursors}
* {ul {- {!Netconversion.bom}}}}
* {- {!Netconversion.unicode_functions}}
* }
* }
* }
*)
(** {1:preliminaries Preliminaries}
*
* A {b character set} is a set of characters where every character is
* identified by a {b code point}. An {b encoding} is a way of
* representing characters from a set in byte strings. For example,
* the Unicode character set has more than 96000 characters, and
* the code points have values from 0 to 0x10ffff (not all code points
* are assigned yet). The UTF-8 encoding represents the code points
* by sequences of 1 to 4 bytes. There are also encodings that
* represent code points from several sets, e.g. EUC-JP covers four
* sets.
*
* Encodings are enumerated by the type [encoding], and names follow
* the convention [`Enc_*], e.g. [`Enc_utf8].
* Character sets are enumerated by the type
* [charset], and names follow the convention [`Set_*], e.g.
* [`Set_unicode].
*
* This module deals mainly with encodings. It is important to know
* that the same character set may have several encodings. For example,
* the Unicode character set can be encoded as UTF-8 or UTF-16.
* For the 8-bit character sets, however, there is usually only one
* encoding, e.g. [`Set_iso88591] is always encoded as [`Enc_iso88591].
*
* In a {b single-byte encoding} every code point is represented by
* one byte. This is what many programmers are accustomed to, and
* what the O'Caml language specially supports: A [string] is
* a sequence of [char]s, where [char] means an 8 bit quantity
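* interpreted as a character. For example, the following piece of code
* allocates a buffer of four [char]s, assigns them individually, and
* turns the result into a [string] (strings are immutable in current
* OCaml, so the assignments go through [Bytes]):
*
* {[
* let b = Bytes.create 4 in
* Bytes.set b 0 'G';
* Bytes.set b 1 'e';
* Bytes.set b 2 'r';
* Bytes.set b 3 'd';
* let s = Bytes.to_string b
* ]}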
*
* In a {b multi-byte encoding} there are code points that are represented
* by several bytes. As we still represent such text as [string], the
* problem arises that a single [char], actually a byte, often represents
* only a fraction of a full multi-byte character. There are two solutions:
* - Give up the principle that text is represented by [string].
* This is, for example, the approach chosen by [Camomile], another O'Caml
* library dealing with Unicode. Instead, text is represented as
* [int array]. This way, the algorithms processing the text can
* remain the same.
* - Give up the principle that individual characters can be directly
* accessed in a text. This is the primary way chosen by Ocamlnet.
* This means that it is no longer possible to read
* or write the [n]th character of a text. One can, however, still
* compose texts by just concatenating the strings representing
* individual characters. Furthermore, it is possible to define
* a cursor for a text that moves sequentially along the text.
* The consequence is that programmers are restricted to sequential
* algorithms. Note that the majority of text processing falls into
* this class.
*
* The corresponding piece of code for Ocamlnet's Unicode implementation
* is:
* {[
* let b = Buffer.create 80 in
* Buffer.add_string b (ustring_of_uchar `Enc_utf8 71); (* 71 = code point of 'G' *)
* Buffer.add_string b (ustring_of_uchar `Enc_utf8 101); (* 101 = code point of 'e' *)
* Buffer.add_string b (ustring_of_uchar `Enc_utf8 114); (* 114 = code point of 'r' *)
* Buffer.add_string b (ustring_of_uchar `Enc_utf8 100); (* 100 = code point of 'd' *)
* let s = Buffer.contents b
* ]}
*
* It is important to always remember that a [char] is no longer
* a character but simply a byte. In many of the following explanations,
* we strictly distinguish between {b byte positions} or {b byte counts},
* and {b character positions} or {b character counts}.
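*
* To make the distinction concrete, here is a minimal sketch that uses
* only the standard library. The helper [utf8_char_count] is a
* hypothetical name, not a function of this module, and it assumes that
* its argument is well-formed UTF-8:
*
* {[
* (* Count characters by counting the bytes that are not continuation
*    bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8. *)
* let utf8_char_count s =
*   let n = ref 0 in
*   String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
*   !n
*
* (* The five characters G, r, u-umlaut, sharp s, e encoded as UTF-8 *)
* let s = "Gr\xC3\xBC\xC3\x9Fe"
* let byte_count = String.length s     (* 7 *)
* let char_count = utf8_char_count s   (* 5 *)
* ]}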
*
* There are a number of special effects that usually only occur in
* multi-byte encodings:
*
* - Bad encodings: Not every byte sequence is legal. When scanning
* such text, the functions will raise the exception [Malformed_code]
* when they find illegal bytes.
* - Unassigned code points: It may happen that a byte sequence is
* a correct representation for a code point, but that the code point
* is unassigned in the character set. When scanning, this is also
* covered by the exception [Malformed_code]. When converting from
* one encoding to another, it is also possible that the code point
* is only unassigned in the target character set. This case is
* usually handled by a substitution function [subst], and if no such
* function is defined, by the exception [Cannot_represent].
* - Incomplete characters: The trailing bytes of a string may be the
* correct beginning of a byte sequence for a character, but not a
* complete sequence. Of course, if that string is the end of a
* text, this is just illegal, and also a case for [Malformed_code].
* However, when text is processed chunk by chunk, this phenomenon
* may happen legally for all chunks but the last. For this reason,
* some of the functions below handle this case specially.
* - Byte order marks: Some encodings have both big and little endian
* variants. A byte order mark at the beginning of the text declares
* which variant is actually used. This byte order mark is a
* declaration written like a character, but actually not a
* character.
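*
* The case of incomplete characters can be illustrated for UTF-8 with
* the following sketch. It is not how this module is implemented; the
* name [incomplete_suffix_len] is hypothetical, and only the standard
* library is used. The function reports how many bytes at the end of a
* chunk begin a character that is continued in the next chunk:
*
* {[
* (* Number of bytes at the end of [s] that start a UTF-8 sequence
*    without completing it; 0 if [s] ends on a character boundary. *)
* let incomplete_suffix_len s =
*   let len = String.length s in
*   (* total length of a sequence, judged by its lead byte *)
*   let seq_len c =
*     if c land 0x80 = 0 then 1
*     else if c land 0xE0 = 0xC0 then 2
*     else if c land 0xF0 = 0xE0 then 3
*     else if c land 0xF8 = 0xF0 then 4
*     else 0 (* continuation byte or illegal byte *) in
*   (* walk back over at most 3 bytes to find the last lead byte *)
*   let rec find k =
*     if k > 3 || k > len then 0
*     else
*       let c = Char.code s.[len - k] in
*       if c land 0xC0 = 0x80 then find (k + 1)  (* continuation byte *)
*       else if seq_len c > k then k             (* sequence is cut off *)
*       else 0 in
*   find 1
*
* (* The letter a, followed by 2 of the 3 bytes of the code point 0x20ac *)
* let chunk = "a\xE2\x82"
* let pending = incomplete_suffix_len chunk   (* 2 *)
* ]}
*
* A chunk-wise reader would process all but the last [pending] bytes
* and prepend the rest to the next chunk.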
*
* There is a special class of encodings known as {b ASCII-compatible}.
* They are important because there are lots of programs and protocols
* that only interpret bytes from 0 to 127, and treat the bytes from
* 128 to 255 as data. These programs can process texts as long as
* the bytes from 0 to 127 are used as in ASCII. Fortunately, many
* encodings are ASCII-compatible, including UTF-8.
*
* {2:unicode Unicode}
*
* [Netconversion] is centred around Unicode.
* The conversion from one encoding to another works by finding the
* Unicode code point of the character
* to convert, and by representing the code point in the target encoding,
* even if neither encoding has anything to do with Unicode.
* Of course, this approach requires that all character sets handled
* by [Netconversion] are subsets of Unicode.
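*
* The two steps can be sketched for one simple case, ISO-8859-1 to
* UTF-8, where the byte value of every input character is already its
* Unicode code point. This is an illustration only, not the code used
* by this module: [utf8_of_latin1] is a hypothetical name, and the
* sketch assumes an OCaml version whose standard library provides
* [Buffer.add_utf_8_uchar] (4.06 or later).
*
* {[
* let utf8_of_latin1 s =
*   let b = Buffer.create (String.length s) in
*   String.iter
*     (fun c ->
*        (* Step 1: find the Unicode code point of the character *)
*        let code_point = Char.code c in
*        (* Step 2: represent the code point in the target encoding *)
*        Buffer.add_utf_8_uchar b (Uchar.of_int code_point))
*     s;
*   Buffer.contents b
*
* (* 0xe9 is e-acute in ISO-8859-1; in UTF-8 it becomes 0xc3 0xa9 *)
* let converted = utf8_of_latin1 "caf\xE9"
* ]}
*
* The conversion functions of this module are described in
* {!Netconversion.direct_conv}.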
*
* The supported ranges of Unicode code points are 0 to 0xd7ff, 0xe000 to
* 0xfffd, and 0x10000 to 0x10ffff. All these code points can be represented in
* UTF-8 and UTF-16. [Netconversion] does not know which of the code
* points are assigned and which are not, and because of this, it simply
* allows all code points of the mentioned ranges (for other character
* sets, however, the necessary lookup tables exist).
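*
* Written as a predicate, the ranges look as follows. The name
* [is_supported_code_point] is hypothetical and only serves as an
* illustration; it is not part of this module.
*
* {[
* let is_supported_code_point p =
*   (p >= 0 && p <= 0xd7ff)
*   || (p >= 0xe000 && p <= 0xfffd)
*   || (p >= 0x10000 && p <= 0x10ffff)
* ]}
*
* The excluded values are the surrogate range 0xd800 to 0xdfff, the two
* values 0xfffe and 0xffff, and everything above 0x10ffff.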
*
* {b UTF-8:} The UTF-8 representation can have one to four bytes. Malformed
* byte sequences are always rejected, even those that try to cheat the
* reader, such as the overlong sequence "0xc0 0x80" for the code point 0.
* There is special support
* for the Java variant of UTF-8 ([`Enc_java]). UTF-8 strings must not
* have a byte order mark (it would be interpreted as a "zero-width
* no-break space" character).
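*
* Why "0xc0 0x80" has to be rejected can be seen from a sketch of a
* decoder for two-byte sequences. The function [decode_utf8_2] is a
* hypothetical helper written for this explanation; it is not the
* decoder of this module.
*
* {[
* (* Decode the two-byte UTF-8 sequence [b0 b1] (byte values as ints).
*    Returns None if the sequence is malformed or overlong. *)
* let decode_utf8_2 b0 b1 =
*   if b0 land 0xE0 <> 0xC0 || b1 land 0xC0 <> 0x80 then None
*   else
*     let p = ((b0 land 0x1F) lsl 6) lor (b1 land 0x3F) in
*     if p < 0x80 then None  (* overlong: one byte would have sufficed *)
*     else Some p
*
* let cheat = decode_utf8_2 0xc0 0x80   (* None *)
* let legal = decode_utf8_2 0xc3 0xa9   (* Some 0xe9 *)
* ]}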
*
* {b UTF-16:} When reading from a string encoded as [`Enc_utf16], a byte
* order mark is expected at the beginning. The detected variant
* ([`Enc_utf16_le] or [`Enc_utf16_be]) is usually returned by the parsing
* function. The byte order mark is not included in the output string.
* Some functions of this
* module cannot cope with [`Enc_utf16] (i.e. UTF-16 without endianness
* annotation), and will fail.
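*
* How a byte order mark selects the variant can be sketched like this.
* The function [strip_utf16_bom] is hypothetical and uses only the
* standard library; the tags [`Enc_utf16_be] and [`Enc_utf16_le] are
* used here as plain polymorphic variants.
*
* {[
* (* Split off a UTF-16 byte order mark. Returns the detected variant
*    and the remaining bytes, or None if [s] does not start with one. *)
* let strip_utf16_bom s =
*   if String.length s < 2 then None
*   else
*     let rest = String.sub s 2 (String.length s - 2) in
*     match Char.code s.[0], Char.code s.[1] with
*     | 0xFE, 0xFF -> Some (`Enc_utf16_be, rest)
*     | 0xFF, 0xFE -> Some (`Enc_utf16_le, rest)
*     | _ -> None
* ]}
*
* The mark is the code point 0xfeff: its bytes appear as 0xfe 0xff in
* big endian order and as 0xff 0xfe in little endian order.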
*
* Once the endianness is determined, the code point 0xfeff is no longer
* interpreted as a byte order mark, but as "zero-width no-break space".
*
* Some code points are represented by pairs of 16-bit values; these
* are the so-called "surrogate pairs". They can only occur in UTF-16.
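*
* The pair for a code point is computed as in the following sketch
* ([surrogate_pair] is a hypothetical name, not a function of this
* module). Only code points from 0x10000 to 0x10ffff need a pair:
*
* {[
* (* Returns the (high, low) surrogate values for the code point [p]. *)
* let surrogate_pair p =
*   if p < 0x10000 || p > 0x10ffff then
*     invalid_arg "surrogate_pair: code point fits into one 16-bit value";
*   let v = p - 0x10000 in                  (* a 20-bit number *)
*   (0xd800 lor (v lsr 10),                 (* upper 10 bits *)
*    0xdc00 lor (v land 0x3ff))             (* lower 10 bits *)
*
* let pair = surrogate_pair 0x1f600   (* (0xd83d, 0xde00) *)
* ]}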
*
* {2:subsets Subsets of Unicode}
*
* The non-Unicode character sets are subsets of Unicode. Here, it may
* happen that a Unicode code point does not have a corresponding
* code point in the subset. In this case, certain rules are applied to handle
* this (see below). It is, however, ensured that every non-Unicode
* code point has a corresponding Unicode code point. (In other words,
* character