File lib/ocaml/pkg-lib/netstring/netconversion.mli GODI Package godi-ocamlnet
Library netstring
 
(* $Id: netconversion.mli 1084 2007-02-20 12:36:17Z gerd $
 * ----------------------------------------------------------------------
 *)

(** Conversion between character encodings 
 *
 * {b Contents}
 * {ul
 *   {- {!Netconversion.preliminaries}
 *     {ul
 *       {- {!Netconversion.unicode}}
 *       {- {!Netconversion.subsets}}
 *       {- {!Netconversion.linking}}
 *       {- {!Netconversion.domain}}
 *       {- {!Netconversion.problems}}}}
 *   {- {!Netconversion.interface}
 *     {ul
 *       {- {!Netconversion.direct_conv}}
 *       {- {!Netconversion.cursors}
 *           {ul {- {!Netconversion.bom}}}}
 *       {- {!Netconversion.unicode_functions}}
 *     }
 *   }
 * }
 *)


(** {1:preliminaries Preliminaries}
 *
 * A {b character set} is a set of characters where every character is
 * identified by a {b code point}. An {b encoding} is a way of 
 * representing characters from a set in byte strings. For example,
 * the Unicode character set has more than 96000 characters, and
 * the code points have values from 0 to 0x10ffff (not all code points
 * are assigned yet). The UTF-8 encoding represents the code points
 * by sequences of 1 to 4 bytes. There are also encodings that 
 * represent code points from several sets, e.g. EUC-JP covers four
 * sets.
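 *
 * As a rough illustration of such a variable-length encoding, the following
 * sketch encodes a single Unicode code point as UTF-8 with the standard
 * library only. The function [utf8_of_code_point] is a hypothetical helper
 * written for this explanation; it is not part of [Netconversion], whose
 * own entry point for this task is [ustring_of_uchar].
 *
 * {[
 * (* Hypothetical helper: UTF-8 encoding of one code point (1 to 4 bytes).
 *    Surrogates and values above 0x10ffff are rejected. *)
 * let utf8_of_code_point cp =
 *   let b = Buffer.create 4 in
 *   let add i = Buffer.add_char b (Char.chr i) in
 *   if cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff) then
 *     invalid_arg "utf8_of_code_point"
 *   else if cp < 0x80 then
 *     add cp                                   (* 1 byte: plain ASCII *)
 *   else if cp < 0x800 then begin
 *     add (0xc0 lor (cp lsr 6));               (* 2 bytes *)
 *     add (0x80 lor (cp land 0x3f))
 *   end else if cp < 0x10000 then begin
 *     add (0xe0 lor (cp lsr 12));              (* 3 bytes *)
 *     add (0x80 lor ((cp lsr 6) land 0x3f));
 *     add (0x80 lor (cp land 0x3f))
 *   end else begin
 *     add (0xf0 lor (cp lsr 18));              (* 4 bytes *)
 *     add (0x80 lor ((cp lsr 12) land 0x3f));
 *     add (0x80 lor ((cp lsr 6) land 0x3f));
 *     add (0x80 lor (cp land 0x3f))
 *   end;
 *   Buffer.contents b
 * ]}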
 *
 * Encodings are enumerated by the type [encoding], and names follow
 * the convention [`Enc_*], e.g. [`Enc_utf8]. 
 * Character sets are enumerated by the type
 * [charset], and names follow the convention [`Set_*], e.g.
 * [`Set_unicode].
 *
 * This module deals mainly with encodings. It is important to know
 * that the same character set may have several encodings. For example,
 * the Unicode character set can be encoded as UTF-8 or UTF-16.
 * For the 8 bit character sets, however, there is usually only one
 * encoding, e.g. [`Set_iso88591] is always encoded as [`Enc_iso88591].
 *
 * In a {b single-byte encoding} every code point is represented by
 * one byte. This is what many programmers are accustomed to, and
 * what the O'Caml language specially supports: A [string] is
 * a sequence of [char]s, where [char] means an 8 bit quantity
 * interpreted as a character. For example, the following piece of code allocates
 * a [string] of four [char]s, and assigns them individually:
 *
 * {[
 * let s = String.create 4 in
 * s.[0] <- 'G';
 * s.[1] <- 'e';
 * s.[2] <- 'r';
 * s.[3] <- 'd';
 * ]}
 * 
 * In a {b multi-byte encoding} there are code points that are represented
 * by several bytes. As we still represent such text as [string], the
 * problem arises that a single [char], actually a byte, often represents 
 * only a fraction of a full multi-byte character. There are two solutions:
 * - Give up the principle that text is represented by [string].
 *   This is, for example, the approach chosen by [Camomile], another O'Caml
 *   library dealing with Unicode. Instead, text is represented as
 *   [int array]. This way, the algorithms processing the text can
 *   remain the same.
 * - Give up the principle that individual characters can be directly
 *   accessed in a text. This is the primary way chosen by Ocamlnet.
 *   This means that it is no longer possible to read
 *   or write the [n]th character of a text. One can, however, still 
 *   compose texts by just concatenating the strings representing
 *   individual characters. Furthermore, it is possible to define
 *   a cursor for a text that moves sequentially along the text.
 *   The consequence is that programmers are restricted to sequential
 *   algorithms. Note that the majority of text processing falls into
 *   this class.
 *
 * The corresponding piece of code for Ocamlnet's Unicode implementation
 * is:
 * {[
 * let b = Buffer.create 80 in
 * Buffer.add_string b (ustring_of_uchar `Enc_utf8 71);  (* 71 = code point of 'G' *)
 * Buffer.add_string b (ustring_of_uchar `Enc_utf8 101); (* 101 = code point of 'e' *)
 * Buffer.add_string b (ustring_of_uchar `Enc_utf8 114); (* 114 = code point of 'r' *)
 * Buffer.add_string b (ustring_of_uchar `Enc_utf8 100); (* 100 = code point of 'd' *)
 * let s = Buffer.contents b
 * ]}
 *
 * It is important to always remember that a [char] is no longer 
 * a character but simply a byte. In many of the following explanations,
 * we strictly distinguish between {b byte positions} or {b byte counts},
 * and {b character positions} or {b character counts}.
 *
 * There are a number of special effects that usually only occur in
 * multi-byte encodings:
 *
 * - Bad encodings: Not every byte sequence is legal. When scanning
 *   such text, the functions will raise the exception [Malformed_code]
 *   when they find illegal bytes.
 * - Unassigned code points: It may happen that a byte sequence is
 *   a correct representation for a code point, but that the code point
 *   is unassigned in the character set. When scanning, this is also
 *   covered by the exception [Malformed_code]. When converting from
 *   one encoding to another, it is also possible that the code point
 *   is only unassigned in the target character set. This case is
 *   usually handled by a substitution function [subst], and if no such
 *   function is defined, by the exception [Cannot_represent].
 * - Incomplete characters: The trailing bytes of a string may be the
 *   correct beginning of a byte sequence for a character, but not a
 *   complete sequence. Of course, if that string is the end of a
 *   text, this is just illegal, and also a case for [Malformed_code].
 *   However, when text is processed chunk by chunk, this phenomenon
 *   may happen legally for all chunks but the last. For this reason,
 *   some of the functions below handle this case specially.
 * - Byte order marks: Some encodings have both big and little endian
 *   variants. A byte order mark at the beginning of the text declares
 *   which variant is actually used. This byte order mark is a 
 *   declaration written like a character, but it is not actually a
 *   character.
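 *
 * The incomplete-character case can be sketched for UTF-8 with the
 * standard library alone. The helper names below ([utf8_seq_length],
 * [is_continuation], [incomplete_tail]) are hypothetical and only serve
 * this explanation; they are not how [Netconversion] is implemented.
 * The sketch only inspects the last few bytes of a chunk and does not
 * validate the rest of the text.
 *
 * {[
 * (* Expected total length of a UTF-8 sequence, judged from its first
 *    byte; 0 means that the byte cannot start a sequence. *)
 * let utf8_seq_length c =
 *   let x = Char.code c in
 *   if x < 0x80 then 1
 *   else if x >= 0xc2 && x <= 0xdf then 2
 *   else if x >= 0xe0 && x <= 0xef then 3
 *   else if x >= 0xf0 && x <= 0xf4 then 4
 *   else 0
 *
 * (* Continuation bytes have the bit pattern 10xxxxxx. *)
 * let is_continuation c = Char.code c land 0xc0 = 0x80
 *
 * (* Number of trailing bytes of [chunk] that begin a character whose
 *    remaining bytes have not arrived yet (0 if the chunk ends cleanly
 *    or the tail is not a valid beginning at all). *)
 * let incomplete_tail chunk =
 *   let len = String.length chunk in
 *   let rec find i steps =
 *     if i < 0 || steps > 3 then 0
 *     else if is_continuation chunk.[i] then find (i - 1) (steps + 1)
 *     else
 *       let need = utf8_seq_length chunk.[i] in
 *       if need > len - i then len - i else 0
 *   in
 *   find (len - 1) 0
 * ]}
 *
 * A chunk-wise reader could hold back the last [incomplete_tail chunk]
 * bytes and prepend them to the next chunk. Only when the final chunk
 * still has such a tail is the text really malformed.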
 *
 * There is a special class of encodings known as {b ASCII-compatible}.
 * They are important because there are lots of programs and protocols
 * that only interpret bytes from 0 to 127, and treat the bytes from
 * 128 to 255 as data. These programs can process texts as long as
 * the bytes from 0 to 127 are used as in ASCII. Fortunately, many
 * encodings are ASCII-compatible, including UTF-8.
 *
 * {2:unicode Unicode}
 *
 * [Netconversion] is centred around Unicode.
 * The conversion from one encoding to another works by finding the
 * Unicode code point of the character
 * to convert, and by representing the code point in the target encoding,
 * even if neither encoding has anything to do with Unicode.
 * Of course, this approach requires that all character sets handled
 * by [Netconversion] are subsets of Unicode.
 *
 * The supported ranges of Unicode code points are 0 to 0xd7ff, 0xe000 to
 * 0xfffd, and 0x10000 to 0x10ffff. All these code points can be represented
 * in UTF-8 and UTF-16. [Netconversion] does not know which of the code
 * points are assigned and which are not, and because of this, it simply
 * allows all code points of the mentioned ranges (but for other character
 * sets, the necessary lookup tables exist).
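 *
 * Written as a predicate, the ranges listed above look as follows. The
 * name [is_supported_code_point] is hypothetical and chosen for this
 * illustration only; the module does not necessarily export such a
 * function.
 *
 * {[
 * (* True for exactly the code points in the three ranges given above.
 *    The surrogate area 0xd800 to 0xdfff and the values 0xfffe and
 *    0xffff are excluded. *)
 * let is_supported_code_point cp =
 *   (cp >= 0 && cp <= 0xd7ff)
 *   || (cp >= 0xe000 && cp <= 0xfffd)
 *   || (cp >= 0x10000 && cp <= 0x10ffff)
 * ]}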
 *
 * {b UTF-8:} The UTF-8 representation can have one to four bytes. Malformed 
 *   byte sequences are always rejected, even those that try to deceive the
 *   reader, such as the overlong form "0xc0 0x80" for the code point 0.
 *   There is special support for the Java variant of UTF-8 ([`Enc_java]).
 *   UTF-8 strings must not have a byte order mark (it would be interpreted
 *   as a "zero-width non-breakable space" character).
 *
 * {b UTF-16:} When reading from a string encoded as [`Enc_utf16], a byte
 *   order mark is expected at the beginning. The detected variant 
 *   ([`Enc_utf16_le] or [`Enc_utf16_be]) is usually returned by the parsing
 *   function. The byte order mark is not included in the output string.
 *   Some functions of this module cannot cope with [`Enc_utf16] (i.e.
 *   UTF-16 without endianness annotation), and will fail.
 *
 *   Once the endianness is determined, the code point 0xfeff is no longer
 *   interpreted as a byte order mark, but as "zero-width non-breakable space".
 *
 *   Some code points are represented by pairs of 16 bit values; these
 *   are the so-called "surrogate pairs". They can only occur in UTF-16.
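 *
 *   The arithmetic behind a surrogate pair is short enough to sketch.
 *   The function [surrogate_pair] is a hypothetical helper for this
 *   explanation and not part of the interface below. It applies to the
 *   code points from 0x10000 to 0x10ffff.
 *
 * {[
 * (* Split a code point above 0xffff into its two 16 bit UTF-16 values:
 *    the high surrogate carries the upper 10 bits of [cp - 0x10000],
 *    the low surrogate the lower 10 bits. *)
 * let surrogate_pair cp =
 *   if cp < 0x10000 || cp > 0x10ffff then invalid_arg "surrogate_pair";
 *   let v = cp - 0x10000 in
 *   (0xd800 lor (v lsr 10), 0xdc00 lor (v land 0x3ff))
 * ]}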
 *
 * {2:subsets Subsets of Unicode}
 *
 * The non-Unicode character sets are subsets of Unicode. Here, it may
 * happen that a Unicode code point does not have a corresponding 
 * code point in the subset. In this case, certain rules are applied to handle
 * this (see below). It is, however, ensured that every non-Unicode
 * code point has a corresponding Unicode code point. (In other words,
 * character