
Unicode Struct Reference


Public Types

enum  {
  Unknown, UTF_8, UTF_8N, UTF_16,
  UTF_16BE, UTF_16LE, UTF_32, UTF_32BE,
  UTF_32LE
}

Public Member Functions

void dthis (int size=0)
uint type ()
void[] convert (void[] src, uint srcType, uint dstType)
struct Into (T)
struct From (T)
 alias Into!(char) IntoUtf8;

Static Public Member Functions

static bool isValid (int encoding)

Private Member Functions

struct Into (T)
struct From (T)
void[] update (void[] t)

Static Private Member Functions

static void error (char[] msg)
static char[] toUtf8 (wchar[] input, char[] output=null, uint *ate=null)
static wchar[] toUtf16 (char[] input, wchar[] output=null, uint *ate=null)
static char[] toUtf8 (dchar[] input, char[] output=null, uint *ate=null)
static dchar[] toUtf32 (char[] input, dchar[] output=null, uint *ate=null)
static wchar[] toUtf16 (dchar[] input, wchar[] output=null, uint *ate=null)
static dchar[] toUtf32 (wchar[] input, dchar[] output=null, uint *ate=null)

Private Attributes

uint _type = Type.Utf16
void[] tmp

Detailed Description

Fast Unicode transcoders. These are particularly sensitive to minor changes on 32-bit x86 devices, because the register set of those devices is so small. Beware of subtle changes which might extend the execution time by as much as 200%. Because of this, three of the six transcoders might read past the end of the input by one, two, or three bytes before stopping. Note that support for streaming adds a 15% overhead to the dchar => char conversion, but has little effect on the others.

These routines were tuned on an Intel P4; other devices may work more efficiently with a slightly different approach, though this is likely to be reasonably optimal on AMD x86 CPUs also. These algorithms would benefit significantly from those extra AMD64 registers. On a 3GHz P4, the dchar/char conversions take around 2500ns to process an array of 1000 ASCII elements. Invoking the memory manager doubles that period, and quadruples the time for arrays of 100 elements. Memory allocation can slow down notably in a multi-threaded environment, so avoid that where possible.

Surrogate-pairs are dealt with in a non-optimal fashion when transcoding between utf16 and utf8. Such cases are considered to be boundary-conditions for this module.

There are three common cases where the input may be incomplete: the 'widening' conversions utf8 => utf16, utf8 => utf32, and utf16 => utf32. An edge case is utf16 => utf8, if surrogate pairs are present. Such cases throw an exception unless streaming mode is enabled; in streaming mode, an additional integer is returned indicating how many elements of the input have been consumed. In all cases, a correct slice of the output is returned.
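The "elements consumed" count described above can be pictured with a small helper that reports how many leading bytes of a UTF-8 buffer form complete sequences. This is only a sketch in present-day D; `completePrefix` is a hypothetical name and is not part of Mango, and how the library itself reports the count (presumably through the `ate` parameter) is not shown here.

```d
// Hypothetical helper, not part of Mango: returns how many leading bytes of
// `input` form complete UTF-8 sequences. Only lead bytes are inspected;
// continuation bytes are not validated in this sketch.
size_t completePrefix (const(char)[] input)
{
    size_t i = 0;
    while (i < input.length)
    {
        immutable uint b = input[i];
        size_t need;
        if (b < 0x80)                need = 1;  // ASCII
        else if ((b & 0xE0) == 0xC0) need = 2;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) need = 3;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) need = 4;  // 11110xxx
        else throw new Exception ("invalid UTF-8 lead byte");

        if (i + need > input.length)
            break;                              // sequence is cut off: stop here
        i += need;
    }
    return i;
}
```

A streaming caller would convert the first `completePrefix (chunk)` bytes and prepend the remaining tail to the next chunk.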

For details on Unicode processing see:
http://www.utf-8.com/
http://www.hackcraft.net/xmlUnicode/
http://www.azillionmonkeys.com/qed/unicode.html/
http://icu.sourceforge.net/docs/papers/forms_of_unicode/

Definition at line 89 of file Unicode.d.


Member Enumeration Documentation

anonymous enum
 

Enumeration values:
Unknown 
UTF_8 
UTF_8N 
UTF_16 
UTF_16BE 
UTF_16LE 
UTF_32 
UTF_32BE 
UTF_32LE 

Definition at line 92 of file Unicode.d.

anonymous enum
 

Enumeration values:
Unknown 
UTF_8 
UTF_8N 
UTF_16 
UTF_16BE 
UTF_16LE 
UTF_32 
UTF_32BE 
UTF_32LE 

Definition at line 53 of file Copy (2) of Unicode.d.

anonymous enum
 

Enumeration values:
Unknown 
UTF_8 
UTF_8N 
UTF_16 
UTF_16BE 
UTF_16LE 
UTF_32 
UTF_32BE 
UTF_32LE 

Definition at line 53 of file Copy (3) of Unicode.d.

anonymous enum
 

Enumeration values:
Unknown 
UTF_8 
UTF_8N 
UTF_16 
UTF_16BE 
UTF_16LE 
UTF_32 
UTF_32BE 
UTF_32LE 

Definition at line 53 of file Copy of Unicode.d.


Member Function Documentation

static bool isValid (int encoding) [inline, static]
 

Definition at line 108 of file Unicode.d.

References Unknown, and UTF_32LE.
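Since isValid() references only the first and last enumerators, one plausible reading is a simple range check. The sketch below is an assumption, not the documented implementation: the enum name `Encoding` and the function name `isValidSketch` are hypothetical (the original enum is anonymous), and only the enumerator names come from this page.

```d
// Enumerator names are taken from this page; the enum name `Encoding` and
// the range-check body are assumptions made for illustration.
enum Encoding
{
    Unknown, UTF_8, UTF_8N, UTF_16,
    UTF_16BE, UTF_16LE, UTF_32, UTF_32BE,
    UTF_32LE
}

// Hypothetical stand-in for isValid(): accepts any value between the first
// and last enumerators, inclusive.
bool isValidSketch (int encoding)
{
    return encoding >= Encoding.Unknown && encoding <= Encoding.UTF_32LE;
}
```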

static void error (char[] msg) [inline, static, private]
 

Definition at line 117 of file Unicode.d.

References Exception.

Referenced by toUtf16(), toUtf32(), and toUtf8().

static char[] toUtf8 (wchar[] input, char[] output=null, uint *ate=null) [inline, static, private]
 

Encode Utf8 up to a maximum of 4 bytes long (five & six byte variations are not supported).

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. For example:

char[] output;

char[] result = toUtf8 (input, output);

// reset output after a realloc
if (result.length > output.length)
    output = result;

Definition at line 154 of file Unicode.d.

References toUtf32().

Referenced by From(), and Into().

static wchar[] toUtf16 (char[] input, wchar[] output=null, uint *ate=null) [inline, static, private]
 

Decode Utf8 produced by the above toUtf8() method.

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls.

Definition at line 229 of file Unicode.d.

References error(), and toUtf32().

Referenced by From(), and Into().

static char[] toUtf8 (dchar[] input, char[] output=null, uint *ate=null) [inline, static, private]
 

Encode Utf8 up to a maximum of 4 bytes long (five & six byte variations are not supported). Throws an exception where the input dchar is greater than 0x10ffff.

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls.
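To make the "maximum of 4 bytes" rule concrete, here is a minimal sketch of encoding a single code point as UTF-8, written in present-day D. It is not Mango's tuned implementation; `encodeUtf8` is a hypothetical helper that appends to a growing array rather than filling a caller-supplied buffer.

```d
// Hypothetical helper, not Mango's implementation: appends the UTF-8 form of
// one code point to `dst` and returns the extended array. Surrogate code
// points (0xD800 to 0xDFFF) are not rejected in this sketch.
char[] encodeUtf8 (dchar c, char[] dst)
{
    if (c < 0x80)
        dst ~= cast(char) c;                            // 1 byte
    else if (c < 0x800)
    {
        dst ~= cast(char) (0xC0 | (c >> 6));            // 2 bytes
        dst ~= cast(char) (0x80 | (c & 0x3F));
    }
    else if (c < 0x10000)
    {
        dst ~= cast(char) (0xE0 | (c >> 12));           // 3 bytes
        dst ~= cast(char) (0x80 | ((c >> 6) & 0x3F));
        dst ~= cast(char) (0x80 | (c & 0x3F));
    }
    else if (c <= 0x10FFFF)
    {
        dst ~= cast(char) (0xF0 | (c >> 18));           // 4 bytes
        dst ~= cast(char) (0x80 | ((c >> 12) & 0x3F));
        dst ~= cast(char) (0x80 | ((c >> 6) & 0x3F));
        dst ~= cast(char) (0x80 | (c & 0x3F));
    }
    else
        throw new Exception ("invalid dchar: greater than 0x10ffff");
    return dst;
}
```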

Definition at line 313 of file Unicode.d.

References error().

static dchar[] toUtf32 (char[] input, dchar[] output=null, uint *ate=null) [inline, static, private]
 

Decode Utf8 produced by the above toUtf8() method.

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls.
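The decoding direction can be sketched in the same spirit. The helper below, in present-day D, decodes one UTF-8 sequence into a dchar and reports how many bytes it used; `decodeUtf8` is a hypothetical name, and the sketch throws on truncated input instead of offering the streaming behaviour described earlier.

```d
// Hypothetical helper, not Mango's implementation: decodes the UTF-8 sequence
// at the start of `input` and stores its byte length in `used`. Overlong
// encodings are not rejected in this sketch.
dchar decodeUtf8 (const(char)[] input, out size_t used)
{
    if (input.length == 0)
        throw new Exception ("empty input");

    immutable uint b = input[0];
    uint c;
    size_t need;
    if (b < 0x80)                { c = b;        need = 1; }
    else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; need = 2; }
    else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; need = 3; }
    else if ((b & 0xF8) == 0xF0) { c = b & 0x07; need = 4; }
    else throw new Exception ("invalid UTF-8 lead byte");

    if (need > input.length)
        throw new Exception ("incomplete UTF-8 sequence");

    foreach (i; 1 .. need)
    {
        immutable uint t = input[i];
        if ((t & 0xC0) != 0x80)
            throw new Exception ("invalid UTF-8 continuation byte");
        c = (c << 6) | (t & 0x3F);      // append six payload bits
    }
    used = need;
    return cast(dchar) c;
}
```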

Definition at line 396 of file Unicode.d.

References error().

Referenced by From(), Into(), toUtf16(), and toUtf8().

static wchar[] toUtf16 (dchar[] input, wchar[] output=null, uint *ate=null) [inline, static, private]
 

Encode Utf16, emitting at most two wchar elements (a surrogate pair) per input dchar. Throws an exception where the input dchar is greater than 0x10ffff.

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls.
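For reference, the standard UTF-16 encoding of a single code point can be sketched as follows in present-day D. `encodeUtf16` is a hypothetical helper, not Mango's code; it shows how code points above 0xFFFF are split into a surrogate pair and how values above 0x10ffff are rejected.

```d
// Hypothetical helper, not Mango's implementation: appends the UTF-16 form of
// one code point to `dst` and returns the extended array.
wchar[] encodeUtf16 (dchar c, wchar[] dst)
{
    if (c < 0x10000)
        dst ~= cast(wchar) c;                       // single code unit
    else if (c <= 0x10FFFF)
    {
        immutable uint v = cast(uint) c - 0x10000;  // 20 payload bits
        dst ~= cast(wchar) (0xD800 | (v >> 10));    // high surrogate
        dst ~= cast(wchar) (0xDC00 | (v & 0x3FF));  // low surrogate
    }
    else
        throw new Exception ("invalid dchar: greater than 0x10ffff");
    return dst;
}
```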

Definition at line 486 of file Unicode.d.

References error().

static dchar[] toUtf32 (wchar[] input, dchar[] output=null, uint *ate=null) [inline, static, private]
 

Decode Utf16 produced by the above toUtf16() method.

If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead.

Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls.
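The reverse step, recombining a surrogate pair into one dchar, can be sketched like this in present-day D. `decodeUtf16` is a hypothetical helper, not Mango's code; a lone high surrogate at the end of the input is the "incomplete input" case, which this sketch reports by throwing.

```d
// Hypothetical helper, not Mango's implementation: decodes the code point at
// the start of `input` and stores the number of wchar elements used in `used`.
dchar decodeUtf16 (const(wchar)[] input, out size_t used)
{
    if (input.length == 0)
        throw new Exception ("empty input");

    immutable uint hi = input[0];
    if (hi < 0xD800 || hi > 0xDFFF)
    {
        used = 1;                       // ordinary BMP character
        return cast(dchar) hi;
    }
    if (hi > 0xDBFF)
        throw new Exception ("unexpected low surrogate");
    if (input.length < 2)
        throw new Exception ("incomplete surrogate pair");

    immutable uint lo = input[1];
    if (lo < 0xDC00 || lo > 0xDFFF)
        throw new Exception ("invalid low surrogate");

    used = 2;
    return cast(dchar) (0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}
```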

Definition at line 550 of file Unicode.d.

References error().

struct Into (T) [inline, private]
 

Convert from an external coding of 'type' to an internally normalized representation of T.

T refers to the destination, whereas 'type' refers to the source.

Definition at line 616 of file Unicode.d.

References convert(), Into(), toUtf16(), toUtf32(), toUtf8(), and type().

Referenced by Into().

struct From (T) [inline, private]
 

Convert to an external coding of 'type' from an internally normalized representation of T.

T refers to the source, whereas 'type' is the destination.

Definition at line 691 of file Unicode.d.

References convert(), From(), toUtf16(), toUtf32(), toUtf8(), and type().

Referenced by From().

void dthis (int size=0) [inline]
 

Definition at line 69 of file Copy (2) of Unicode.d.

References tmp.

uint type () [inline]
 

Definition at line 74 of file Copy (2) of Unicode.d.

References _type.

Referenced by From(), and Into().

void[] update (void[] t) [inline, private]
 

Definition at line 79 of file Copy (2) of Unicode.d.

References tmp.

static bool isValid (int encoding) [inline, static]
 

Definition at line 90 of file Copy (2) of Unicode.d.

References Unknown, and UTF_32LE.

void[] convert (void[] src, uint srcType, uint dstType) [inline]
 

Definition at line 100 of file Copy (2) of Unicode.d.

References assert(), tmp, and Utf.

Referenced by From(), and Into().

static bool isValid (int encoding) [inline, static]
 

Definition at line 69 of file Copy (3) of Unicode.d.

References Unknown, and UTF_32LE.

struct Into (T) [inline]
 

Convert from an external coding of 'type' to an internally normalized representation of T.

T refers to the destination, whereas 'type' refers to the source.

Definition at line 84 of file Copy (3) of Unicode.d.

References convert(), Into(), type(), and Utf.

struct From (T) [inline]
 

alias Into!(char) IntoUtf8;

Convert to an external coding of 'type' from an internally normalized representation of T.

T refers to the source, whereas 'type' is the destination.

Definition at line 164 of file Copy (3) of Unicode.d.

References convert(), From(), type(), and Utf.

static bool isValid (int encoding) [inline, static]
 

Definition at line 69 of file Copy of Unicode.d.

References Unknown, and UTF_32LE.


Member Data Documentation

uint _type = Type.Utf16 [private]
 

Definition at line 65 of file Copy (2) of Unicode.d.

Referenced by type().

void[] tmp [private]
 

Definition at line 67 of file Copy (2) of Unicode.d.

Referenced by convert(), dthis(), and update().


The documentation for this struct was generated from the following files:
Unicode.d
Copy of Unicode.d
Copy (2) of Unicode.d
Copy (3) of Unicode.d
Generated on Sat Dec 24 17:28:43 2005 for Mango by  doxygen 1.4.0