Public Types | |
enum | { Unknown, UTF_8, UTF_8N, UTF_16, UTF_16BE, UTF_16LE, UTF_32, UTF_32BE, UTF_32LE } |
Public Member Functions | |
void | this (int size=0)
uint | type () |
void[] | convert (void[] src, uint srcType, uint dstType) |
struct | Into (T) |
struct | From (T) |
alias Into!(char) IntoUtf8; | |
Static Public Member Functions | |
static bool | isValid (int encoding) |
Private Member Functions | |
void[] | update (void[] t) |
Static Private Member Functions | |
static void | error (char[] msg) |
static char[] | toUtf8 (wchar[] input, char[] output=null, uint *ate=null) |
static wchar[] | toUtf16 (char[] input, wchar[] output=null, uint *ate=null) |
static char[] | toUtf8 (dchar[] input, char[] output=null, uint *ate=null) |
static dchar[] | toUtf32 (char[] input, dchar[] output=null, uint *ate=null) |
static wchar[] | toUtf16 (dchar[] input, wchar[] output=null, uint *ate=null) |
static dchar[] | toUtf32 (wchar[] input, dchar[] output=null, uint *ate=null) |
Private Attributes | |
uint | _type = Type.Utf16 |
void[] | tmp |
These routines were tuned on an Intel P4; other processors may work more efficiently with a slightly different approach, though this one is likely to be reasonably optimal on AMD x86 CPUs as well. These algorithms would benefit significantly from the extra AMD64 registers. On a 3GHz P4, the dchar/char conversions take around 2500ns to process an array of 1000 ASCII elements. Invoking the memory manager doubles that period, and quadruples it for arrays of 100 elements. Memory allocation can also slow things down notably in a multi-threaded environment, so avoid it where possible.
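For example, a minimal sketch of amortising that allocation cost across many calls (the 'inputs' and 'process' names here are hypothetical):

    // hedged sketch: hoist the output buffer out of the loop, so the memory
    // manager is only invoked when the buffer genuinely needs to grow
    char[] output = new char[256];

    foreach (wchar[] input; inputs)
    {
        char[] result = toUtf8 (input, output);

        // adopt the larger buffer after any realloc
        if (result.length > output.length)
            output = result;

        process (result);
    }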
Surrogate pairs are handled in a non-optimal fashion when transcoding between utf16 and utf8; such cases are considered boundary conditions for this module.
There are three common cases where the input may be incomplete: the 'widening' conversions of utf8 => utf16, utf8 => utf32, and utf16 => utf32. An additional edge-case is utf16 => utf8 when surrogate pairs are present. Such cases will throw an exception unless streaming-mode is enabled ~ in the latter mode, an additional integer is returned indicating how many elements of the input have been consumed. In all cases, a correct slice of the output is returned.
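For example, a minimal streaming sketch (it is assumed here, based on the signatures above, that passing a non-null 'ate' enables streaming-mode and receives the consumed count; 'chunk' is a hypothetical buffer of utf8 bytes received so far):

    // hedged sketch: decode a possibly-truncated utf8 chunk in streaming mode
    uint ate;
    wchar[] output;

    wchar[] decoded = toUtf16 (chunk, output, &ate);

    // elements beyond 'ate' were not consumed; carry them into the next chunk
    char[] leftover = chunk[ate .. chunk.length];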
For details on Unicode processing see $(LINK http://www.utf-8.com/) $(LINK http://www.hackcraft.net/xmlUnicode/) $(LINK http://www.azillionmonkeys.com/qed/unicode.html/) $(LINK http://icu.sourceforge.net/docs/papers/forms_of_unicode/)
Definition at line 89 of file Unicode.d.
Definition at line 117 of file Unicode.d. References Exception.
Encode Utf8 up to a maximum of 4 bytes per character (five & six byte variations are not supported). If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. For example:

    char[] output;

    char[] result = toUtf8 (input, output);

    // reset output after a realloc
    if (result.length > output.length)
        output = result;

Definition at line 154 of file Unicode.d. References toUtf32().
Decode Utf8 produced by the above toUtf8() method. If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. Definition at line 229 of file Unicode.d.
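For example, a minimal round-trip sketch (hypothetical input):

    // hedged sketch: encode to utf8 and decode back again
    wchar[] text = "hello"w;
    wchar[] output;

    char[] utf8 = toUtf8 (text);
    wchar[] decoded = toUtf16 (utf8, output);

    assert (decoded == text);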
Encode Utf8 up to a maximum of 4 bytes per character (five & six byte variations are not supported). Throws an exception where the input dchar is greater than 0x10ffff. If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. Definition at line 313 of file Unicode.d. References error().
Decode Utf8 produced by the above toUtf8() method. If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. Definition at line 396 of file Unicode.d. References error().
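For instance, decoding to utf32 yields one dchar per code point (hypothetical input):

    // hedged sketch: the two-byte utf8 sequence for U+00E9 becomes one dchar
    char[] input = "h\u00e9llo";      // six char elements
    dchar[] output;

    dchar[] result = toUtf32 (input, output);
    assert (result.length == 5);      // five code points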
Encode Utf16 up to a maximum of two wchars per character (a surrogate pair). Throws an exception where the input dchar is greater than 0x10ffff. If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. Definition at line 486 of file Unicode.d. References error().
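For instance, a supplementary-plane character encodes to a surrogate pair (hypothetical input):

    // hedged sketch: U+1D11E lies above the BMP, so it becomes two wchars
    dchar[] input = [cast(dchar) 0x1D11E];
    wchar[] output;

    wchar[] result = toUtf16 (input, output);
    assert (result.length == 2);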
Decode Utf16 produced by the above toUtf16() method. If the output is provided off the stack, it should be large enough to encompass the entire transcoding; failing to do so will cause the output to be moved onto the heap instead. Returns a slice of the output buffer, corresponding to the converted characters. For optimum performance, the returned buffer should be specified as 'output' on subsequent calls. Definition at line 550 of file Unicode.d. References error().
Convert from an external coding of 'type' to an internally normalized representation of T. T refers to the destination, whereas 'type' refers to the source. Definition at line 616 of file Unicode.d. References convert(), Into(), toUtf16(), toUtf32(), toUtf8(), and type(). Referenced by Into().
Convert to an external coding of 'type' from an internally normalized representation of T. T refers to the source, whereas 'type' is the destination. Definition at line 691 of file Unicode.d. References convert(), From(), toUtf16(), toUtf32(), toUtf8(), and type(). Referenced by From().
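Into and From reference the instance-level convert() member listed above. A hypothetical sketch of that underlying call (it is assumed here that the documented type is a class named Unicode and that the enum values above serve as the 'type' arguments):

    // hedged sketch: normalize external utf16 data to utf8 via convert()
    Unicode unicode = new Unicode;

    wchar[] external = "external utf16 text"w;

    void[] normalized = unicode.convert (external, Unicode.UTF_16, Unicode.UTF_8);
    char[] utf8 = cast(char[]) normalized;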