Unicode in Emacs

Characters

Emacs Lisp has a special read syntax for characters: a leading question mark, as in ?à. The syntax is necessary to distinguish the character ?à from the symbol à.

Characters are Integers

Characters evaluate to integers:

?à
#+RESULTS:
224

We can see that ?à and 224 are equal in (almost?) every way:

(equal ?à 224)
#+RESULTS:
t
(eq ?à 224)
#+RESULTS:
t
(char-equal ?à 224)
#+RESULTS:
t
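One caveat the examples above do not show: char-equal, unlike eq and equal, folds case when case-fold-search is non-nil, so it can report two distinct integers as equal. A quick sketch:

```elisp
;; eq and equal compare the integers themselves; char-equal compares
;; them as characters, honoring case-fold-search.
(let ((case-fold-search t))
  (char-equal ?à ?À))   ; => t   (case is folded)
(eq ?à ?À)              ; => nil (224 vs. 192)
```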

Converting Between Chars and Integers

(string-to-char "à")
#+RESULTS:
224
(char-to-string 224)
#+RESULTS:
à

The argument to char-to-string must be a character, which can be given either as:

  1. A character using ? syntax
  2. An integer, using any of the ways to express an integer

For example, as an integer using hex notation:

(char-to-string #xe0)
#+RESULTS:
à

Or, as a character using hex notation:

(char-to-string ?\xe0)
#+RESULTS:
à

Why Have Two Representations?

The ? reader syntax simply allows you to express a number using a character. Sometimes that is very helpful and provides clarity (when you are dealing with characters), and sometimes it would be silly (when you are dealing with numbers).

The elisp manual notes:

whether an integer is a character or not is determined only by how it is used

(+ ?à 2) ; Usually not helpful
#+RESULTS:
226
(make-string 5 ?à) ; Helpful
#+RESULTS:
ààààà

Which, by the way, could also be written as:

(make-string 5 224) ; Not helpful
#+RESULTS:
ààààà

In Emacs, all characters are integers, but not all integers are characters. A character's corresponding integer is simply the Unicode number (i.e. the Unicode code point) of the character.

Unicode code points are defined from the integers #x0 to #x10FFFF. Going one past that upper bound leaves Unicode, though Emacs still accepts it as a character (see Code Points below):

(char-to-string (+ 1 #x10FFFF))

The ? read syntax for a character also allows for the character to be expressed as a hex number or an octal number. As before, the expression evaluates to a decimal number which represents the character's code point.

?\xe0
#+RESULTS:
224
?\340
#+RESULTS:
224

make-char

The function make-char returns the character (an integer) of the given charset at the given position codes within that charset; here, the unicode charset:

(make-char 'unicode 0 0 224 0)
#+RESULTS:
224

Unicode Escape Sequences

A character can be defined using a Unicode escape sequence. There are two forms:

  1. \uXXXX (\u and four hex digits)
  2. \U00XXXXXX (\U00 and six hex digits)

Evaluating a character with a Unicode escape sequence returns an integer:

?\u00e0
#+RESULTS:
224
?\U0001F638
#+RESULTS:
128568

Render the character using char-to-string:

(char-to-string ?\u00e0)
#+RESULTS:
à

Also, evaluating a string with a Unicode escape sequence returns a string:

"\U0001F638"
#+RESULTS:
😸

Convert Unicode Code Point to Character

The function (char-to-string CHAR) returns a one-character string for the code point CHAR. Unicode code points in the "U+2388" format are hex, so either use hex read syntax or convert them to decimal first.

Examples using Unicode Character "⎈" (U+2388):

(char-to-string ?\u2388)
#+RESULTS:
⎈
(char-to-string ?\x2388)
#+RESULTS:
⎈
"\u2388"
#+RESULTS:
⎈
"\N{HELM SYMBOL}"
#+RESULTS:
⎈

Convert Unicode name to character

The \N{NAME} escape allows you to specify a Unicode character by its name, both in the ? character syntax and inside strings:

?\N{LATIN SMALL LETTER A WITH GRAVE}
#+RESULTS:
224
"\N{LATIN SMALL LETTER A WITH GRAVE}"
#+RESULTS:
à

Encode a string

Encoding a string means translating its Unicode code points (integers) into a sequence of bytes, according to some coding system (like UTF-8). Without an encoding there is no way to tell where one number ends and the next begins:

1224

Is that the single character 1224: "ӈ"? Or two characters 12 and 24? Or… something else? UTF-8 encodes strings into a binary form that can be unambiguously reversed (decoded) back to Unicode code points.

Viewing encoded strings is sometimes difficult, because the binary form of a string is automatically decoded in order to be displayed.

(encode-coding-string "nai\u0308ve" 'utf-8 t)
#+RESULTS:
"nai\314\210ve"

encode-coding-string returns a unibyte string: a sequence of raw bytes. When such a string is printed, bytes outside the ASCII range are shown as octal escapes, because a raw byte above \177 has no character interpretation of its own.
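One way to see the raw bytes more plainly (a sketch, not the only way): convert the encoded string to a list of integers, which makes each byte visible:

```elisp
;; UTF-8 encodes U+00E0 (à) as the two bytes #xC3 #xA0.
(append (encode-coding-string "à" 'utf-8) nil)  ; => (195 160)
```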

(encode-coding-string "\u0073" 'utf-8)
#+RESULTS:
s
(encode-coding-string "\U0001F638" 'utf-8)
#+RESULTS:
"\360\237\230\270"

toggle-enable-multibyte-characters

Another way to see this is to write multibyte strings to a file, then run M-x toggle-enable-multibyte-characters.

Decode

Decoding a byte string returns its multibyte (character) equivalent.

(decode-coding-string "nai\314\210ve" 'utf-8)
#+RESULTS:
naïve
(decode-coding-string "\360\237\230\270" 'utf-8)
#+RESULTS:
😸
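Encoding and decoding are inverses, so a round trip through the same coding system should return the original string:

```elisp
;; Encode to UTF-8 bytes, then decode those bytes back to characters.
(decode-coding-string
 (encode-coding-string "naïve 😸" 'utf-8)
 'utf-8)
;; => "naïve 😸"
```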

Code Points

The Unicode range is 0 to #x10FFFF (hex).

Emacs extends this with the range #x110000 to #x3FFFFF, used for raw bytes and characters from charsets that have no Unicode equivalent.

A character code point in Emacs is therefore a 22-bit integer (#x3FFFFF = 2^22 - 1).
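The predicate characterp and the function max-char can confirm those bounds (assuming a reasonably recent Emacs):

```elisp
(characterp #x10FFFF)         ; => t   (last Unicode code point)
(characterp #x3FFFFF)         ; => t   (last Emacs character)
(characterp (1+ (max-char)))  ; => nil (past the 22-bit range)
(max-char)                    ; => 4194303, i.e. #x3FFFFF
```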

Convert a name or number to a character

Unicode name

(char-from-name "LATIN SMALL LETTER A WITH GRAVE")
#+RESULTS:
224

Unicode number

Unicode number as decimal:

(char-to-string 128568)
#+RESULTS:
😸

As hex

(char-to-string ?\x1F638)
#+RESULTS:
😸

As octal

(char-to-string ?\340)
#+RESULTS:
à

Normalize a string

NFC composes a string's characters where possible; NFD decomposes them. Composing the decomposed "nai\u0308ve":

(ucs-normalize-NFC-string "nai\u0308ve")
#+RESULTS:
"naïve"
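Normalization changes a string's length even when its display is identical; comparing the NFC and NFD forms of the same text makes that visible:

```elisp
(require 'ucs-normalize)
;; ï composed into one character vs. split into i + U+0308.
(length (ucs-normalize-NFC-string "nai\u0308ve"))  ; => 5
(length (ucs-normalize-NFD-string "naïve"))        ; => 6
```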

Elisp

A character in Emacs Lisp is nothing more than an integer:

Characters in strings and buffers are currently limited to the range of 0 to 4194303—twenty two bits

https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html

Unicode table

The variable ucs-names (in mule.el) holds a hash table mapping Unicode character names to code points.

The function ucs-names returns the fully expanded table of unicode data.
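Since the table is an ordinary hash table (in recent Emacs versions), it can be queried directly with gethash, which is roughly what char-from-name does under the hood:

```elisp
(gethash "LATIN SMALL LETTER A WITH GRAVE" (ucs-names))  ; => 224
;; The table holds tens of thousands of named characters:
(hash-table-count (ucs-names))
```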

Datasets

Resources

TODO

Does a character evaluate to different numbers under different coding systems?

Can Emacs interpret a byte array as characters? Example: an IPv4 address, which is often represented as 4 bytes.