Unicode in Emacs

Characters

Emacs Lisp has a special read syntax for characters: a leading question mark, as in ?à. The syntax is necessary to distinguish the character ?à from the symbol à.

Characters are Integers

Characters evaluate to integers:

?à
#+RESULTS:
224

We can see that ?à and 224 are equal in (almost?) every way:

(equal ?à 224)
#+RESULTS:
t
(eq ?à 224)
#+RESULTS:
t
(char-equal ?à 224)
#+RESULTS:
t
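One caveat the examples above do not show: char-equal, unlike eq and equal, folds case when case-fold-search is non-nil, so it can report two distinct integers as equal. A quick sketch:

```elisp
;; eq and equal compare the integers themselves; char-equal compares
;; them as characters, honoring case-fold-search.
(let ((case-fold-search t))
  (char-equal ?à ?À))   ; => t   (case is folded)
(eq ?à ?À)              ; => nil (224 vs. 192)
```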

Converting Between Chars and Integers

(string-to-char "à")
#+RESULTS:
224
(char-to-string 224)
#+RESULTS:
à

The argument to char-to-string must be a character, which can be given either as:

  1. A character using ? syntax
  2. An integer, using any of the ways to express an integer

For example, as an integer using hex notation:

(char-to-string #xe0)
#+RESULTS:
à

Or, as a character using hex notation:

(char-to-string ?\xe0)
#+RESULTS:
à

Why Have Two Representations?

The ? reader syntax simply allows you to express a number using a character. Sometimes that is very helpful and provides clarity (when you are dealing with characters), and sometimes it would be silly (when you are dealing with numbers).

The elisp manual notes:

whether an integer is a character or not is determined only by how it is used

(+ ?à 2) ; Usually not helpful
#+RESULTS:
226
(make-string 5 ?à) ; Helpful
#+RESULTS:
ààààà

Which, by the way, could also be written as:

(make-string 5 224) ; Not helpful
#+RESULTS:
ààààà

In Emacs, all characters are integers, but not all integers are characters. A character's corresponding integer is simply the Unicode number (i.e. the Unicode code point) of the character.

Unicode code points are defined from the integers #x0 to #x10FFFF. Going one past that upper bound leaves Unicode, though Emacs still accepts it as a character (see Code Points below):

(char-to-string (+ 1 #x10FFFF))

The ? read syntax for a character also allows for the character to be expressed as a hex number or an octal number. As before, the expression evaluates to a decimal number which represents the character's code point.

?\xe0
#+RESULTS:
224
?\340
#+RESULTS:
224

make-char

The function make-char returns the character (an integer) of the given charset at the given position codes within that charset; here, the unicode charset:

(make-char 'unicode 0 0 224 0)
#+RESULTS:
224

Unicode Escape Sequences

A character can be defined using a Unicode escape sequence. There are two forms:

  1. \uXXXX (\u and four hex digits)
  2. \U00XXXXXX (\U00 and six hex digits)

Evaluating a character with a Unicode escape sequence returns an integer:

?\u00e0
#+RESULTS:
224
?\U0001F638
#+RESULTS:
128568

Render the character using char-to-string:

(char-to-string ?\u00e0)
#+RESULTS:
à

Also, evaluating a string with a Unicode escape sequence returns a string:

"\U0001F638"
#+RESULTS:
😸

Convert Unicode Code Point to Character

The function (char-to-string CHAR) returns a one-character string for the code point CHAR. Unicode code points in the "U+2388" format are hex, so either use hex read syntax or convert them to decimal first.

Examples using Unicode Character "⎈" (U+2388):

(char-to-string ?\u2388)
#+RESULTS:
⎈
(char-to-string ?\x2388)
#+RESULTS:
⎈
"\u2388"
#+RESULTS:
⎈
"\N{HELM SYMBOL}"
#+RESULTS:
⎈

Convert Unicode name to character

The \N{NAME} escape allows you to specify a Unicode character by its name, both in the ? character syntax and inside strings:

?\N{LATIN SMALL LETTER A WITH GRAVE}
#+RESULTS:
224
"\N{LATIN SMALL LETTER A WITH GRAVE}"
#+RESULTS:
à

Encode a string

Encoding a string means translating its Unicode code points (integers) into a sequence of bytes, according to some coding system (like UTF-8). Without an encoding there is no way to tell where one number ends and the next begins:

1224

Is that the single character 1224: "ӈ"? Or two characters 12 and 24? Or… something else? UTF-8 encodes strings into a binary form that can be unambiguously reversed (decoded) back to Unicode code points.

Viewing encoded strings is sometimes difficult, because the binary form of a string is automatically decoded in order to be displayed.

(encode-coding-string "nai\u0308ve" 'utf-8 t)
#+RESULTS:
"nai\314\210ve"

encode-coding-string returns a unibyte string: a sequence of raw bytes. When such a string is printed, bytes outside the ASCII range are shown as octal escapes, because a raw byte above \177 has no character interpretation of its own.
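One way to see the raw bytes more plainly (a sketch, not the only way): convert the encoded string to a list of integers, which makes each byte visible:

```elisp
;; UTF-8 encodes U+00E0 (à) as the two bytes #xC3 #xA0.
(append (encode-coding-string "à" 'utf-8) nil)  ; => (195 160)
```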

(encode-coding-string "\u0073" 'utf-8)
#+RESULTS:
s
(encode-coding-string "\U0001F638" 'utf-8)
#+RESULTS:
"\360\237\230\270"

toggle-enable-multibyte-characters

Another way to see this is to write multibyte strings to a file, then run M-x toggle-enable-multibyte-characters.

Decode

Decoding a byte string returns its multibyte (character) equivalent.

(decode-coding-string "nai\314\210ve" 'utf-8)
#+RESULTS:
naïve
(decode-coding-string "\360\237\230\270" 'utf-8)
#+RESULTS:
😸
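Encoding and decoding are inverses, so a round trip through the same coding system should return the original string:

```elisp
;; Encode to UTF-8 bytes, then decode those bytes back to characters.
(decode-coding-string
 (encode-coding-string "naïve 😸" 'utf-8)
 'utf-8)
;; => "naïve 😸"
```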

Code Points

The Unicode range is 0 to #x10FFFF (hex).

Emacs extends this with the range #x110000 to #x3FFFFF, used for raw bytes and characters from charsets that have no Unicode equivalent.

A character code point in Emacs is therefore a 22-bit integer (#x3FFFFF = 2^22 - 1).
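The predicate characterp and the function max-char can confirm those bounds (assuming a reasonably recent Emacs):

```elisp
(characterp #x10FFFF)         ; => t   (last Unicode code point)
(characterp #x3FFFFF)         ; => t   (last Emacs character)
(characterp (1+ (max-char)))  ; => nil (past the 22-bit range)
(max-char)                    ; => 4194303, i.e. #x3FFFFF
```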

Convert a name or number to a character

Unicode name

(char-from-name "LATIN SMALL LETTER A WITH GRAVE")
#+RESULTS:
224

Unicode number

Unicode number as decimal:

(char-to-string 128568)
#+RESULTS:
😸

As hex

(char-to-string ?\x1F638)
#+RESULTS:
😸

As octal

(char-to-string ?\340)
#+RESULTS:
à

Normalize a string

NFC composes a string's characters where possible; NFD decomposes them. Composing the decomposed "nai\u0308ve":

(ucs-normalize-NFC-string "nai\u0308ve")
#+RESULTS:
"naïve"
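Normalization changes a string's length even when its display is identical; comparing the NFC and NFD forms of the same text makes that visible:

```elisp
(require 'ucs-normalize)
;; ï composed into one character vs. split into i + U+0308.
(length (ucs-normalize-NFC-string "nai\u0308ve"))  ; => 5
(length (ucs-normalize-NFD-string "naïve"))        ; => 6
```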

Elisp

A character in Emacs Lisp is nothing more than an integer:

Characters in strings and buffers are currently limited to the range of 0 to 4194303—twenty two bits

https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html

Unicode table

The variable ucs-names (in mule.el) holds a hash table mapping Unicode character names to code points.

The function ucs-names returns the fully expanded table of unicode data.
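Since the table is an ordinary hash table (in recent Emacs versions), it can be queried directly with gethash, which is roughly what char-from-name does under the hood:

```elisp
(gethash "LATIN SMALL LETTER A WITH GRAVE" (ucs-names))  ; => 224
;; The table holds tens of thousands of named characters:
(hash-table-count (ucs-names))
```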

Datasets

Resources

TODO

Does a character evaluate to different numbers under different coding systems?

Can Emacs interpret a byte array as characters? Example: an IPv4 address, which is often represented as 4 bytes.