What’s the difference between a character, a code point, a glyph and a grapheme

stringterminologyunicode

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes – concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other – is causing me trouble.

Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. …

…

Character. … (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. …

…

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

…

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. …

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?

Best Answer

Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.
A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.
A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.

Model-View-Presenter

In MVP, the Presenter contains the UI business logic for the View. All invocations from the View delegate directly to the Presenter. The Presenter is also decoupled directly from the View and talks to it through an interface. This is to allow mocking of the View in a unit test. One common attribute of MVP is that there has to be a lot of two-way dispatching. For example, when someone clicks the "Save" button, the event handler delegates to the Presenter's "OnSave" method. Once the save is completed, the Presenter will then call back the View through its interface so that the View can display that the save has completed.

MVP tends to be a very natural pattern for achieving separated presentation in WebForms. The reason is that the View is always created first by the ASP.NET runtime. You can find out more about both variants.

Two primary variations

Passive View: The View is as dumb as possible and contains almost zero logic. A Presenter is a middle man that talks to the View and the Model. The View and Model are completely shielded from one another. The Model may raise events, but the Presenter subscribes to them for updating the View. In Passive View there is no direct data binding, instead, the View exposes setter properties that the Presenter uses to set the data. All state is managed in the Presenter and not the View.

Pro: maximum testability surface; clean separation of the View and Model
Con: more work (for example all the setter properties) as you are doing all the data binding yourself.

Supervising Controller: The Presenter handles user gestures. The View binds to the Model directly through data binding. In this case, it's the Presenter's job to pass off the Model to the View so that it can bind to it. The Presenter will also contain logic for gestures like pressing a button, navigation, etc.

Pro: by leveraging data binding the amount of code is reduced.
Con: there's a less testable surface (because of data binding), and there's less encapsulation in the View since it talks directly to the Model.

Model-View-Controller

In the MVC, the Controller is responsible for determining which View to display in response to any action including when the application loads. This differs from MVP where actions route through the View to the Presenter. In MVC, every action in the View correlates with a call to a Controller along with an action. In the web, each action involves a call to a URL on the other side of which there is a Controller who responds. Once that Controller has completed its processing, it will return the correct View. The sequence continues in that manner throughout the life of the application:

    Action in the View
        -> Call to Controller
        -> Controller Logic
        -> Controller returns the View.

One other big difference about MVC is that the View does not directly bind to the Model. The view simply renders and is completely stateless. In implementations of MVC, the View usually will not have any logic in the code behind. This is contrary to MVP where it is absolutely necessary because, if the View does not delegate to the Presenter, it will never get called.

Presentation Model

One other pattern to look at is the Presentation Model pattern. In this pattern, there is no Presenter. Instead, the View binds directly to a Presentation Model. The Presentation Model is a Model crafted specifically for the View. This means this Model can expose properties that one would never put on a domain model as it would be a violation of separation-of-concerns. In this case, the Presentation Model binds to the domain model and may subscribe to events coming from that Model. The View then subscribes to events coming from the Presentation Model and updates itself accordingly. The Presentation Model can expose commands which the view uses for invoking actions. The advantage of this approach is that you can essentially remove the code-behind altogether as the PM completely encapsulates all of the behavior for the view. This pattern is a very strong candidate for use in WPF applications and is also called Model-View-ViewModel.

There is a MSDN article about the Presentation Model and a section in the Composite Application Guidance for WPF (former Prism) about Separated Presentation Patterns

C# – the difference between String and string in C#

string is an alias in C# for System.String.
So technically, there is no difference. It's like int vs. System.Int32.

As far as guidelines, it's generally recommended to use string any time you're referring to an object.

e.g.

string place = "world";

Likewise, I think it's generally recommended to use String if you need to refer specifically to the class.

e.g.

string greet = String.Format("Hello {0}!", place);

This is the style that Microsoft tends to use in their examples.

It appears that the guidance in this area may have changed, as StyleCop now enforces the use of the C# specific aliases.