XML Type Safety – Why is XML Type Safe?

type-safety, xml

Why do they say that XML provides type safety and how is it expressed in the XML itself?

How is it different from JSON (for example), which (as I understand it) is not type safe?

Best Answer

Because of the XML Schema Definition (XSD).

With XML, you can have an additional file which describes the schema. It indicates, for example, that the element /a/b is an array and contains from 1 to 10 elements, or that the element /a/c is an integer. You can find an example of an XSD here.

Validating an XML file against an XSD is supported by many languages. For example, a .NET application may request an XML file from an untrusted source and check that it matches the XSD; it can then save the file to a Microsoft SQL Server database, which can in turn hold an XSD and run the check again (to ensure that every client that has access to the database complies).
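To make the kind of constraint such a schema expresses more concrete, here is a minimal sketch that checks the two rules mentioned above by hand, using only Python's standard library. It is not XSD validation (the standard library does not include an XSD validator); the function name `check_document` and the sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def check_document(xml_text):
    """Return a list of rule violations (an empty list means the document passed).

    Hand-written stand-in for two schema rules; a real project would let an
    XSD validator enforce these instead.
    """
    errors = []
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "a":
        return ["root element must be <a>"]

    # Rule 1: /a/b must exist and contain between 1 and 10 child elements.
    b = root.find("b")
    if b is None:
        errors.append("/a/b is missing")
    elif not 1 <= len(b) <= 10:
        errors.append(f"/a/b has {len(b)} children, expected 1 to 10")

    # Rule 2: /a/c must exist and hold an integer.
    c = root.find("c")
    if c is None:
        errors.append("/a/c is missing")
    else:
        try:
            int((c.text or "").strip())
        except ValueError:
            errors.append(f"/a/c is not an integer: {c.text!r}")
    return errors


good = "<a><b><item/><item/></b><c>42</c></a>"
bad = "<a><b/><c>forty-two</c></a>"
print(check_document(good))  # []
print(check_document(bad))   # two violations, one per rule
```

The point of a schema language is that these rules live in a declarative file shared by every producer and consumer, rather than in code like this that each client has to rewrite.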

XSD is not the only language.

If you've done web development, you have certainly heard of Document Type Definition (DTD), a markup language which defines the structure of XML and is used especially in validation of HTML-related content. While it cannot do everything XSD can, such as ensuring that an element or an attribute contains an integer, it can still perform a number of structure checks.
RELAX NG has the benefit of being relatively simple compared to other schema languages, and it offers a compact, non-XML syntax in addition to its XML form.
Schematron is another “rule-based validation language for making assertions about the presence or absence of patterns in XML trees” (Wikipedia) and presents a slightly different approach, based on XPath assertions.

Similar initiatives for JSON are not that popular (especially, I believe, in the Microsoft-centric corporate world). One of the reasons is that JSON is intended for situations where the data structure is rather basic (i.e. it can be expressed as a tree, without the need for attributes, for instance) and doesn't necessarily need to be validated. An excellent example is a REST API used by a dynamically typed language:

the client is very easy and fast to implement,
the API is trusted not to change,
the client can easily deal with specific leaves where validation is necessary (for instance, check that /something/percentage is an actual number and is in the 0..100 range).
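The last point can be sketched in a few lines of Python. This is only an illustration: the function name `read_percentage` and the sample payload are invented, and the path /something/percentage is the one used as an example above.

```python
import json


def read_percentage(payload):
    """Extract /something/percentage and check it is a number in 0..100."""
    data = json.loads(payload)
    value = data["something"]["percentage"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"percentage must be a number, got {type(value).__name__}")
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value


print(read_percentage('{"something": {"percentage": 42.5}}'))  # 42.5
```

Only the leaf the client cares about is checked; the rest of the document is accepted as it comes, which is the trade-off described above.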

Type systems prevent errors

Type systems eliminate illegal programs. Consider the following Python code.

 a = 'foo'
 b = True
 c = a / b  # TypeError: unsupported operand type(s) for /: 'str' and 'bool'

In Python, this program fails at runtime: it throws an exception. In a statically typed language such as Java, C#, or Haskell, this isn't even a legal program; the compiler rejects it. You avoid these errors entirely because they simply aren't possible in the set of accepted programs.

Similarly, a better type system rules out more errors. If we jump up to super advanced type systems we can say things like this:

 Definition divide x (y : {n : integer | n /= 0}) = x / y

Now the type system guarantees that there aren't any divide-by-0 errors.
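Python cannot express a refinement type like the one above statically, but the idea can be approximated with a wrapper type that refuses to hold zero. The sketch below is an assumption-laden illustration, not the dependently typed version: the names `NonZero` and `divide` are made up, and the check runs once at construction time (at runtime) instead of at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonZero:
    """An integer known to be non-zero; the check happens once, at construction."""
    value: int

    def __post_init__(self):
        if self.value == 0:
            raise ValueError("NonZero cannot wrap 0")


def divide(x: int, y: NonZero) -> float:
    # No zero check needed here: a NonZero instance cannot hold 0.
    return x / y.value


print(divide(10, NonZero(4)))  # 2.5
```

The failure moves from the division to the place where the value enters the program, which is usually where it is easiest to handle.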

What sort of errors

Here's a brief list of the errors type systems can prevent:

1. Out-of-range errors
2. SQL injection
3. Generalizing 2, many safety issues (what taint checking is for in Perl)
4. Out-of-sequence errors (forgetting to call init)
5. Forcing a subset of values to be used (for example, only integers greater than 0)
6. ~~Nefarious kittens~~ (Yes, it was a joke)
7. Loss-of-precision errors
8. Software transactional memory (STM) errors (this needs purity, which also requires types)
9. Generalizing 8, controlling side effects
10. Invariants over data structures (is a binary tree balanced?)
11. Forgetting an exception or throwing the wrong one
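As one illustration, the out-of-sequence item (forgetting to call init) can be modelled by giving each state its own type, so the operation that requires initialization simply does not exist on the uninitialized object. This is a rough sketch with hypothetical class names (`Unopened`, `Connection`); in Python the mistake is caught by a static checker such as mypy or by an AttributeError at runtime, not by a compiler.

```python
class Connection:
    """Ready-to-use state; by convention obtained only via Unopened.init()."""

    def __init__(self, name: str):
        self._name = name

    def send(self, message: str) -> str:
        return f"{self._name} <- {message}"


class Unopened:
    """Initial state. It deliberately has no send() method."""

    def __init__(self, name: str):
        self._name = name

    def init(self) -> Connection:
        return Connection(self._name)


conn = Unopened("db").init()
print(conn.send("ping"))  # db <- ping
```

Calling `Unopened("db").send("ping")` is then a type error, not a silent misuse of a half-initialized object.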

And remember, all of this happens at compile time. There's no need to write tests with 100% code coverage just to check for type errors; the compiler does it for you :)

Case study: Typed lambda calculus

Alright, let's examine the simplest of all type systems, the simply typed lambda calculus.

Basically there are two kinds of types:

Type = Unit | Type -> Type

And all terms are either variables, lambdas, or applications. Based on this, we can prove that any well-typed program terminates. There is never a situation where the program gets stuck or loops forever. This isn't provable in the untyped lambda calculus because, well, it isn't true.

Think about this: we can use type systems to guarantee that our program doesn't loop forever. Rather cool, right?
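To make this concrete, here is a small type checker for the calculus described above, written as a sketch in Python. The class and function names are invented for this example, and one assumption goes beyond the description: a single constant of type Unit (`UnitValue`) is added so that there is something to apply functions to.

```python
from dataclasses import dataclass


# Types: Type = Unit | Type -> Type
@dataclass(frozen=True)
class Unit:
    pass


@dataclass(frozen=True)
class Arrow:
    param: object
    result: object


# Terms: variables, lambdas (with an annotated parameter), applications,
# plus one constant of type Unit (an assumption made for this sketch).
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    param_type: object
    body: object


@dataclass(frozen=True)
class App:
    func: object
    arg: object


@dataclass(frozen=True)
class UnitValue:
    pass


def type_of(term, env=None):
    """Return the type of `term`, or raise TypeError if it is ill-typed."""
    env = env or {}
    if isinstance(term, UnitValue):
        return Unit()
    if isinstance(term, Var):
        if term.name not in env:
            raise TypeError(f"unbound variable {term.name}")
        return env[term.name]
    if isinstance(term, Lam):
        body_type = type_of(term.body, {**env, term.param: term.param_type})
        return Arrow(term.param_type, body_type)
    if isinstance(term, App):
        func_type = type_of(term.func, env)
        arg_type = type_of(term.arg, env)
        if not isinstance(func_type, Arrow) or func_type.param != arg_type:
            raise TypeError(f"cannot apply {func_type} to {arg_type}")
        return func_type.result
    raise TypeError(f"unknown term {term!r}")


identity = Lam("x", Unit(), Var("x"))
print(type_of(App(identity, UnitValue())))  # Unit()

# Self-application (x x), the core of the classic looping term, is rejected.
self_app = Lam("x", Arrow(Unit(), Unit()), App(Var("x"), Var("x")))
try:
    type_of(self_app)
except TypeError as exc:
    print("rejected:", exc)
```

Whatever type is chosen for `x`, the term `x x` needs `x` to be both a function and a valid argument to itself, which no type in this grammar allows. That is why the usual non-terminating terms cannot be written.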

Detour into dynamic types

Dynamic type systems can offer many of the same guarantees as static type systems, but at runtime rather than at compile time. In fact, because the checks run at runtime, they can work with more information, such as the actual values involved. You do lose some guarantees, however, particularly about static properties like termination.

So dynamic types don't rule out certain programs, but rather route malformed programs to well-defined actions, like throwing exceptions.
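A small Python sketch of that behaviour: the ill-typed division is not rejected ahead of time, but when it runs the interpreter raises a well-defined exception that the program can handle. The helper name `try_divide` is made up for this example.

```python
def try_divide(a, b):
    """Attempt a / b and report how the runtime handled it."""
    try:
        return ("ok", a / b)
    except TypeError as exc:
        # The dynamic type check fired: the operation was refused, not
        # executed with garbage.
        return ("type error", str(exc))


print(try_divide(6, 3))         # ('ok', 2.0)
print(try_divide("foo", True))  # ('type error', ...) with the interpreter's message
```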

TLDR

So the long and the short of it is that type systems rule out certain programs. Many of those programs are broken in some way, so with type systems we avoid these broken programs.

Database Management – Moving Data Between Databases with XML

XML has two very important attributes that make it attractive for data transfer between heterogeneous systems:

It is plain text, so you can usually pass it through firewalls (for example, over HTTP), and
You can usually find reader/writer libraries already written to create and parse it.

If you're looking for something less verbose that still has both of these attributes, you can try using JSON.
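As a rough illustration of both points, the sketch below exports a couple of rows to XML and to JSON and reads them back, using only Python's standard library. The row data, the element names, and the helper names `rows_to_xml` and `xml_to_rows` are all invented for this example; a real transfer would follow whatever format the two systems agree on.

```python
import json
import xml.etree.ElementTree as ET

rows = [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Grace"}]


def rows_to_xml(rows):
    """Serialize rows as <rows><row id="..."><name>...</name></row></rows>."""
    root = ET.Element("rows")
    for row in rows:
        element = ET.SubElement(root, "row", id=row["id"])
        ET.SubElement(element, "name").text = row["name"]
    return ET.tostring(root, encoding="unicode")


def xml_to_rows(xml_text):
    """Parse the format produced by rows_to_xml back into dictionaries."""
    return [
        {"id": element.get("id"), "name": element.findtext("name")}
        for element in ET.fromstring(xml_text).iter("row")
    ]


xml_text = rows_to_xml(rows)
assert xml_to_rows(xml_text) == rows  # round trip through XML

json_text = json.dumps(rows)
assert json.loads(json_text) == rows  # round trip through JSON

# Compare the size of the two encodings of the same data.
print(len(xml_text), len(json_text))
```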

If you're simply transferring data between two homogeneous databases on the same network, there are probably easier ways. For example, Microsoft SQL Server has at least three different ways to transfer data between databases: Bulk Insert, SSIS, and Replication.