Sql – Understanding clustered index

indexingsqlsql server

Since PostgreSQL doesn't support clustered indexes, I'm considering MSSQL server. I've read the article comparing clustered and non-clustered indexes. The gist of the article is that (emphasize mine):

Non clustered indexes store both a value and a pointer to the actual
row that holds that value.

And

Clustered indexes don’t need to store a pointer to the actual row
because of the fact that the rows in the table are stored on disk in
the same exact order as the clustered index

As I was told there and there it was very difficult to support the physical ordering of the table's data, especially if the table is splitted among multiple drives. And now, I meet the clustered index concept assuming that data stored in some order physically. This's what I was confused by.

Question: What is the clustered index structure? Does it support tree-like structure to traverse over, like PosgtreSQL does for btree indexes?

Best Answer

In SQL Server, indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows. Each index row contains a key value and a pointer to either an intermediate level page in the B-tree, or a data row in the leaf level of the index. The pages in each level of the index are linked in a doubly-linked list.

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures; one in each partition.

for ref.

https://technet.microsoft.com/en-us/library/ms177443(v=sql.105).aspx http://www.sqlservercentral.com/blogs/practicalsqldba/2013/03/14/sql-server-part-4-explaining-the-non-clustered-index-structure-/

Caveats follow

Note that while this is perhaps the cleanest way to answer the question as asked, index is a rather weak component of the list API, and I can't remember the last time I used it in anger. It's been pointed out to me in the comments that because this answer is heavily referenced, it should be made more complete. Some caveats about list.index follow. It is probably worth initially taking a look at the documentation for it:

list.index(x[, start[, end]])
Return zero-based index in the list of the first item whose value is equal to x. Raises a ValueError if there is no such item.

The optional arguments start and end are interpreted as in the slice notation and are used to limit the search to a particular subsequence of the list. The returned index is computed relative to the beginning of the full sequence rather than the start argument.

Linear time-complexity in list length

An index call checks every element of the list in order, until it finds a match. If your list is long, and you don't know roughly where in the list it occurs, this search could become a bottleneck. In that case, you should consider a different data structure. Note that if you know roughly where to find the match, you can give index a hint. For instance, in this snippet, l.index(999_999, 999_990, 1_000_000) is roughly five orders of magnitude faster than straight l.index(999_999), because the former only has to search 10 entries, while the latter searches a million:

>>> import timeit
>>> timeit.timeit('l.index(999_999)', setup='l = list(range(0, 1_000_000))', number=1000)
9.356267921015387
>>> timeit.timeit('l.index(999_999, 999_990, 1_000_000)', setup='l = list(range(0, 1_000_000))', number=1000)
0.0004404920036904514

Only returns the index of the first match to its argument

A call to index searches through the list in order until it finds a match, and stops there. If you expect to need indices of more matches, you should use a list comprehension, or generator expression.

>>> [1, 1].index(1)
0
>>> [i for i, e in enumerate([1, 2, 1]) if e == 1]
[0, 2]
>>> g = (i for i, e in enumerate([1, 2, 1]) if e == 1)
>>> next(g)
0
>>> next(g)
2

Most places where I once would have used index, I now use a list comprehension or generator expression because they're more generalizable. So if you're considering reaching for index, take a look at these excellent Python features.

Throws if element not present in list

A call to index results in a ValueError if the item's not present.

>>> [1, 1].index(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 2 is not in list

If the item might not be present in the list, you should either

Check for it first with item in my_list (clean, readable approach), or
Wrap the index call in a try/except block which catches ValueError (probably faster, at least when the list to search is long, and the item is usually present.)

Python – How to remove an element from a list by index

Use del and specify the index of the element you want to delete:

>>> a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> del a[-1]
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Also supports slices:

>>> del a[2:4]
>>> a
[0, 1, 4, 5, 6, 7, 8, 9]

Here is the section from the tutorial.

Best Answer

Related Solutions

Python – Finding the index of an item in a list

Caveats follow

Linear time-complexity in list length

Only returns the index of the first match to its argument

Throws if element not present in list

Python – How to remove an element from a list by index

Related Topic