SQL, Chapter 12: Indexes and CREATE INDEX

Indexes speed up searches for data. While they are very effective at doing this, they also represent an extra operation when inserting, updating and deleting data, since every table update also requires an update of the index.

Sequential Access

To find a row that meets some set of criteria in a single, unsorted column of raw data, it is necessary to look at the rows one after the other until the row meeting the criteria is found. This is called This is called a sequential scan. Sequential scans are slow and scale poorly.

Indexed Access

An index allows us to categorize rows in some way, and quickly narrow down our search based on that categorization. As an example, suppose we have a phone book and want to look up a number for someone named Jane Smith. We open the book a little after the middle and find the name Peterson. Then, we flip about half the remaining pages over and find Szymanski. Next, we go back about 10 pages and find David Smith. Two pages later, we find Jane Smith.

This is very similar to the way that database indexes work.

A Use Case for an Index

Suppose I work for a large corporation, let’s say with 250,000 employees. It’s my job to maintain the employee phone and email directory.

My employee table has this structure:

***Employees***
Column	Type	Constraints
id	integer	PRIMARY KEY
last_name	text
first_name	text
phone	text
email	text

I want to be able to provide the ability to look up phone numbers and emails. I can already do this easily if I know the id number:


SELECT last_name, first_name, phone, email
  FROM Employees
 WHERE id = 221000

SELECT last_name, first_name, phone, email

FROM Employees

WHERE id = 221000

This will return the name, phone and email almost instantaneously. But if I try this:


SELECT last_name, first_name, phone, email
  FROM Employees
 WHERE last_name = 'Rodes' AND first_name = 'Robert'

SELECT last_name, first_name, phone, email

FROM Employees

WHERE last_name = 'Rodes' AND first_name = 'Robert'

This will return the same data, but it will take perhaps as long as a minute.

The reason for the discrepancy is that primary key fields are automatically indexed, so the id lookup can use indexed access. However, the last_name and first_name fields are not indexed, so sequential access is necessary. The DBMS must scan the entire table, checking each row one by one, to see whether the name fields both match.

We’ll say that the lookup requirement specifies that lookup by name is necessary, this is a good candidate for an index. We can create a composite index that sorts by these two fields.

`CREATE INDEX`

SQL uses CREATE INDEX to create an index. `CREATE INDEX has this basic syntax:

CREATE [ UNIQUE ] INDEX [ name ] ON table_name
( column_name [, …] )
[ WHERE predicate ]

Where:

UNIQUE requires each row in the index to be unique with respect to the specified set of column_names.
name is a valid index name. To be valid, it must be different from all existing relations in the database, whether indexes, sequences, tables, etc. If no name is given, CREATE INDEX will construct one.
table_name is an existing table in the database.
column_name is an existing column in table_name.
predicate is the constraint expression for a partial index.

How DBMSs Implement Indexes

SQL indexes are stored as separate tables. An index has one or more fields to be indexed, and the primary key of the row that contains the matching row values in the table that the index is indexing. For example, in our Employees table, an index might contain the last_name and first_name fields. Those values would be sorted as they are in a phone book, and each row would also have the primary key of the field in the Employees table.

Since our table might contain several employees with the same name, an index on the name is a non-unique index. If there were multiple employees named “John Smith”, then the index would sort them all together. So, this SELECT statement:


SELECT id, last_name, first_name, phone, email
  FROM Employees
 WHERE last_name = 'Smith' AND first_name = 'John'

SELECT id, last_name, first_name, phone, email

FROM Employees

WHERE last_name = 'Smith' AND first_name = 'John'

Returns every row in the Employees table with last_name of “Smith” and first_name of “John”. (Note that each one will have a different id, since the id uniquely identifies each employee in the Employees table, whether they have the same name or not.)

Index Types

There are many types of indexes that DBMSs can use. B-tree indexes are very common, and the default index type in many DBMSs. B-tree indexes work by moving through rows in a manner analogous to a tree, starting at the trunk and moving through progressively smaller branches to find a leaf (which is the desired row).

Unless specified as UNIQUE, an index does not need to be on a column or columns whose values are unique. If a row lookup runs into multiple rows with the same indexed value, it will return all of the rows that match the given search criteria.

Partial Indexes

A partial index is an index that excludes some rows in a table based on the constraint expression specified in the WHERE clause. It is partial, then, because it doesn’t include all the rows in the table.

Suppose, for example, you regularly look up the cities that employees who work remotely work from. There is a location table with a boolean remote field. We might have a query that looks like this:


SELECT l.city, l.state, e.name
FROM locations l
JOIN employees_locations el on l.id = el.loc_id
JOIN employees e on el.emp_id = e.id
WHERE l.remote IS TRUE
ORDER BY city, state;

SELECT l.city, l.state, e.name

FROM locations l

JOIN employees_locations el on l.id = el.loc_id

JOIN employees e on el.emp_id = e.id

WHERE l.remote IS TRUE

ORDER BY city, state;

You would like to index city and state, but you only care about the perhaps 20 percent of the employees who work remotely. If 80 percent of the records in the index are for one of the the three or four office locations, and you don’t need those, having them in the index will slow it down unnecessarily. This is a use case for a partial index.

To create it:


CREATE INDEX
ON locations(city, state)
WHERE remote IS TRUE

CREATE INDEX

ON locations(city, state)

WHERE remote IS TRUE

This will restrict the members of the index to those rows where remote is true. Queries that do not specify this condition will not be able to use the index, but insertions, updates and deletes for records where remote is false won’t require an update of the index, either. So, it’s a tradeoff: better performance for narrower application.

DROP INDEX

The DROP INDEX statement removes an index from a database. The syntax is DROP INDEX index_name, where index_name is the name of the index.

Robert Rodes

Software Developer

Turning ideas into software...

Robert Rodes

SQL, Chapter 12: Indexes and CREATE INDEX

Sequential Access

Indexed Access

A Use Case for an Index

`CREATE INDEX`

How DBMSs Implement Indexes

Index Types

Partial Indexes

DROP INDEX