Indexes speed up searches for data. While they are very effective at doing this, they also represent an extra operation when inserting, updating and deleting data, since every table update also requires an update of the index.
Sequential Access
To find a row that meets some set of criteria in a single, unsorted column of raw data, it is necessary to look at the rows one after the other until the row meeting the criteria is found. This is called This is called a sequential scan. Sequential scans are slow and scale poorly.
Indexed Access
An index allows us to categorize rows in some way, and quickly narrow down our search based on that categorization. As an example, suppose we have a phone book and want to look up a number for someone named Jane Smith. We open the book a little after the middle and find the name Peterson. Then, we flip about half the remaining pages over and find Szymanski. Next, we go back about 10 pages and find David Smith. Two pages later, we find Jane Smith.
This is very similar to the way that database indexes work.
A Use Case for an Index
Suppose I work for a large corporation, let’s say with 250,000 employees. It’s my job to maintain the employee phone and email directory.
My employee table has this structure:
Column | Type | Constraints |
---|---|---|
id | integer | PRIMARY KEY |
last_name | text | |
first_name | text | |
phone | text | |
text |
I want to be able to provide the ability to look up phone numbers and emails. I can already do this easily if I know the id
number:
1 2 3 4 5 |
SELECT last_name, first_name, phone, email FROM Employees WHERE id = 221000 |
This will return the name, phone and email almost instantaneously. But if I try this:
1 2 3 4 5 |
SELECT last_name, first_name, phone, email FROM Employees WHERE last_name = 'Rodes' AND first_name = 'Robert' |
This will return the same data, but it will take perhaps as long as a minute.
The reason for the discrepancy is that primary key fields are automatically indexed, so the id
lookup can use indexed access. However, the last_name
and first_name
fields are not indexed, so sequential access is necessary. The DBMS must scan the entire table, checking each row one by one, to see whether the name fields both match.
We’ll say that the lookup requirement specifies that lookup by name is necessary, this is a good candidate for an index. We can create a composite index that sorts by these two fields.
CREATE INDEX
SQL uses CREATE INDEX
to create an index. `CREATE INDEX has this basic syntax:
( column_name [, …] )
[ WHERE predicate ]
Where:
UNIQUE
requires each row in the index to be unique with respect to the specified set of column_names.- name is a valid index name. To be valid, it must be different from all existing relations in the database, whether indexes, sequences, tables, etc. If no name is given,
CREATE INDEX
will construct one. - table_name is an existing table in the database.
- column_name is an existing column in table_name.
- predicate is the constraint expression for a partial index.
How DBMSs Implement Indexes
SQL indexes are stored as separate tables. An index has one or more fields to be indexed, and the primary key of the row that contains the matching row values in the table that the index is indexing. For example, in our Employees
table, an index might contain the last_name
and first_name
fields. Those values would be sorted as they are in a phone book, and each row would also have the primary key of the field in the Employees table.
Since our table might contain several employees with the same name, an index on the name is a non-unique index. If there were multiple employees named “John Smith”, then the index would sort them all together. So, this SELECT
statement:
1 2 3 4 5 |
SELECT id, last_name, first_name, phone, email FROM Employees WHERE last_name = 'Smith' AND first_name = 'John' |
Returns every row in the Employees
table with last_name
of “Smith” and first_name
of “John”. (Note that each one will have a different id
, since the id
uniquely identifies each employee in the Employees
table, whether they have the same name or not.)
Index Types
There are many types of indexes that DBMSs can use. B-tree indexes are very common, and the default index type in many DBMSs. B-tree indexes work by moving through rows in a manner analogous to a tree, starting at the trunk and moving through progressively smaller branches to find a leaf (which is the desired row).
Unless specified as UNIQUE
, an index does not need to be on a column or columns whose values are unique. If a row lookup runs into multiple rows with the same indexed value, it will return all of the rows that match the given search criteria.
Partial Indexes
A partial index is an index that excludes some rows in a table based on the constraint expression specified in the WHERE
clause. It is partial, then, because it doesn’t include all the rows in the table.
Suppose, for example, you regularly look up the cities that employees who work remotely work from. There is a location
table with a boolean remote
field. We might have a query that looks like this:
1 2 3 4 5 6 7 8 |
SELECT l.city, l.state, e.name FROM locations l JOIN employees_locations el on l.id = el.loc_id JOIN employees e on el.emp_id = e.id WHERE l.remote IS TRUE ORDER BY city, state; |
You would like to index city and state, but you only care about the perhaps 20 percent of the employees who work remotely. If 80 percent of the records in the index are for one of the the three or four office locations, and you don’t need those, having them in the index will slow it down unnecessarily. This is a use case for a partial index.
To create it:
1 2 3 4 5 |
CREATE INDEX ON locations(city, state) WHERE remote IS TRUE |
This will restrict the members of the index to those rows where remote
is true
. Queries that do not specify this condition will not be able to use the index, but insertions, updates and deletes for records where remote
is false
won’t require an update of the index, either. So, it’s a tradeoff: better performance for narrower application.
DROP INDEX
The DROP INDEX
statement removes an index from a database. The syntax is DROP INDEX index_name
, where index_name is the name of the index.