Changing a data structure in a slow program can work the same way an organ transplant does in a sick patent. Important classes of abstract data types such as containers, dictionaries, and priority queues, have many different but functionally equivalent data structures that implement them.

## Continuous vs. Linked Data Structures

- Contiguously-allocated structures are composed of single slabs of memory and include arrays, matrices, heaps, and hash tables.
- Linked data structures are composed of distinct chunks of memory bound together by pointers, and include lists, trees, and graph adjacency lists.

### Arrays

Arrays are structures of **fixed-size** data records such that each element can be efficiently located by its index or (equivalently) address.

Advantages of contiguously-allocated arrays include:

- Constant-time access given the index (random access)
- Space efficiency
- Memory locality – Physical continuity between successive data accesses helps exploit the high-speed
*cache memory*on modern computer architectures.

The downside of arrays is that we cannot adjust their size in the middle of a program’s execution.

Actually, we can efficiently enlarge arrays as we need them, through the miracle of *dynamic arrays*. Suppose we start with an array of size 1 and double its size from m to 2m each time we run out of space. This doubling process involves allocating to the lower half of the new one and returning the space used by the old array to the storage allocation system.

The primary thing lost using dynamic arrays is the guarantee that each array access takes constant time *in the worst case*. Now all the queries will be fast, except for those relatively few queries triggering array doubling. What we get instead is a promise that the *n*th array access will be completed quickly enough that the *total* effort expended so far will still be O(n). Such *amortized* guarantees arise frequently in the analysis of data structures.

### Pointers and Linked Structures

*Pointers* are the connections that hold the pieces of linked structures together. A variable storing a pointer to a given data item can provide more freedom than storing a copy of the item itself.

Pointer syntax and power differ significantly across programming languages, so we begin with a quick review of pointers in C language.

[cce_c]typedef struct list { item_type item; /* data item */ struct list *next; /* point to successor */ } list;[/cce_c]- Each node in our data structure (here list) contains one or more data fields (here item) that retain the data that we need to store.
- Each node contains a pointer field to at least one other node (here next).
- Finally, we need a pointer to the head of the structure, so we know where to access it.

#### Searching a List

[cce_c]void search_list(list **l, item_type x) { if (l == NULL) return(NULL); if (l->item == x) return(l); else return( search_list(l->next, x) ); }[/cce_c]The list is the simplest linked structure. The three basic operations supported by lists are searching, insertion, and deletion.

#### Insertion into a List

[cce_c]void insert_list(list **l, item_type x) { list *p; /* temporary pointer */ p = malloc( sizeof(list) ); p->item = x; p->next = *l; *l = p; }[/cce_c]#### Deletion From a List

Deletion from a linked list is somewhat more complicated. First, we must find the pointer to the predecessor of the item to be deleted. We do this recursively:

[cce_c]list *predecessor_list(list *l, item_type x) { if ((l == NULL) || (l->next == NULL)) { // predecessor sought on null list return(NULL); } if ((l->next)->item == x) return(l); else return( predecessor_list(l->next, x) ); }[/cce_c]The predecessor is needed because it points to the doomed node, so its next pointer must be changed. Special care must be taken to rest the pointer to the head of the list(l) when the first element is deleted:

[cce_c]delete_list(list **l, item_type x) { list *p; list *pred; list *search_list(), *predecessor_list(); p = search_list(*l, x); if (p != NULL) { pred = predecessor_list(*l, x); if (pred == NULL) /* splice out of list */ *l = p->next; else pred->next = p->next; free(p); /* free memory used by node */ } }[/cce_c]#### Comparison

The relative advantages of linked lists over static arrays include:

- Overflow on linked structures can never occur unless the memory is actually full.
- Insertions and deletions are simpler than for contiguous (array) lists.
- With large records, moving pointers is easier and faster than moving the items themselves.

While the relative advantages of arrays include:

- Linked structures require extra space for storing pointer fields.
- Linked lists do not allow efficient random access to items.

Our final though about these fundamental structures is that they can be thought of as recursive objects:

- Lists – Chopping the first element off a linked list leaves a smaller linked list.
- Arrays – Splitting the first k elements off of an n element array gives two smaller arrays, of size k and n – k, respectively.

This insight leads to simpler list processing, and efficient divide-and-conquer algorithms such as quicksort and binary search.

## Stacks and Queues

We use the term container to denote a data structure that permits storage and retrieval of data independent of content. By contrast, dictionaries are abstract data types that retrieve based on key values or contents.

Containers are distinguished by the particular retrieval order they support. In the two most important types of containers, this retrieval order depends on the insertion order:

- Stacks – Support retrieval by last-in, first-out (LIFO) order. Stacks are simple to implement and very efficient. For this reason, stacks are probably the rigth container to use when retrieval order doesn’t matter at all, such as when processing batch jobs. The put and get operations for stacks are usually called push and pop:

LIFO order arises in many real-world contexts. Algorithmically, LIFO tends to happen in the course of executing recursive algorithms.

- Queues – Support retrieval in the first in, first out (FIFO) order. This is surely the fairest way to be processed in FIFO order to control waiting times for services. You want the container holding jobs to be processed in FIFO order to minimize the maximum time spent waiting.

Queues are somewhat tricker to implement than stacks and thus are most appropriate for applications (like certain simulations) where th order is important. The put and get opeartions for queues are usually called enqueue and dequeue.

- Enqueue(x, q):
- Dequeue(q):

Stacks and queues can be effectively implemented using either arrays or linked lists. The key issue is whether an upper bound on the size of the container is known in advance, thus permitting the use of a statically-allocated array.

## Dictionaries

The dictionary data type permits access to data items by content. You stick an item into a dictionary so you can find it when you need it.

The primary operations of dictionary support are:

- Search(D, k) – Given a search key k, return a pointer to the element in dictionary D whose key value is k, if one exists.
- Insert(D, x) – Given a data item x, add it to the set in the dictionary D.
- Delete(D, x) – Given a pointer to a given data item x in the dictionary D, remove it from D.

Certain dictionary data structures also efficiently support other useful operations:

- Max(D) or Min(D) – Retrieve the item with largest (or smallest) key from D.
- Predecessor(D, x) or Successor(D, x) – Retrieve the item from D whose key is immediately before (or after) x in sorted order.

By defining such problems in terms of abstract dictionary operations, we avoid the details of the data structure’s representation and focus on the task at hand.

#### Stop and Think: Comparing Dictionary Implementations (I)

*Problem:* What are the asymptotic worst-case running times fro each of the seven fundamental operations when the data structure is implemented as:

- An unsorted array.
- A sorted array.

#### Stop and Think: Comparing Dictionary Implementations (II)

*Problem:* What is the asymptotic worst-case running times for each of the seven fundamental dictionary operations when the data sturcutre is implemented as

- A singly-linked unsorted list.
- A doubly-linked unsorted list.
- A singly-linked sorted list.
- A doubly-linked sorted list.

## Binary Search Trees

We have seen data structures that allow fast search or flexible update, but not fast search and flexible update.

Binary search requires that we have fast access to two elements — specifically the median elements above the below the given node. To combine these ideas, we need a “linked list” with two pointers per node. This is the basic idea behind binary search trees.

A rooted binary tree is recursively defined as either being (1) empty, or (2) consisting of a node called the root, together with two rooted binary trees called the left and right subtrees, respectively.

A binary search tree labels each node in a binary tree with a single key such that for any node labeled x, all nodes in the left subtree of x have keys < x while all nodes in the right subtree of x have keys > x. This search tree labeling scheme is very special. For any binary tree on n nodes, and any set of n keys, there is eactly one labeling that makes it a binary search tree.

### Implementing Binary Search Trees

Binary tree nodes have left and right pointer fields, an (optional) parent pointer, and a data field.

[cc_c]typedef struct tree { item_type item; /* data item */ struct tree *parent; /* pointer to parent */ struct tree *left; /* pointer to left child */ struct tree *right; /* pointer to right child */ } tree;[/cc_c]The basic operations supported by binary trees are searching, traversal, insertion, and deletion.

#### Searching in a Tree

This recursive structure yields the recursive search algorithm below:

[cc_c]tree *search_tree(tree *l, item_type x) { if (l == NULL) return(NULL); if (l->item == x) return(l); if (x < l->item) return( search_tree(l->left, x) ); else return( search_tree(l->right, x) ); }[/cc_c]This search algorithm runs in O(h) time, where h denotes the height of the tree.

#### Finding Minimum and Maximum Elements in a Tree

[cce_c] tree *find_minimum(tree *t) { tree *min; /* pointer to minimum */ if (t == NULL) return(NULL); min = t; while(min->left != NULL) min = min->left; return(min); }[/cce_c]#### Traversal in a Tree

Visiting all the nodes in a rooted binary tree proves to be an important component of many algorithms. It’s a special case of traversing all nodes and edges in a graph.

Thus, visiting the nodes recursively in accord with such a policy produces an *in-order* traversal of the search tree:

Alternative traversal orders come from changing the position of process_item relative to the traversals of the left and right subtrees. Processing the item first yields a *pre-order* traversal, while processing it last gives a *post-order* traversal. These make relatively little sense with search trees, but prove useful when the rooted tree represents arithmetic or logical expressions.

#### Insertion in a Tree

There is only one place to insert an item x into a binary search tree T where we know we can find it again. We must replace the [cci_c]NULL[/cci_c] pointer found in T after an unsuccessful query for the key k.

This implementation uses recursion to combine the search and node insertion stages of key insertion. The tree arguments to [cci_c]insert_node[/cci_c] are (1) a pointer [cci_c]l[/cci_c] to the pointer linking the search subtree to the rest of the tree, (2) the key [cci_c]x[/cci_c] to be inserted, and (3) a [cci_c]parent[/cci_c] pointer to the parent node containing [cci_c]l[/cci_c]. The node is allocated and linked in on hitting the [cci_c]NULL[/cci_c] pointer.

[cc_c] insert_tree(tree **l, item_type x, tree *parent) { tree *p; /* temporary pointer */ if (*l == NULL) { p = malloc(sizeof(tree)); /* allocate new node */ p->item = x; p->left = p->right = NULL; p->parent = parent; *l = p; return; } if (x < (*l)->item) insert_tree(&((*l)->left), x, *l); else insert_tree(&((*l)->right), x, *l); } [/cc_c]#### Deletion from a Tree

Deletion is somewhat trickier than insertion, because removing a node means appropriately linking its two descendant subtrees back into the tree somewhere else. Leaf nodes have no children, and so may be deleted by simply clearing the pointer to the given node.

The case of the doomed node having on child is also straightforward. There is one parent and one grandchild, and we can link the grandchild directly to the parent without violating the in-order labeling property of the tree.

But what of a a to-be-deleted node with two children? **Our solution is to relabel this node with the key of its immediate successor in sorted order.** This successor must be the smallest value in the right subtree, specifically the leftmost descendant in the right subtree. Moving this to the point of deletion results in a properly-labeled binary search tree, and reduces our deletion problem to physically removing a node with at most one child — a case that has been solved above.

The worst-case complexity analysis is as follows. Every deletion requires the cost of a most two search options, each taking O(h) time where h is the height of the tree, plus a constant amount of pointer manipulation.

### How Good Are Binary Search Trees?

Unfortunately, bad things can happen when building trees through insertion. The data structure has no control over the order of insertion. Consider what happens if the user inserts the keys in sorted order.

**This argument is an important example of the power of ****randomization****.** We can often develop simple algorithms that offer good performance with high probability.

### Balanced Search Trees

What would be better is an insertion/deletion procedure which *adjusts* the tree a little after each insertion, keeping it close enough to be balanced so the maximum height is logarithmic. Sophisticated *balanced* binary search tree data structures have been developed that guarantee the height of the tree always to be O(log n).

From an algorithm design viewpoint, it is important to know that these trees exist and that they can be used as black boxes to provide an efficient dictionary implementation. When figuring the costs of dictionary operations for algorithm analysis, we can assume the worst-case complexities of balanced binary trees to be a fair measure.

#### Stop and Think: Exploiting Balanced Search Trees

Problem: You are given the task of reading [cci_c]n[/cci_c] numbers and then printing them out in sorted order. Suppose you have access to a balanced dictionary data structure, which supports the operations search, insert, delete, minimum, maximum, successor, and predecessor each in O(log n) time.

- How can you sort in O(n log n) time using only insert and in-order traversal?
- How can you sort in O(n log n) time using only minimum, successor, and insert?
- How can you sort in O(n log n) time using only minimum, insert, delete?

## Priority Queues

Priority queues are data structures that provide more flexibility than simple sorting, because they allow new elements to enter a system at arbitrary intervals. It is much more cost-effective to insert a new job into a priority queue than to re-sort everything on each such arrival.

The basic priority queue supports three primary operations:

- Insert(Q, x) –
- Find-Minimum(Q) or Find-Maximum(Q) –
- Delete-Minimum(Q) or Delete-Maximum(Q) –

#### Stop and Think: Basic Priority Queue Implementations

*Problem*: what is the worst-case time complexity of the three basic priority queue operations (insert, find-minimum, and delete-minimum) when the basic data structure is

- An unsorted array.
- A sorted array.
- A balanced binary search tree. (Heap?)

## Hashing and Strings

Hash tables are a *very* practical way to maintain a dictionary. They exploit the fact that looking an item up in an array takes constant time once you have its index. A hash function is a mathematical function that maps keys to integers. We will use hte value of our hash functions as an index into an array, and store our item at that position.

The first step of the hash function is usually to map each key to a big integer. Let [katex]\alpha[/katex] be the size of the alphabet on which a given string [katex]S[/katex] is written. Let [cci]char(c)[/cci] be a function that maps each symbol of the alphabet to a unique integer from [katex]0[/katex] to [katex]\alpha-1[/katex]. The function

[katex display=true]H(S)=\sum_{i=0}^{|S|-1}\alpha^{m-(i+1)} \times char(s_{i+j})[/katex]maps each string to a unique (but large) integer by treating the characters of the string as “digits” in a base-[katex]\alpha[/katex] number system.

The result is unique identifier numbers, but they are so large they will quickly exceed the number of slots in our hash table (denoted by [katex]m[/katex]). We must reduce this number to an integer between 0 and [katex]m – 1[/katex], by taking the reminder [cci]H(S) mod m[/cci]. If the table size is selected with enough finesse (ideally [katex]m[/katex] is a large prime not too close to [katex]2^i – 1[/katex]), the resulting hash values should be fairly uniformly distributed.

### Collision Resolution

No matter how good our hash function is, we had better be prepared for collisions, because two distinct keys will be ocasionally hash to the same value. *Chaining* is the easiest approach to collision resolution. Represent the hash table as an array of m linked lists. The [katex]i[/katex]th list will contain all the items that hash to the value of [katex]i[/katex].

Chaining is very natural, but devotes a considerable amount of memory to pointers. This is space that could be used to make the table larger, and hence the “lists” smaller.

The alternative is something called *open addressing*. The hash table is maintained as an array of elements (not buckets), each initialized to null. On an insertion, we check to see if the desired position is empty. If so, we insert it. If not, we must find some other place to insert it instead. The simplest possibility (called *sequential probing)* inserts the item in the enxt open spot in the table. If the table is not too full, the contiguous runs of items should be fairly small, hence this location *should* be only a few slots from its intended position.

Deletion in an open addressing scheme can get ugly, since removing one element might break a chain of insertions, making some leements inaccessible. We have no alternative but to reinsert all the items in the run folloing the new hole.

Chaining and open addressing both require O(m) to initialize an m-element hash table to null elements prior to the first insertion. Traversing all elements in the table takes O(n+m) time for chaining, becaseu we have to scan all m buckets looking for elements, even if the actual number of inserted items is small. This reduced to O(m) time for open addressing, since n must be at most m.

Pragmatically, a hash table is often the best data structure to maintain a dictionary.

### Efficient String Matching via Hashing

The most fundamental operation on text strings is substring search, namely:

*Problem:* Substring Pattern Matching

*Input:* A text string *t* and a pattern string *p*.

*Output:* Does t contain the pattern p as substring, and if so where?

Here we give a linear *expected-time* algorithm for string matching, called the Rabin-Karp algorithm.

### Duplicate Detection via Hashing

The key idea of hashing is to represent a large object (be it a key, a string, or a substring) using a single number. The goal is a representation of the large object by an entity that can be manipulated in constant time, such that it is relatively unlikely that two different large objects map to the same value.

Consider the following problems with nice hashing solutions:

*Is a given document different from all the rest in a large corpus?**Is part of this document plagiarized from a document in a large corpus?**How can I convince you that a file isn’t changed?*

## Specialized Data Structures

*String data structures**Geometric data structures**Graph data structures**Set data structures*