- All of these generative models ultimately derive from Maximum Likelihood, at least implicitly.
- The variational autoencoder sits in the “explicit” part of the tree. Remember that we had a clear loss function (the reconstruction loss)? Well, with GANs we do not have it anymore. Rather, we now have two competing loss functions that we will cover in a lot more depth later. As such, the system does not have a single analytical solution.

The key idea is that we are moving away from explicit, tractable density models into the territory of implicit, though directly estimated, approaches to training.

The two most commonly used and accepted metrics for statistically evaluating the quality of generated samples are the Inception Score (IS) and the Fréchet Inception Distance (FID). The advantage of these two scores is that they have been extensively validated to correlate highly with at least some desirable property, such as the visual appeal or “realism” of an image. The Inception Score was designed solely around the idea that the samples should be recognisable, but it has also been shown to correlate with human intuition about what constitutes a real image, as validated by workers on Amazon Mechanical Turk.

- The generated samples should be recognisable as *some* real thing, e.g. buckets or cows. This means the samples look real and that we are able to generate samples of items in our dataset.
- The generated samples should be varied and contain ideally all of the classes that were represented in the original dataset.

Although we might have further requirements of our generative model, this is a good start. The paper that introduced the Inception Score (IS) extensively validated the metric and confirmed that it indeed correlates with human perceptions of what constitutes a high-quality sample. The metric has since become very popular in the GAN research community.

Technically, computing the IS uses an exponentiated Kullback-Leibler (KL) divergence, in this case between the classifier’s conditional label distribution p(y|x) and the marginal label distribution p(y). The KL divergence, as well as the Jensen-Shannon divergence, is generally regarded as what GANs are ultimately trying to minimize between the real and the generated distribution. These are both types of distance measures that help us understand how different two distributions are in a high-dimensional space. There are some neat proofs connecting these divergences and the min-max version of the GAN.
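To make the computation concrete, here is a minimal numpy sketch of the IS, assuming `p_yx` holds the classifier’s predicted class probabilities for each generated sample (in the real metric these come from the Inception network; the function name is illustrative):

```python
import numpy as np

def inception_score(p_yx, eps=1e-16):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ) over the generated samples."""
    # Marginal class distribution p(y), averaged over all samples
    p_y = np.mean(p_yx, axis=0, keepdims=True)
    # Per-sample KL divergence between p(y|x) and p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(np.mean(kl)))

# If every sample's p(y|x) equals the marginal p(y), the KL term is zero and IS = 1
uniform = np.full((10, 5), 0.2)
print(inception_score(uniform))  # -> 1.0
```

Confident, varied predictions push the score toward the number of classes, which is why recognisability and variety both raise the IS.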

The next problem to solve is the lack of variety in the examples. Frequently, GANs only learn a handful of images for each class. The FID improves on the Inception Score by being more robust to noise and by allowing intra-class sample omissions to be detected. This is important because, if we accept the IS’s baseline, then only ever producing one type of cat technically satisfies the cat-being-generated-sometimes requirement, but it does not actually do what we want, e.g., if we had multiple breeds of cats represented. Furthermore, we want the GAN to output samples that present a cat from more than one angle and, generally, images that are quite distinct.

The mathematics of the FID are again quite complex, but the intuition is that we are looking for a generated distribution of samples that minimizes the amount of modification we would have to make to the generated distribution to make it look like the distribution of the true data.

The FID is calculated by running a number of images through the Inception network (or “embedding them”). In practice, this means we compare the intermediate representations (the feature maps), specifically the means, the variances, and the covariances, of the two distributions: the real and the generated one.

To abstract away from images: if we have a domain with well-understood classifiers, we can use their predictions as a measure of whether a particular sample looks realistic. To summarise, the FID is just a way of abstracting away from a human evaluator and allowing us to reason statistically, in terms of distributions.
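The comparison of means and covariances above corresponds to the Fréchet distance between two Gaussians fitted to the embeddings. A minimal numpy/SciPy sketch (assuming SciPy is available; `act_real` and `act_fake` are illustrative names for the two sets of embeddings):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(act_real, act_fake):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    # Numerical noise can introduce tiny imaginary components
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))
```

Identical distributions give a FID near zero; shifting or collapsing the generated distribution makes it grow, which is exactly the “amount of modification” intuition.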

Semi-supervised learning is one of the most promising areas of practical application of GANs. Unlike supervised learning, where we need a label for every example in our dataset, and unsupervised learning, where no labels are used, semi-supervised learning has labels for only a small subset of examples.

The lack of labeled datasets is one of the main bottlenecks in machine learning research and practical applications. While unlabeled data is abundant (the Internet is an essentially limitless source of unlabeled images, videos, and text), assigning class labels to it is often prohibitively expensive, impractical, and time-consuming.

Serving as a source of additional information that can be used for training, generative models proved useful in improving the accuracy of semi-supervised models.

Semi-Supervised GAN (SGAN) is a generative adversarial network whose Discriminator is a multiclass classifier. Instead of distinguishing between only two classes (“real” and “fake”), it learns to distinguish between N + 1 classes, where N is the number of classes in the training dataset with one added for the fake examples produced by the Generator.

Turning the Discriminator from a binary into a multi-class classifier may seem like a trivial change, but its implications are more far-reaching than may appear at first glance.

As the diagram in Figure above indicates, the task of distinguishing between multiple classes impacts not only the Discriminator itself but also adds complexity to the SGAN architecture, its training process, and training objectives compared to the traditional GAN.

SGAN Generator’s purpose is the same as in the original GAN: it takes in a vector of random numbers and produces fake examples whose goal is to be indistinguishable from the training dataset — no change here.

SGAN Discriminator, however, diverges considerably from the original GAN implementation. Instead of two, it receives three kinds of inputs: fake examples produced by the Generator (x^*), real examples without labels from the training dataset (x), and real examples with labels from the training dataset ((x, y)), where y denotes the label for the given example. Its goal is then to categorize the input example into its corresponding class if the example is real, or reject the example as fake (which can be thought of as a special additional class).

Table below summarizes the key takeaways about the two SGAN subnetworks.

Recall that in a regular GAN, we train the Discriminator by computing the loss for D(x) and D(x*) and backpropagating the total loss to update the Discriminator’s trainable parameters to minimize the loss. The Generator is trained by backpropagating the Discriminator’s loss for D(x^*), seeking to maximize it, so that the fake examples it produces are misclassified as real.

To train SGAN, in addition to D(x) and D(x^*), we also have to compute loss for the supervised training examples: D((x, y)). These losses correspond to the dual learning objective the SGAN discriminator has to grapple with: distinguishing real examples from the fake ones while also learning to classify real examples to their correct classes. Using the terminology from the original paper, these dual objectives correspond to two kinds of losses: the “Supervised Loss” and the “Unsupervised Loss”.
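As an illustration only (the function name and loss bookkeeping are a simplification, not the paper’s exact formulation), the two loss terms can be sketched in numpy, with class index N standing for “fake”:

```python
import numpy as np

def sgan_discriminator_loss(probs_labeled, labels, probs_real, probs_fake,
                            eps=1e-12):
    """Sketch of the SGAN Discriminator's dual objective.

    Each `probs_*` array holds softmax outputs over N + 1 classes,
    where the last index (N) is the special "fake" class.
    """
    n = probs_labeled.shape[1] - 1  # N, the number of real classes
    # Supervised loss: cross-entropy over real classes for labeled examples
    supervised = -np.mean(
        np.log(probs_labeled[np.arange(len(labels)), labels] + eps))
    # Unsupervised loss: real examples should carry their probability mass
    # on the N real classes; fake examples should land in class N
    p_real_is_real = 1.0 - probs_real[:, n]
    unsupervised = (-np.mean(np.log(p_real_is_real + eps))
                    - np.mean(np.log(probs_fake[:, n] + eps)))
    return supervised + unsupervised
```

In a real implementation these terms are typically computed on separate minibatches, and the Generator’s loss reuses the unsupervised part.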

The GANs we have seen so far were all generative models. The goal of training them was to learn to produce realistic-looking examples, and consequently the Generator network was of primary interest. The main purpose of the Discriminator network was to help the Generator improve the quality of the images it produced. At the end of training, we often disregarded the Discriminator and used only the fully trained Generator to produce realistic-looking synthetic data.

In contrast, in SGAN we care primarily about the Discriminator. The goal of the training process is to make this network into a semi-supervised classifier whose accuracy is as close as possible to a fully supervised classifier, while using only a small fraction of labeled examples for training. The goal of the Generator is to aid this process by serving as a source of additional information (the fake data it produces) that helps the Discriminator identify the correct class for each example. At the end of training, the Generator gets discarded and we use the trained Discriminator as a classifier.

Now that we have learned what motivates the SGAN and how it works, it is time to see the model in action by implementing one.

After reading this chapter, the reader will be able to implement all the key improvements of the Progressive GAN. These four innovations are:

- progressively growing and smoothly fading in higher resolution layers,
- minibatch standard deviation,
- equalized learning rate, and
- pixel-wise feature normalization.

NVIDIA research recently released a paper that managed to blow many previous state-of-the-art results out of the water: *Progressive Growing of GANs for Improved Quality, Stability, and Variation.* This paper features four fundamental innovations on what we have seen before, so let’s walk through them in order.

In technical terms, we go from a few low-resolution convolutional layers to many high-resolution ones as we train. The reason is to train the early layers first, before introducing a higher resolution, where it is harder to navigate the loss space. So we go from something simple, e.g. 4×4 trained for several steps, to something more complex, e.g. 1024×1024 trained for several epochs:

The problem in this scenario is that even introducing one layer at a time (e.g., going from 4×4 to 8×8) still delivers a massive shock to the training system. What the authors do instead is smoothly fade in those layers, as in the figure below.

So let’s load up ye olde, trusty machine learning libraries and get cracking.

```python
import numpy as np
import tensorflow as tf
import keras as K
```

In code, progressive smoothing may look something like this:

```python
def upscale_layer(layer, upscale_factor):
    '''
    Upscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    height = layer.get_shape()[1]
    width = layer.get_shape()[2]
    size = (upscale_factor * height, upscale_factor * width)
    upscaled_layer = tf.image.resize_nearest_neighbor(layer, size)
    return upscaled_layer

def smoothly_merge_last_layer(list_of_layers, alpha):
    '''
    Smoothly merges in a layer based on a threshold value alpha.
    This function assumes that all layers are already in RGB.
    This is the function for the Generator.
    :list_of_layers : items should be tensors ordered by size
    :alpha          : float \in (0, 1)
    '''
    # Hint!
    # If you are using pure TensorFlow rather than Keras, always remember scope
    last_fully_trained_layer = list_of_layers[-2]
    # now we have the originally trained layer
    last_layer_upscaled = upscale_layer(last_fully_trained_layer, 2)

    # this is the newly added layer, not yet fully trained
    larger_native_layer = list_of_layers[-1]

    # This makes sure we can run the merging code
    assert larger_native_layer.get_shape() == last_layer_upscaled.get_shape()

    # This code block should take advantage of broadcasting
    new_layer = (1 - alpha) * last_layer_upscaled + alpha * larger_native_layer

    return new_layer
```

The exact procedure is as follows:

- First, we compute the standard deviation across all the images in the batch to get a single “image” with a standard deviation for each pixel and each channel.
- Subsequently, we compute the standard deviation across all channels — to get a single feature map or matrix of standard deviations for that pixel.
- Finally, we compute the standard deviation for all pixels to get a single scalar value.

```python
def minibatch_std_layer(layer, group_size=4):
    '''
    Calculates the minibatch standard deviation for a layer.
    Assumes the layer is of float32 data type; otherwise it needs
    validation/casting.
    Note: there is a more efficient way to do this in Keras, but for
    clarity and alignment with the major implementations (for
    understanding) this was done more explicitly. Try improving it
    as an exercise.
    '''
    # Hint!
    # If you are using pure TensorFlow rather than Keras, always remember scope
    # The minibatch group size must divide (or be <=) the batch size
    group_size = K.backend.minimum(group_size, tf.shape(layer)[0])

    # Just getting some shape information so that we can use
    # it as shorthand as well as to ensure defaults
    s = list(K.backend.int_shape(layer))
    s[0] = tf.shape(layer)[0]

    # Reshaping so that we operate on the level of the minibatch;
    # in this code we assume the layer to be:
    # [Group (G), Minibatch (M), Height (H), Width (W), Channels (C)]
    # but be careful: different implementations use the Theano-specific
    # channel ordering instead
    minibatch = K.backend.reshape(layer, (group_size, -1, s[1], s[2], s[3]))
    # Center on the mean over the group               [M, H, W, C]
    minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
    # Calculate the variance of the group             [M, H, W, C]
    minibatch = tf.reduce_mean(K.backend.square(minibatch), axis=0)
    # Calculate the standard deviation over the group [M, H, W, C]
    minibatch = K.backend.sqrt(minibatch + 1e-8)
    # Take the average over feature maps and pixels   [M, 1, 1, 1]
    minibatch = tf.reduce_mean(minibatch, axis=[1, 2, 3], keepdims=True)
    # Tile across the group and the pixels            [N, H, W, 1]
    minibatch = K.backend.tile(minibatch, [group_size, s[1], s[2], 1])
    # Append as a new feature map
    return K.backend.concatenate([layer, minibatch], axis=-1)
```

Equalized learning rate is one of those deep learning dark-art techniques whose rationale is probably not clear to anyone.

Furthermore, there are many nuances to equalized learning rate that require a solid understanding not only of the implementation of RMSProp or Adam (the optimizer used) but also of weight initialization.

Explanation: we want to make sure that if any parameters need to take bigger steps to reach their optimum, because they tend to vary more, they can do that. The authors use a simple standard normal initialization and then scale the weights per layer at run-time. Some of you may be thinking that Adam already does that. Yes, Adam allows learning rates to differ between parameters, but there is a catch: Adam adjusts the backpropagated gradient by its estimated standard deviation, which makes the update roughly independent of the parameter's scale. However, Adam does not always take into account the dynamic range, i.e., how much a dimension or feature tends to vary over given minibatches. As some point out, this seems quite similar to weight initialization.

```python
def equalize_learning_rate(shape, gain, fan_in=None):
    '''
    This adjusts the weights of every layer by the constant from
    He's initializer so that we adjust for the variance in the
    dynamic range in different features
    shape  : shape of tensor (layer): [kernel, kernel, height, feature_maps]
    gain   : typically sqrt(2)
    fan_in : adjustment for the number of incoming connections
             as per Xavier's / He's initialization
    '''
    # Default value is the product of all the shape dimensions minus
    # the feature-maps dimension -- this gives us the number of
    # incoming connections per neuron
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    # This uses He's initialization constant
    std = gain / np.sqrt(fan_in)
    # Creates a constant out of the adjustment
    wscale = K.backend.constant(std, name='wscale', dtype=np.float32)
    # Draws standard-normal weights and then uses broadcasting
    # to apply the run-time adjustment
    adjusted_weights = K.backend.random_normal(shape) * wscale
    return adjusted_weights
```

Note that most networks so far use some form of normalization, typically either batch normalization or a virtual version of this technique.

Pixel normalization simply normalizes the activation magnitude at each pixel, just before the input is fed into the next layer.

The exact description is:

b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a_{x,y}^{j}\right)^2 + \epsilon}}

where a_{x,y} and b_{x,y} are the original and the normalized feature vector at pixel (x, y), N is the number of feature maps, and \epsilon is a small constant for numerical stability. Essentially, this formula normalizes each pixel's feature vector (divides by the expression under the square root).

The last thing to note is that this term is applied only in the Generator, as the explosion in activation magnitudes only leads to an arms race if *both* networks participate.

```python
def pixelwise_feat_norm(inputs, **kwargs):
    '''
    Uses pixelwise feature normalization as proposed by
    Krizhevsky et al. 2012. Returns the normalized input.
    :inputs : Keras / TF layer
    '''
    normalization_constant = K.backend.sqrt(K.backend.mean(
        inputs**2, axis=-1, keepdims=True) + 1.0e-8)
    return inputs / normalization_constant
```

The main goal of this paper is to provide an overview of the methods used in image synthesis with GAN and point out

A Generative Adversarial Network (GAN) consists of two separate neural networks: a generator G that takes a random noise vector z and outputs synthetic data G(z), and a discriminator D that takes an input x or G(z) and outputs a probability D(x) or D(G(z)) to indicate whether the input is synthetic or comes from the true data distribution.

Both the generator and the discriminator can be arbitrary neural networks. The first GAN used fully connected layers; since then, *convolution* and *transposed convolution* layers have become the core components of many GAN models.

The original way to train the generator and discriminator is to form a two-player min-max game, where the generator G tries to generate realistic data to fool the discriminator while the discriminator D tries to distinguish between real and synthetic data. The value function to be optimized is shown in Equation 1, where p_{data}(x) denotes the true data distribution and p_z(z) denotes the noise distribution.
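Written out, that value function is the standard min-max objective from the original GAN paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]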

However, when the discriminator is trained much better than the generator, D can reject the samples from G with confidence close to 1; the loss \log(1 - D(G(z))) then saturates, and G cannot learn anything from the zero gradient.
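The standard remedy, also from the original GAN paper, is the non-saturating heuristic: rather than minimizing \log(1 - D(G(z))), the generator instead maximizes

\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]

which provides much stronger gradients early in training while preserving the same fixed point.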

In the original GAN, we have no control over what the model generates, since the only input to the generator is the random noise vector z.

In order to feed more side-information and to allow for semi-supervised learning, one can add an additional task-specific auxiliary classifier to the discriminator, so that the model is optimized on the original task as well as the additional task. The architecture of such a method is illustrated in Figure 2, where C is the auxiliary classifier. Adding auxiliary classifiers allows us to use pre-trained models (e.g. image classifiers trained on ImageNet), and experiments in AC-GAN demonstrate that such a method can help generate sharper images as well as alleviate the *mode collapse* problem. Using auxiliary classifiers can also help in applications such as text-to-image synthesis and image-to-image translation.

Although GAN can transform a noise vector z into a synthetic data sample G(z), it does not allow inverse transformation. If we treat the noise distribution as a latent feature space for data samples, GAN lacks the ability to map data samples x into latent feature z. In order to allow such mapping, two concurrent works BiGAN and ALI propose to add an encoder E in the original GAN framework, as shown in Figure 3.

Let \Omega_x be the data space and \Omega_z be the latent feature space. The encoder E takes x \in \Omega_x as input and produces a feature vector E(x) \in \Omega_z as output. The discriminator D is modified to take both a data sample and a feature vector as input and calculate P(Y|x, z), where Y = 1 indicates the sample is real and Y = 0 means the data is generated by G.

The objective is thus defined as:
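In the standard BiGAN/ALI formulation (restated here in the notation above), it is a min-max game over G, E, and D:

\min_{G, E} \max_D V(D, E, G) = \mathbb{E}_{x \sim p_x}[\log D(x, E(x))] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z), z))]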

VAE-GAN proposes to combine Variational Auto-Encoder (VAE) with GAN to exploit both of their benefits, as GAN can generate sharp images but often miss some modes while images produced by VAE are blurry but have large variety. The architecture of VAE-GAN is shown in Figure 4.

The VAE part regularizes the encoder E by imposing a prior of normal distribution (e.g., z \sim N(0, 1)), and the VAE loss term is defined as:
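This is the usual KL regularizer of the VAE, pushing the encoder's distribution toward the prior:

\mathcal{L}_{prior} = D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)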

Also, VAE-GAN proposes to represent the reconstruction loss of the VAE in terms of the discriminator D. Let D_l(x) denote the representation at the l-th layer of the discriminator; a Gaussian observation model can then be defined as:
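In the standard VAE-GAN formulation this observation model is:

p\big(D_l(x) \mid z\big) = \mathcal{N}\big(D_l(x) \mid D_l(\bar{x}),\, I\big)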

where \bar{x} \sim G(z) is a sample from the generator, and I is the identity matrix. So the new VAE loss is:
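In the standard VAE-GAN formulation, this replaces the pixel-space reconstruction term with one expressed in the discriminator's feature space:

\mathcal{L}_{VAE} = \mathcal{L}_{prior} + \mathcal{L}^{D_l}_{llike}, \qquad \mathcal{L}^{D_l}_{llike} = -\mathbb{E}_{q(z \mid x)}\big[\log p\big(D_l(x) \mid z\big)\big]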

which is then combined with the GAN loss defined in Equation 1. Experiments demonstrate that VAE-GAN can generate better images than a VAE or GAN alone.

Scene Understanding: “To analyze a scene by considering the geometric and semantic context of its contents and the intrinsic relationships between them.”

Visual scene understanding can be broadly divided into two categories based on the input media: static (for an image) and dynamic (for a video) understanding. This survey specifically attends to static scene understanding of 2.5/3D visual data for indoor scenes.

While highly significant, 3D scene understanding is also remarkably challenging due to the complex interactions between objects, heavy occlusions, cluttered indoor environments, major appearance, viewpoint, and scale changes across different scenes, and the inherent ambiguity in the limited information provided by a static scene.

There exists a fundamental difference in the way a machine and a human perceive visual content. An image or a video is, in essence, a tensor with numeric values representing color (e.g., r, g, and b channels) or location (e.g., x, y, and z coordinates) information. An obvious way of processing such information is to compute local features representing color and texture characteristics. To this end, a number of local feature descriptors have been designed over the years to faithfully encode visual information.

Representation is a key element of understanding the 3D world around us. In the early days of computer vision, researchers favored parts-based representations for object description and scene understanding.

While the initial systems developed for scene analysis bear notable ideas and insights, they lack generalizability to new scenes. This was mainly due to handcrafted rules and brittle logic-based pipelines.

The preprocessing pipelines for training and validation are different. During training, the authors perform the following steps one by one:

- Randomly sample an image and decode it into 32-bit floating point raw pixel values in [0, 255].
- Randomly crop a rectangular region whose aspect ratio is randomly sampled in [3/4, 4/3] and whose area is randomly sampled in [8%, 100%], then resize the cropped region into a 224-by-224 square image.
- Flip horizontally with 0.5 probability.
- Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4].
- Add PCA noise with a coefficient sampled from a normal distribution N(0, 0.1).
- Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.
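The normalization step in particular is easy to get subtly wrong; as a small numpy sketch of just that step (the function name is illustrative):

```python
import numpy as np

# Per-channel RGB statistics from the recipe above (0-255 scale)
MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)
STD = np.array([58.393, 57.12, 57.375], dtype=np.float32)

def normalize_rgb(image):
    """image: float32 array of shape [H, W, 3] with values in [0, 255]."""
    return (image - MEAN) / STD
```

The subtraction and division broadcast over the last (channel) axis, so the same function works on a single pixel or a whole batch.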

During validation, we resize each image’s shorter edge to 256 pixels while keeping its aspect ratio. Next, we crop out the 224-by-224 region in the center and normalize RGB channels similar to training.

The weights of both convolutional and fully-connected layers are initialized with the Xavier algorithm. In particular, we set the parameters to random values uniformly drawn from [-a, a], where a = \sqrt{6/(d_{in} + d_{out})}. Here d_{in} and d_{out} are the input and output channel sizes, respectively. All biases are initialized to 0. For batch normalization layers, \gamma vectors are initialized to 1 and \beta vectors to 0.
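A minimal numpy sketch of this initialization for a single weight matrix (function name illustrative):

```python
import numpy as np

def xavier_uniform(d_in, d_out, rng=None):
    """Weights drawn uniformly from [-a, a] with a = sqrt(6 / (d_in + d_out))."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-a, a, size=(d_in, d_out))
```

The bound is chosen so that the variance of each weight is 2/(d_{in} + d_{out}), which keeps activation and gradient magnitudes roughly constant across layers.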

Nesterov Accelerated Gradient (NAG) descent is used for training.
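A small numpy sketch of the NAG update on a toy quadratic, in its common "momentum with lookahead" form (hyperparameter values are illustrative, not the paper's):

```python
import numpy as np

def nag_minimize(grad, w0, lr=0.1, momentum=0.9, steps=200):
    """Nesterov accelerated gradient: evaluate the gradient at the
    lookahead point w + momentum * v, then update the velocity."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w + momentum * v)  # gradient at the lookahead position
        v = momentum * v - lr * g
        w = w + v
    return w

# Minimize f(w) = ||w||^2 / 2, whose gradient is simply w
w_star = nag_minimize(lambda w: w, [5.0, -3.0])
```

The only difference from plain momentum is *where* the gradient is evaluated; that lookahead is what gives NAG its faster convergence on smooth problems.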

Algorithms for optimization problems require proof that they always return the best possible solution. Greedy algorithms that make the best local decision at each step are typically efficient but usually do not guarantee global optimality. Exhaustive search algorithms that try all possibilities and select the best always produce the optimal result, but usually at a prohibitive cost in terms of time complexity.

Dynamic programming combines the best of both worlds. It gives us a way to design custom algorithms that systematically search all possibilities (thus guaranteeing correctness) while storing results to avoid recomputing (thus providing efficiency). By storing the *consequences* of all possible decisions and using this information in a systematic way, the total amount of work is minimized.

Once you understand it, dynamic programming is probably the easiest algorithm design technique to apply in practice. In fact, I find that dynamic programming algorithms are often easier to reinvent than to try to look up in a book.

Dynamic programming is a technique for efficiently implementing a recursive algorithm by storing partial results. The trick is seeing whether the naive recursive algorithm computes the same subproblems over and over and over again. If so, storing the answer for each subproblem in a table to look up instead of recompute can lead to an efficient algorithm. Start with a recursive algorithm or definition. Only once we have a correct recursive algorithm do we worry about speeding it up by using a results matrix.

Dynamic programming is generally the right method for optimization problems on combinatorial objects that have an inherent left-to-right order among components. Left-to-right objects include character strings, rooted trees, polygons, and integer sequences.

In fact, we can do much better. We can explicitly store (or cache) the results of each Fibonacci computation F(k) in a table data structure indexed by the parameter k. The key to avoiding recomputation is to explicitly check for the value before trying to compute it:

```c
#define MAXN 45       /* largest interesting n */
#define UNKNOWN -1    /* contents denote an empty cell */
long f[MAXN+1];       /* array for caching computed fib values */

long fib_c(int n)
{
    if (f[n] == UNKNOWN)
        f[n] = fib_c(n-1) + fib_c(n-2);
    return (f[n]);
}

long fib_c_driver(int n)
{
    int i;    /* counter */

    f[0] = 0;
    f[1] = 1;
    for (i = 2; i <= n; i++)
        f[i] = UNKNOWN;

    return (fib_c(n));
}
```

The general method of explicitly caching results from recursive calls to avoid recomputation provides a simple way to get most of the benefits of full dynamic programming, so it is worth a more careful look. In principle, such caching can be employed on any recursive algorithm. However, storing partial results would have done absolutely no good for such recursive algorithms as *quicksort*, *backtracking*, and depth-first search, because all the recursive calls made in these algorithms have distinct *parameter values*.

We can calculate F(n) in linear time more easily by explicitly specifying the order of evaluation of the recurrence relation:

```c
long fib_dp(int n)
{
    int i;              /* counter */
    long f[MAXN+1];     /* array for caching computed fib values */

    f[0] = 0;
    f[1] = 1;
    for (i = 2; i <= n; i++)
        f[i] = f[i-1] + f[i-2];

    return (f[n]);
}
```

More careful study shows that we do not need to store all the intermediate values for the entire period of execution. Because the recurrence depends on two arguments, we only need to retain the last two values we have seen:

```c
long fib_ultimate(int n)
{
    int i;                      /* counter */
    long back2 = 0, back1 = 1;  /* last two values of f[n] */
    long next;                  /* placeholder for the sum */

    if (n == 0) return (0);

    for (i = 2; i < n; i++) {
        next = back1 + back2;
        back2 = back1;
        back1 = next;
    }
    return (back1 + back2);
}
```

How do you compute the binomial coefficients? First, {n \choose k} = n!/((n-k)!k!), so in principle you can compute them straight from factorials. However, this method has a serious drawback. Intermediate calculations can easily cause arithmetic overflow, even when the final coefficient fits comfortably within an integer.

A more stable way to compute binomial coefficients is using the recurrence relation implicit in the construction of Pascal’s triangle:

Each number is the sum of the two numbers directly above it. The recurrence relation implicit in this is that

{n \choose k} = {n-1 \choose k-1} + {n-1 \choose k}

The best way to evaluate such a recurrence is to build a table of possible values up to the size that you are interested in:

```c
long binomial_coefficient(int n, int k)
{
    int i, j;                     /* counters */
    long bc[MAXN+1][MAXN+1];      /* table of binomial coefficients */

    for (i = 0; i <= n; i++) bc[i][0] = 1;
    for (j = 0; j <= n; j++) bc[j][j] = 1;

    for (i = 1; i <= n; i++)
        for (j = 1; j < i; j++)
            bc[i][j] = bc[i-1][j-1] + bc[i-1][j];

    return (bc[n][k]);
}
```

Backtracking is a systematic way to iterate through all the possible configurations of a search space. These configurations may represent all possible arrangements of objects (permutations) or all possible ways of building a collection of them (subsets). Other situations may demand enumerating all spanning trees of a graph, all paths between two vertices, or all possible ways to partition vertices into color classes.

What these problems have in common is that we must generate each possible configuration exactly once. Avoiding both repetitions and missing configurations means that we must define a systematic generation order. We will model our combinatorial search solution as a vector a = (a_1, a_2, \cdots, a_n), where each element a_i is selected from a finite ordered set S_i. The vector can even represent a sequence of moves in a game or a path in a graph, where a_i contains the ith event in the sequence.

At each step in the backtracking algorithm, we try to extend a given partial solution a = (a_1, a_2, \dots, a_k) by adding another element at the end. After extending it, we must test whether what we now have is a solution: if so, we should print it or count it. If not, we must check whether the partial solution is still potentially extendible to some complete solution.

Backtracking constructs a tree of partial solutions, where each vertex represents a partial solution. There is an edge from x to y if node y was created by advancing from x. This tree of partial solutions provides an alternative way to think about backtracking, for the process of constructing the solutions corresponds exactly to doing a depth-first traversal of the backtrack tree. Viewing backtracking as a depth-first search on an implicit graph yields a natural recursive implementation of the basic algorithm.

Although a breadth-first search could also be used to enumerate solutions, a depth-first search is greatly preferred because it uses less space. The current state of a search is completely represented by the path from the root to the current depth-first search node. This requires space proportional to the height of the tree. In breadth-first search, the queue stores all the nodes at the current level, which is proportional to the width of the search tree. For most interesting problems, the width of the tree grows exponentially in its height.

The honest working `backtrack` code is given below:

```c
bool finished = FALSE;    /* found all solutions yet? */

void backtrack(int a[], int k, data input)
{
    int c[MAXCANDIDATES];   /* candidates for next position */
    int ncandidates;        /* next position candidate count */
    int i;                  /* counter */

    if (is_a_solution(a, k, input))
        process_solution(a, k, input);
    else {
        k = k+1;
        construct_candidates(a, k, input, c, &ncandidates);
        for (i = 0; i < ncandidates; i++) {
            a[k] = c[i];
            make_move(a, k, input);
            backtrack(a, k, input);
            unmake_move(a, k, input);
            if (finished) return;    /* terminate early */
        }
    }
}
```

Backtracking ensures correctness by enumerating all possibilities. It ensures efficiency by never visiting a state more than once.

Study how recursion yields an elegant and easy implementation of the backtracking algorithm. Because a new candidates array `c` is allocated with each recursive procedure call, the subsets of not-yet-considered extension candidates at each position will not interfere with each other.

The application-specific parts of this algorithm consist of five subroutines:

- `is_a_solution(a, k, input)` – This Boolean function tests whether the first `k` elements of vector `a` form a complete solution for the given problem.
- `construct_candidates(a, k, input, c, ncandidates)` – This routine fills an array `c` with the complete set of possible candidates for the `k`th position of `a`, given the contents of the first `k-1` positions. The number of candidates returned in this array is denoted by `ncandidates`.
- `process_solution(a, k, input)` – This routine prints, counts, or however processes a complete solution once it is constructed.
- `make_move(a, k, input)` and `unmake_move(a, k, input)` – These routines enable us to modify a data structure in response to the latest move, as well as clean up this data structure if we decide to take back the move.

These calls function as null stubs in all of this section’s

We include a global `finished`

flag to allow for premature termination, which could be set in any application-specific routine.

To really understand how backtracking works, you must see how such objects as permutations and subsets can be constructed by defining the right state spaces.

To construct all 2^n subsets, we set up an array/vector of n cells, where the value of a_i (true or false) signifies whether the ith item is in the given subset. In the scheme of our general backtrack algorithm, S_k = (true, false), and a is a solution whenever k = n. We can now construct all subsets with simple implementations of `is_a_solution()`, `construct_candidates()`, and `process_solution()`.

```c
is_a_solution(int a[], int k, int n) {
    return (k == n);
}

construct_candidates(int a[], int k, int n, int c[], int *ncandidates) {
    c[0] = TRUE;
    c[1] = FALSE;
    *ncandidates = 2;
}

process_solution(int a[], int k) {
    int i; /* counter */

    printf("{");
    for (i = 1; i <= k; i++)
        if (a[i] == TRUE) printf(" %d", i);
    printf(" }\n");
}
```

Printing out each subset after constructing it proves to be the most complicated of the three routines!

Finally, we must instantiate the call to `backtrack` with the right arguments. Specifically, this means giving a pointer to the empty solution vector, setting `k = 0` to denote that it is empty, and specifying the number of elements in the universal set:

```c
generate_subsets(int n) {
    int a[NMAX]; /* solution vector */

    backtrack(a, 0, n);
}
```
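Assembling the pieces into something runnable, here is a minimal sketch. The `subset_count` bookkeeping and the simplified signatures are my own – the text's version prints each subset instead of counting them, and threads a general `input` parameter through every call:

```c
#include <stdbool.h>

#define NMAX 20
#define MAXCANDIDATES 2

static int subset_count = 0; /* how many complete subsets seen */

static bool is_a_solution(int k, int n) { return k == n; }

static void construct_candidates(int c[], int *ncandidates) {
    c[0] = true;  /* include item k */
    c[1] = false; /* exclude item k */
    *ncandidates = 2;
}

static void process_solution(void) { subset_count++; }

static void backtrack(int a[], int k, int n) {
    int c[MAXCANDIDATES]; /* candidates for next position */
    int ncandidates, i;

    if (is_a_solution(k, n))
        process_solution();
    else {
        k = k + 1;
        construct_candidates(c, &ncandidates);
        for (i = 0; i < ncandidates; i++) {
            a[k] = c[i];
            backtrack(a, k, n); /* make_move/unmake_move are no-ops here */
        }
    }
}

/* Enumerate all subsets of {1..n}; return how many were generated. */
int generate_subsets(int n) {
    int a[NMAX]; /* solution vector */
    subset_count = 0;
    backtrack(a, 0, n);
    return subset_count;
}
```

Since each of the n positions independently takes two values, the count returned should always be 2^n.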

Enumerating all the simple s to t paths through a given graph is a more complicated problem than listing permutations or subsets. There is no explicit formula that counts the number of solutions as a function of the number of edges or vertices, because the count depends on the structure of the graph itself.

The starting point of any path from s to t is always s, so s is the only candidate for the first position. The candidates for later positions are the unvisited neighbors of the last vertex on the partial path:

```c
construct_candidates(int a[], int k, int n, int c[], int *ncandidates) {
    int i;             /* counter */
    bool in_sol[NMAX]; /* what's already in the solution? */
    edgenode *p;       /* temporary pointer */
    int last;          /* last vertex on current path */

    for (i = 1; i < NMAX; i++) in_sol[i] = FALSE;
    for (i = 1; i < k; i++) in_sol[a[i]] = TRUE;

    if (k == 1) {          /* always start from vertex s = 1 */
        c[0] = 1;
        *ncandidates = 1;
    } else {
        *ncandidates = 0;
        last = a[k-1];
        p = g.edges[last];
        while (p != NULL) {
            if (!in_sol[p->y]) {
                c[*ncandidates] = p->y;
                *ncandidates = *ncandidates + 1;
            }
            p = p->next;
        }
    }
}
```

We report a successful path whenever `a_k = t`:

```c
is_a_solution(int a[], int k, int t) {
    return (a[k] == t);
}

process_solution(int a[], int k) {
    solution_count++; /* count all s to t paths */
}
```

The solution vector A must have room for all n vertices, although most paths are likely shorter than this.

Backtracking ensures correctness by enumerating all possibilities. For example, enumerating all n! permutations of the n vertices of a graph and selecting the best one yields a correct algorithm for finding the optimal traveling salesman tour. For each permutation, we check whether all edges implied by the tour really exist in the graph G, and if so, add the weights of these edges together.

*Pruning* is the technique of cutting off the search the instant we have established that a partial solution cannot be extended into a full solution.

Exploiting symmetry is another avenue for reducing combinatorial searches. Pruning away partial solutions identical to those previously considered requires recognizing underlying symmetries in the search space.

Backtracking lends itself nicely to the problem of solving Sudoku puzzles. We will use the puzzle here to better illustrate the algorithmic technique. Our state space will be the sequence of open squares, each of which must ultimately be filled in with a number. The candidates for open squares (i,j) are exactly the integers from 1 to 9 that have not yet appeared in row i, column j, or the 3 * 3 sector containing (i,j). We backtrack as soon as we are out of candidates for a square.

The backtrack solution vector records only the value filled into each square, so we store the sequence of square positions in the `move` array of our `board` data type provided below. The basic data structures we need to support our solution are:

```c
#define DIMENSION 9                /* 9*9 board */
#define NCELLS DIMENSION*DIMENSION /* 81 cells */

typedef struct {
    int x, y; /* x and y coordinates of point */
} point;

typedef struct {
    int m[DIMENSION+1][DIMENSION+1]; /* matrix of board contents */
    int freecount;                   /* how many open squares remain? */
    point move[NCELLS+1];            /* how did we fill the squares? */
} boardtype;
```

Constructing the candidates for the next solution position involves first picking the open square we want to fill next (`next_square`), and then identifying which numbers are candidates to fill that square (`possible_values`). These routines are basically bookkeeping, although the subtle details of how they work can have an enormous impact on performance.

```c
construct_candidates(int a[], int k, boardtype *board, int c[], int *ncandidates) {
    int x, y;                   /* position of next move */
    int i;                      /* counter */
    bool possible[DIMENSION+1]; /* what values are possible for the square */

    next_square(&x, &y, board); /* which square should we fill next? */

    board->move[k].x = x;       /* store our choice of next position */
    board->move[k].y = y;

    *ncandidates = 0;

    if ((x < 0) && (y < 0)) return; /* error condition, no moves possible */

    possible_values(x, y, board, possible);
    for (i = 0; i <= DIMENSION; i++)
        if (possible[i] == TRUE) {
            c[*ncandidates] = i;
            *ncandidates = *ncandidates + 1;
        }
}
```

We must update our `board` data structure to reflect the effect of filling a candidate value into a square, as well as remove these changes should we backtrack away from this position. These updates are handled by `make_move` and `unmake_move`, both of which are called directly from `backtrack`:

```c
make_move(int a[], int k, boardtype *board) {
    fill_square(board->move[k].x, board->move[k].y, a[k], board);
}

unmake_move(int a[], int k, boardtype *board) {
    free_square(board->move[k].x, board->move[k].y, board);
}
```

One important job for these board update routines is maintaining how many free squares remain on the board. A solution is found when there are no more free squares remaining to be filled:

```c
is_a_solution(int a[], int k, boardtype *board) {
    if (board->freecount == 0)
        return (TRUE);
    else
        return (FALSE);
}
```

We print the configuration and turn off the backtrack search by setting the global `finished` flag on finding a solution:

```c
process_solution(int a[], int k, boardtype *board) {
    print_board(board);
    finished = TRUE;
}
```

Two reasonable ways to select the next square are:

- Arbitrary Square Selection – Pick an open square without much thought, say the first open square we encounter, the last one, or a random open square.
- Most Constrained Square Selection – Here, we check each of the open squares (i,j) to see how many candidate values remain for it – i.e., values that have not already been used in row i, column j, or the sector containing (i,j). We pick the square with the fewest candidates.

Although both possibilities work correctly, the second option is much, much better. Often there will be open squares with only one remaining candidate.
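One way the most-constrained selection could be sketched is below. The zero-based 9×9 grid and the helper names are my own assumptions for illustration – the book's `next_square` works on its `boardtype` instead:

```c
#include <stdbool.h>

#define DIMENSION 9

/* Count how many values 1..9 could legally fill open square (x,y),
   given a grid where 0 marks an open square. */
int count_candidates(int m[DIMENSION][DIMENSION], int x, int y) {
    bool used[DIMENSION + 1] = {false}; /* which values are taken? */
    int i, j, count = 0;

    for (i = 0; i < DIMENSION; i++) {
        used[m[x][i]] = true; /* values already in row x */
        used[m[i][y]] = true; /* values already in column y */
    }
    for (i = (x / 3) * 3; i < (x / 3) * 3 + 3; i++) /* 3x3 sector */
        for (j = (y / 3) * 3; j < (y / 3) * 3 + 3; j++)
            used[m[i][j]] = true;

    for (i = 1; i <= DIMENSION; i++)
        if (!used[i]) count++;
    return count;
}

/* Most-constrained selection: return the open square with the fewest
   remaining candidates; (*x,*y) = (-1,-1) if no open square remains. */
void next_square(int m[DIMENSION][DIMENSION], int *x, int *y) {
    int i, j, best = DIMENSION + 1;

    *x = *y = -1;
    for (i = 0; i < DIMENSION; i++)
        for (j = 0; j < DIMENSION; j++)
            if (m[i][j] == 0 && count_candidates(m, i, j) < best) {
                best = count_candidates(m, i, j);
                *x = i;
                *y = j;
            }
}
```

A square in a row already holding eight distinct values has exactly one candidate left, so this rule would fill it immediately – exactly the "forced move" behavior the text praises.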

Our final decision concerns the `possible_values` we allow for each square. We have two possibilities:

- Local Count – Allow any value from 1 to 9 that has not already been used in the square's row, column, or sector.
- Look Ahead – But what if our current partial solution has some *other* open square with no remaining candidates under the local count criterion? There is no possible way to complete such a partial solution into a full Sudoku grid, so we might as well backtrack immediately.

Successful pruning requires looking ahead to see when a solution is doomed to go nowhere, and backing off as soon as possible.

Heuristic methods provide an alternate way to approach difficult combinatorial optimization problems. Any algorithm that searches all configurations is doomed to be hopelessly slow on large instances.

In particular, we will look at three different heuristic search methods: **random sampling**, **gradient-descent search**, and **simulated annealing**. The traveling salesman problem will be our ongoing example for comparing heuristics. All three methods have two common components:

- *Solution space representation* – This is a complete yet concise description of the set of possible solutions for the problem. For traveling salesman, the solution space consists of (n-1)! elements – namely all possible circular permutations of the vertices. We need a data structure to represent each element of the solution space. For TSP, the candidate solutions can naturally be represented using an array S of n-1 vertices, where S_i defines the (i+1)st vertex on the tour starting from v_1.
- *Cost function* – Search methods need a cost or evaluation function to assess the quality of each element of the solution space. Our search heuristic identifies the element with the best possible score – either highest or lowest depending upon the nature of the problem. For TSP, the cost function for evaluating a candidate solution S should just sum up the weights of all edges (S_i, S_{i+1}), where S_{n+1} denotes v_1.

The simplest method to search in a solution space uses random sampling. It is also called the *Monte Carlo* method. We repeatedly construct random solutions and evaluate them, stopping as soon as we get a good enough solution, or (more likely) when we are tired of waiting. We report the best solution found over the course of our sampling.

True random sampling requires that we are able to select elements from the solution space *uniformly at random*. This means that each element of the solution space must have an equal probability of being the next candidate selected. Such sampling can be a surprisingly subtle problem.

```c
random_sampling(tsp_instance *t, int nsamples, tsp_solution *bestsol) {
    tsp_solution s;   /* current tsp solution */
    double best_cost; /* best cost so far */
    double cost_now;  /* current cost */
    int i;            /* counter */

    initialize_solution(t->n, &s);
    best_cost = solution_cost(&s, t);
    copy_solution(&s, bestsol);

    for (i = 1; i <= nsamples; i++) {
        random_solution(&s);
        cost_now = solution_cost(&s, t);
        if (cost_now < best_cost) {
            best_cost = cost_now;
            copy_solution(&s, bestsol);
        }
    }
}
```
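The `random_solution` call above must produce every tour with equal probability. A standard way to do this – a sketch; the flat `p[1..n]` permutation layout is my assumption, not the book's `tsp_solution` type – is a Fisher–Yates shuffle:

```c
#include <stdlib.h>

/* Shuffle p[1..n] uniformly at random (Fisher-Yates): each of the n!
   orderings is equally likely, modulo the small bias of rand()%i. */
void random_permutation(int p[], int n) {
    int i, j, tmp;

    for (i = 1; i <= n; i++) p[i] = i; /* identity permutation */
    for (i = n; i > 1; i--) {
        j = 1 + rand() % i;            /* uniform position in 1..i */
        tmp = p[i];                    /* swap p[i] and p[j] */
        p[i] = p[j];
        p[j] = tmp;
    }
}
```

The common pitfall this avoids is picking each swap partner from the full range 1..n on every step, which does *not* produce a uniform distribution over permutations.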

When might random sampling do well?

- When there is a high proportion of acceptable solutions – Finding prime numbers is a domain where random search proves successful. Generating large random primes for keys is an important aspect of cryptographic systems such as RSA. Roughly one out of every \ln n integers is prime, so only a modest number of samples need to be taken to discover primes that are several hundred digits long.
- When there is no coherence in the solution space – Random sampling is the right thing to do when there is no sense of when we are getting closer to a solution.

Consider again the problem of hunting for a large prime number. Primes are scattered quite arbitrarily among the integers, so random sampling is as good as anything else.

How does random sampling do on TSP? Pretty lousy.

Most problems we encounter, like TSP, have relatively few good solutions but a highly coherent solution space. More powerful heuristic search algorithms are required to deal effectively with such problems.

Problem: We need an efficient and unbiased way to generate random pairs of vertices to perform random vertex swaps. Propose an efficient algorithm to generate elements from the \binom{n}{2} *unordered* pairs on \{1, \dots, n\} uniformly at random.
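One answer to this exercise (a sketch, not necessarily the intended solution): draw an ordered pair of *distinct* indices uniformly and forget the order. Each unordered pair {i, j} corresponds to exactly two ordered pairs, so the result is uniform over the \binom{n}{2} possibilities:

```c
#include <stdlib.h>

/* Fill (*i, *j) with an unordered pair 1 <= *i < *j <= n, chosen
   uniformly from the n-choose-2 possibilities. */
void random_pair(int n, int *i, int *j) {
    int a = 1 + rand() % n;       /* first index, uniform in 1..n */
    int b = 1 + rand() % (n - 1); /* uniform over the other n-1 values... */
    if (b >= a) b++;              /* ...by skipping over a itself */

    if (a < b) { *i = a; *j = b; }
    else       { *i = b; *j = a; }
}
```

The naive alternative of drawing two independent values in 1..n and rejecting duplicates also works, but this version needs no rejection loop.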

A local search employs a *local neighborhood* around every element in the solution space. Think of each element x of the solution space as a vertex, with a directed edge (x, y) to every candidate solution y that is a neighbor of x. Our search proceeds from x to the most promising candidate in x's neighborhood.

We certainly do not want to explicitly construct this neighborhood graph for any sizable solution space. We are conducting a heuristic search precisely because we cannot hope to do these many operations in a reasonable amount of time.

Instead, we want a general transition mechanism that takes us to the next solution by slightly modifying the current one. Typical transition mechanisms include swapping a random pair of items or changing (inserting or deleting) a single item in the solution.

The most obvious transition mechanism for TSP would be to swap the current tour positions of a random pair of vertices S_i and S_j.

A local search heuristic starts from an arbitrary element of the solution space and then scans its neighborhood looking for a favorable *transition* to take – for TSP, a swap that lowers the cost of the tour. In a hill-climbing procedure, we take any such improving step, and stop when no neighbor improves on the current solution.

Hill-climbing and closely related heuristics such as greedy search or gradient descent search are great at finding local optima quickly, but often fail to find the globally best solution.

```c
hill_climbing(tsp_instance *t, tsp_solution *s) {
    double cost;  /* best cost so far */
    double delta; /* swap cost */
    int i, j;     /* counters */
    bool stuck;   /* did we not improve? */
    double transition();

    initialize_solution(t->n, s);
    random_solution(s);
    cost = solution_cost(s, t);

    do {
        stuck = TRUE;
        for (i = 1; i < t->n; i++)
            for (j = i + 1; j <= t->n; j++) {
                delta = transition(s, t, i, j);
                if (delta < 0) {           /* swap improves the tour */
                    stuck = FALSE;
                    cost = cost + delta;
                } else                     /* undo the swap */
                    transition(s, t, j, i);
            }
    } while (!stuck);
}
```

When does local search do well?

- When there is great coherence in the solution space – Hill climbing is at its best when the solution space is *convex*. In other words, it consists of exactly one hill.
- Whenever the cost of incremental evaluation is much cheaper than global evaluation – It costs \Theta(n) to evaluate the cost of an arbitrary n-vertex candidate TSP solution, because we must total the cost of each edge in the circular permutation describing the tour. Once that is found, however, the cost of the tour after swapping a given pair of vertices can be determined in constant time.
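To make the constant-time claim concrete, here is a sketch of incremental evaluation for swapping two *non-adjacent* tour positions (the distance-matrix representation and function names are my assumptions): only the four edges touching the two swapped vertices change, so the cost delta needs eight lookups regardless of n.

```c
#define N 6 /* number of vertices in the toy instance */

/* prev/next position on the cyclic tour, positions numbered 1..n */
static int prevp(int i, int n) { return (i == 1) ? n : i - 1; }
static int nextp(int i, int n) { return (i == n) ? 1 : i + 1; }

/* Full O(n) tour cost: sum of d[s[i]][s[i+1]] around the cycle. */
int tour_cost(int d[N][N], int s[], int n) {
    int i, cost = 0;
    for (i = 1; i <= n; i++)
        cost += d[s[i]][s[nextp(i, n)]];
    return cost;
}

/* O(1) cost change from swapping the vertices at tour positions i and
   j. Requires i and j to be non-adjacent on the cycle; adjacent swaps
   share an edge and need a separate case. */
int swap_delta(int d[N][N], int s[], int n, int i, int j) {
    int pi = prevp(i, n), ni = nextp(i, n);
    int pj = prevp(j, n), nj = nextp(j, n);

    return - d[s[pi]][s[i]] - d[s[i]][s[ni]]   /* drop old edges at i */
           - d[s[pj]][s[j]] - d[s[j]][s[nj]]   /* drop old edges at j */
           + d[s[pi]][s[j]] + d[s[j]][s[ni]]   /* add new edges at i */
           + d[s[pj]][s[i]] + d[s[i]][s[nj]];  /* add new edges at j */
}
```

A hill climber built on `swap_delta` can evaluate each of the \Theta(n^2) candidate swaps in constant time, instead of paying \Theta(n) to re-total the tour after every swap.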

If we are given a very large value of n and a very small budget of how much time we can spend searching, we are better off using it to do several incremental evaluations than a few random samples, even if we are looking for a needle in a haystack.

The primary drawback of local search is that soon there isn't anything left for us to do: once we reach a local optimum, the search is over.

How does local search do on TSP? Much better than random sampling for a similar amount of time.

Simulated annealing is a heuristic search procedure that allows occasional transitions leading to more expensive (and hence inferior) solutions. This may not sound like progress, but it helps keep our search from getting stuck in local optima.

The inspiration for simulated annealing comes from the physical process of cooling molten materials down to the solid state. In the thermodynamic theory, the energy state of a system is described by the energy state of each particle constituting it. A particle’s energy state jumps about randomly, with such transitions governed by the temperature of the system. In particular, the transition probability P(e_i, e_j, T) from energy e_i to e_j at temperature T is given by

P(e_i, e_j, T) = e^{(e_i - e_j)/(k_B T)}

where k_B is a constant called Boltzmann's constant.

What does this formula mean? Consider the value of the exponent under different conditions. The probability of moving from a high-energy state to a lower-energy state is very high. But, there is still a nonzero probability of accepting a transition into a high-energy state, with such small jumps much more likely than big ones. The higher the temperature, the more likely energy jumps will occur.

Through random transitions generated according to the given probability distribution, we can mimic the physics to solve arbitrary combinatorial optimization problems.

We provide several examples to demonstrate how these components can lead to elegant simulated annealing solutions for real combinatorial search problems.

An “independent set” of a graph G is a subset of vertices S such that there is no edge with both endpoints in S. Finding large independent sets arises in dispersion problems associated with facility location and coding theory.

The natural state space for a simulated annealing solution would be all 2^n subsets of the vertices, represented as a bit vector. As with maximum cut, a simple transition mechanism would add or delete one vertex from S.

One natural cost function for subset S might be 0 if S contains an edge, and |S| if it is indeed an independent set. This function ensures that we work towards an independent set at all times.
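As a concrete sketch of that cost function (the small adjacency-matrix representation is my own choice, not the book's graph type):

```c
#include <stdbool.h>

#define MAXN 8 /* small fixed bound for the sketch */

/* Cost of candidate subset S (s[v] true means v is in S) for the
   independent set problem on an n-vertex graph with adjacency matrix
   adj: 0 if S contains an edge, |S| if S is independent. */
int independent_set_cost(int n, bool adj[MAXN][MAXN], bool s[]) {
    int i, j, size = 0;

    for (i = 0; i < n; i++) {
        if (!s[i]) continue;
        size++;
        for (j = i + 1; j < n; j++)
            if (s[j] && adj[i][j])
                return 0; /* edge inside S: not an independent set */
    }
    return size; /* independent: score is the subset size */
}
```

A maximizing annealer over bit vectors, with single-vertex insert/delete transitions, would then chase subsets with the largest nonzero score.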

Popular methods include *genetic algorithms*, *neural networks*, and *ant colony optimization*.

The graph data structure below supports edge-weighted graphs:

```c
typedef struct {
    edgenode *edges[MAXV+1]; /* adjacency info */
    int degree[MAXV+1];      /* outdegree of each vertex */
    int nvertices;           /* number of vertices in graph */
    int nedges;              /* number of edges in graph */
    int directed;            /* is the graph directed? */
} graph;
```

Each `edgenode` is a record containing three fields: the first describes the second endpoint of the edge (`y`), the second enables us to annotate the edge with a weight (`weight`), and the third points to the next edge in the list (`next`):

```c
typedef struct edgenode {
    int y;                 /* adjacency info */
    int weight;            /* edge weight, if any */
    struct edgenode *next; /* next edge in list */
} edgenode;
```

We now describe several sophisticated algorithms using this data structure, including minimum spanning trees, shortest paths, and maximum flows. That these optimization problems can be solved efficiently is quite worthy of our respect.

A *spanning tree* of a graph G = (V, E) is a subset of edges from E forming a tree connecting all vertices of V. For edge-weighted graphs, we are particularly interested in the *minimum spanning tree* – the spanning tree whose sum of edge weights is as small as possible.

Minimum spanning trees are the answer whenever we need to connect a set of points (representing cities, homes, junctions, or other locations) by the smallest amount of roadway, wire, or pipe. Any tree is the smallest possible connected graph in terms of number of edges, while the minimum spanning tree is the smallest connected graph in terms of edge weight.

A minimum spanning tree minimizes the total length over all possible spanning trees. However, there can be more than one minimum spanning tree in a graph. Indeed, all spanning trees of an unweighted (or equally weighted) graph G are minimum spanning trees, since each contains exactly n-1 equal-weight edges. Such a spanning tree can be found using depth-first or breadth-first search. Finding a minimum spanning tree is more difficult for general weighted graphs; however, two different algorithms are presented below. Both demonstrate the optimality of certain greedy heuristics.

Prim’s minimum spanning tree algorithm starts from one vertex and grows the rest of the tree one edge at a time until all vertices are included.

Greedy algorithms make the decision of what to do next by selecting the best local option from all available choices without regard to the global structure. Since we seek the tree of minimum weight, the natural greedy algorithm for minimum spanning tree repeatedly selects the smallest weight edge that will enlarge the number of vertices in the tree.

Prim's algorithm grows the minimum spanning tree in stages, starting from a given vertex. At each iteration, we add one new vertex into the spanning tree. A greedy algorithm suffices for correctness: we always add the lowest-weight edge linking a vertex in the tree to a vertex outside it. The simplest implementation of this idea would assign each vertex a Boolean variable denoting whether it is already in the tree (the array `intree` in the code below), and then search all edges at each iteration to find the minimum-weight edge with exactly one `intree` vertex.

Our implementation is somewhat smarter. It keeps track of the cheapest edge linking every nontree vertex to the tree.

```c
int parent[MAXV+1]; /* discovery relation */

prim(graph *g, int start) {
    int i;                /* counter */
    edgenode *p;          /* temporary pointer */
    bool intree[MAXV+1];  /* is the vertex in the tree yet? */
    int distance[MAXV+1]; /* cost of adding to tree */
    int v;                /* current vertex to process */
    int w;                /* candidate next vertex */
    int weight;           /* edge weight */
    int dist;             /* best current distance from start */

    for (i = 1; i <= g->nvertices; i++) {
        intree[i] = FALSE;
        distance[i] = MAXINT;
        parent[i] = -1;
    }

    distance[start] = 0;
    v = start;

    while (intree[v] == FALSE) {
        intree[v] = TRUE;
        p = g->edges[v];
        while (p != NULL) {
            w = p->y;
            weight = p->weight;
            if ((distance[w] > weight) && (intree[w] == FALSE)) {
                distance[w] = weight;
                parent[w] = v;
            }
            p = p->next;
        }

        v = 1;
        dist = MAXINT;
        for (i = 1; i <= g->nvertices; i++)
            if ((intree[i] == FALSE) && (dist > distance[i])) {
                dist = distance[i];
                v = i;
            }
    }
}
```

A naive implementation of Prim's algorithm makes n iterations, sweeping through all m edges on each iteration – yielding an O(nm) algorithm.

But our implementation avoids the need to test all m edges on each pass. It considers only the <= n cheapest known edges, represented in the `parent` array, and the <= n edges out of the new tree vertex v that may update them. By maintaining a Boolean flag along with each vertex to denote whether it is in the tree or not, we test whether the current edge joins a tree with a non-tree vertex in constant time.

The result is an O(n^2) implementation of Prim's algorithm, and a good illustration of the power of data structures to speed up algorithms. In fact, more sophisticated priority-queue data structures lead to an O(m + n \lg n) implementation, by making it faster to find the minimum-cost edge to expand the tree at each iteration.

The minimum spanning tree itself (or its cost) can be reconstructed in two different ways. The simplest method would be to augment this procedure with statements that print the edges as they are found, or that total the weight of all selected edges. Alternately, the tree topology is encoded by the `parent` array, so it plus the original graph describes everything about the minimum spanning tree.

Kruskal’s algorithm is an alternate approach to finding minimum spanning trees that proves more efficient on sparse graphs. Like Prim’s, Kruskal’s algorithm is greedy. Unlike Prim’s, it does not start with a particular vertex.

Kruskal’s algorithm builds up connected components of vertices, culminating in the minimum spanning tree. Initially, each vertex forms its own separate component in the tree-to-be. The algorithm repeatedly considers the lightest remaining edge and tests whether its two endpoints lie within the same connected component. If so, this edge will be discarded, because adding it would create a cycle in the tree-to-be. If the endpoints are in different components, we insert the edge and merge the two components into one. Since each connected component is always a tree, we need never explicitly test for cycles.

What is the time complexity of Kruskal's algorithm? Sorting the m edges takes O(m \lg m) time. The main loop then performs m same-component tests; implemented with a graph traversal, each test takes O(n) time, giving an O(mn) algorithm overall.

However, a faster implementation results if we can implement the component test in faster than O(n) time. The union-find data structure described below does exactly this. With it, Kruskal's algorithm runs in O(m \lg m) time, dominated by the sort.

The implementation of the main routine follows fairly directly from the pseudocode:

```c
kruskal(graph *g) {
    int i;               /* counter */
    set_union s;         /* set union data structure */
    edge_pair e[MAXV+1]; /* array of edges data structure */
    bool weight_compare();

    set_union_init(&s, g->nvertices);

    to_edge_array(g, e); /* sort edges by increasing cost */
    qsort(&e, g->nedges, sizeof(edge_pair), weight_compare);

    for (i = 0; i < (g->nedges); i++) {
        if (!same_component(&s, e[i].x, e[i].y)) {
            printf("edge (%d,%d) in MST\n", e[i].x, e[i].y);
            union_sets(&s, e[i].x, e[i].y);
        }
    }
}
```

A *set partition* is a partitioning of the elements of some universal set (say the integers 1 to n) into a collection of disjoint subsets. Thus, each element must be in exactly one subset. Set partitions naturally arise in graph problems such as connected components (each vertex is in exactly one connected component) and vertex coloring.

The connected components in a graph can be represented as a set partition. For Kruskal’s algorithm to run efficiently, we need a data structure that efficiently supports the following operations:

- *Same component(v1, v2)* – Do vertices v1 and v2 occur in the same connected component of the current graph?
- *Merge components(C1, C2)* – Merge the given pair of connected components into one component in response to an edge between them.

The two obvious data structures for this task each support only one of these operations efficiently. Explicitly labeling each element with its component number enables the *same component* test to be performed in constant time, but updating the component numbers after a merger would require linear time. Alternately, we can treat merge components operation as inserting an edge in a graph, but then we must run a full graph traversal to identify the connected components on demand.

The union-find data structure represents each subset as a “backwards” tree, with pointers from a node to its parent. Each node of this tree contains a set element, and the *name* of the set is taken from the key at the root. For reasons that will become clear, we will also maintain the number of elements in the subtree rooted in each vertex v:

```c
typedef struct {
    int p[SET_SIZE+1];    /* parent element */
    int size[SET_SIZE+1]; /* number of elements in subtree i */
    int n;                /* number of elements in set */
} set_union;
```

We implement our desired component operations in terms of two simpler operations, *union* and *find*:

- Find(i) – Find the root of the tree containing element i, by walking up the parent pointers until there is nowhere to go. Return the label of the root.

- Union(i,j) – Link the root of one of the trees (say the one containing i) to the root of the tree containing the other (say j), so that find(i) now equals find(j).

We seek to minimize the time it takes to execute any sequence of unions and finds. Tree structures can become very unbalanced, so we must limit the height of our trees. Our most obvious means of control is the decision of which of the two component roots becomes the root of the combined component on each union.

To minimize the tree height, it is better to make the smaller tree the subtree of the bigger one. Why? The height of every node in the root's subtree stays the same, while the height of every node merged underneath it increases by one – so we want to raise as few nodes as possible.

The implementation details are as follows:

```c
set_union_init(set_union *s, int n) {
    int i; /* counter */

    for (i = 1; i <= n; i++) {
        s->p[i] = i;
        s->size[i] = 1;
    }
    s->n = n;
}

int find(set_union *s, int x) {
    if (s->p[x] == x)
        return (x);
    else
        return (find(s, s->p[x]));
}

void union_sets(set_union *s, int s1, int s2) {
    int r1, r2; /* roots of sets containing s1 and s2 */

    r1 = find(s, s1);
    r2 = find(s, s2);

    if (r1 == r2) return; /* already in same set */

    if (s->size[r1] >= s->size[r2]) {
        s->size[r1] = s->size[r1] + s->size[r2];
        s->p[r2] = r1;
    } else {
        s->size[r2] = s->size[r1] + s->size[r2];
        s->p[r1] = r2;
    }
}

bool same_component(set_union *s, int s1, int s2) {
    return (find(s, s1) == find(s, s2));
}
```

On each union, the tree with fewer nodes becomes the child. But how tall can such a tree get as a function of the number of nodes in it?
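The answer is at most \lg n. A sketch of the standard doubling argument (my addition, following the union-by-size rule described above):

```latex
% A node x gets deeper only when the tree containing it is merged, as
% the smaller tree, into a tree at least as large.  Each such merge at
% least doubles the size of x's tree, so after d depth increases:
2^{d} \le n \quad\Longrightarrow\quad d \le \lg_2 n .
% Hence every union-by-size tree on n nodes has height at most \lg n,
% and both find and union run in O(\lg n) time.
```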

The minimum spanning tree algorithm has several interesting properties that help solve several closely related problems:

- Maximum Spanning Trees
- Minimum Product Spanning Trees
- Minimum Bottleneck Spanning Trees

The minimum spanning tree of a graph is unique if all m edge weights in the graph are distinct.

There are two important variants of a minimum spanning tree that are not solvable with these techniques.

- Steiner Tree
- Low-degree Spanning Tree

The shortest path from s to t in an unweighted graph can be constructed using a breadth-first search from s. The minimum-link path is recorded in the breadth-first search tree, and it provides the shortest path when all edges have equal weight.

However, BFS does not suffice to find shortest paths in weighted graphs. The shortest weighted path might use a large number of edges, just as the fastest driving route from home to work might favor a longer path along faster roads over the most direct one.

Dijkstra’s algorithm is the method of choice for finding shortest paths in an edge and/or vertex-weighted graph. Given a particular start vertex s, it finds the shortest path from s to every other vertex in the graph, including your desired destination t.

Dijkstra's algorithm proceeds in a series of rounds, where each round establishes the shortest path from *s* to *some* new vertex x. Specifically, x is the vertex that minimizes dist(s, v_i) + w(v_i, x) over all unfinished 1 \leq i \leq n, where w(i, j) is the length of the edge from i to j, and dist(i, j) is the length of the shortest path between them.

This suggests a dynamic programming-like strategy. If (s,y) is the lightest edge incident to s, then this implies that dist(s,y) = w(s,y). Once we determine the shortest path to a node x, we check all outgoing edges of x to see whether there is a better path from s to some unknown vertex through x.

The basic idea is very similar to Prim's algorithm. In each iteration, we add exactly one vertex to the tree of vertices for which we know the shortest path from s. As in Prim's, we keep track of the best path seen to date for all vertices outside the tree.

The difference between Dijkstra's and Prim's algorithms is how they rate the desirability of each outside vertex. In the minimum spanning tree problem, all we cared about was the weight of the next potential tree edge. In shortest path, we want to include the outside vertex closest to s, which depends on both the weight of the connecting edge and the distance from s to the tree vertex it hangs off of.

The pseudocode actually obscures how similar the two algorithms are. In fact, the change is very minor. Below, we give an implementation of Dijkstra’s algorithm based on changing exactly three lines from our Prim’s implementation – one of which is simply the name of the function!

```c
dijkstra(graph *g, int start) {
    int i;                /* counter */
    edgenode *p;          /* temporary pointer */
    bool intree[MAXV+1];  /* is the vertex in the tree yet? */
    int distance[MAXV+1]; /* distance vertex is from start */
    int v;                /* current vertex to process */
    int w;                /* candidate next vertex */
    int weight;           /* edge weight */
    int dist;             /* best current distance from start */

    for (i = 1; i <= g->nvertices; i++) {
        intree[i] = FALSE;
        distance[i] = MAXINT;
        parent[i] = -1;
    }

    distance[start] = 0;
    v = start;

    while (intree[v] == FALSE) {
        intree[v] = TRUE;
        p = g->edges[v];
        while (p != NULL) {
            w = p->y;
            weight = p->weight;
            if (distance[w] > distance[v] + weight) { /* CHANGED */
                distance[w] = distance[v] + weight;   /* CHANGED */
                parent[w] = v;
            }
            p = p->next;
        }

        v = 1;
        dist = MAXINT;
        for (i = 1; i <= g->nvertices; i++)
            if ((intree[i] == FALSE) && (dist > distance[i])) {
                dist = distance[i];
                v = i;
            }
    }
}
```

This algorithm finds more than just the shortest path from s to t. It finds the shortest path from s to all other vertices. This defines a shortest path spanning tree rooted at s. For unweighted graphs, this would be the breadth-first search tree, but in general it gives the shortest path from s to every other vertex.

What is the running time of Dijkstra's algorithm? As implemented here, the complexity is O(n^2). This is the same running time as a proper version of Prim's algorithm; except for the extension condition, it is the same algorithm as Prim's.

The length of the shortest path from `start` to a given vertex `t` is exactly the value of `distance[t]`. To find the actual path we follow the backward `parent` pointers from `t` until we hit `start`, exactly as was done in the `find_path()` routine.

Dijkstra works correctly only on graphs without negative-cost edges. The reason is that midway through the execution we may encounter an edge with weight so negative that it changes the cheapest way to get from s to some other vertex already in the tree.

Floyd’s algorithm is best employed on an adjacency matrix data structure, which is no extravagance since we must store all n^2 pairwise distances anyway. Our `adjacency_matrix` type allocates space for the largest possible matrix, and keeps track of how many vertices are in the graph:

```c
typedef struct {
    int weight[MAXV+1][MAXV+1];  /* adjacency/weight info */
    int nvertices;               /* number of vertices in graph */
} adjacency_matrix;
```

The critical issue in an adjacency matrix implementation is how we denote the edges absent from the graph. A common convention for unweighted graphs denotes graph edges by 1 and non-edges by 0. This gives exactly the wrong interpretation if the numbers denote edge weights, for the non-edges get interpreted as a free ride between vertices. Instead, we should initialize each non-edge to `MAXINT`. This way we can both test whether it is present and automatically ignore it in shortest-path computations, since only real edges will be used, provided `MAXINT` is greater than the diameter of your graph.
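A minimal sketch of this initialization, assuming `MAXINT` stands for `INT_MAX` and that the diagonal (the cost of a vertex to itself) is zero:

```c
#include <limits.h>

#define MAXV 100
#define MAXINT INT_MAX

typedef struct {
    int weight[MAXV+1][MAXV+1];  /* adjacency/weight info */
    int nvertices;               /* number of vertices in graph */
} adjacency_matrix;

/* mark every pair as a non-edge (MAXINT), with a zero-cost diagonal */
void initialize_matrix(adjacency_matrix *g, int n) {
    int i, j;
    g->nvertices = n;
    for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
            g->weight[i][j] = (i == j) ? 0 : MAXINT;
}
```

One practical caveat: if two `MAXINT` entries are ever added together (as in Floyd’s `weight[i][k] + weight[k][j]`), the sum overflows, so real implementations often use a large-but-safe sentinel instead of `INT_MAX` itself.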

There are several ways to characterize the shortest path between two nodes in a graph. The Floyd-Warshall algorithm starts by numbering the vertices of the graph from 1 to n. We use these numbers not to label the vertices, but to order them. Define W[i,j]^k to be the length of the shortest path from i to j using only vertices numbered from 1,2,\dots,k as possible intermediate vertices.

At each iteration, we allow a richer set of possible shortest paths by adding a new vertex as a possible intermediary. Allowing k as a stop helps only if there is a short path that goes through k, so

W[i,j]^k = \min(W[i,j]^{k-1}, W[i,k]^{k-1} + W[k,j]^{k-1})

The correctness of this is somewhat subtle, and I encourage you to convince yourself of it. But there is nothing subtle about how simple the implementation is:

```c
floyd(adjacency_matrix *g) {
    int i, j;       /* dimension counters */
    int k;          /* intermediate vertex counter */
    int through_k;  /* distance through vertex k */

    for (k = 1; k <= g->nvertices; k++)
        for (i = 1; i <= g->nvertices; i++)
            for (j = 1; j <= g->nvertices; j++) {
                through_k = g->weight[i][k] + g->weight[k][j];
                if (through_k < g->weight[i][j])
                    g->weight[i][j] = through_k;
            }
}
```

The Floyd-Warshall all-pairs shortest path runs in O(n^3) time, which is asymptotically no better than n calls to Dijkstra’s algorithm. However, the loops are so tight and the program so short that it runs better in practice. It is notable as one of the rare graph algorithms that work better on adjacency matrices than adjacency lists.

The output of Floyd’s algorithm, as it is written, does not enable one to reconstruct the actual shortest path between any given pair of vertices. These paths can be recovered if we retain a parent matrix P of our choice of the last intermediate vertex used for each vertex pair (x, y). Say this value is *k*. The shortest path from *x* to *y* is the concatenation of the shortest path from *x* to *k* with the shortest path from *k* to *y*, which can be reconstructed recursively given the matrix *P*. Note, however, that most all-pairs applications need only the resulting distance matrix. These jobs are what Floyd’s algorithm was designed for.
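The recursive reconstruction described above can be sketched as follows. This hypothetical `collect_intermediates` helper assumes a global matrix `P` where `P[x][y]` holds the last intermediate vertex chosen for the pair (x, y), with 0 meaning the direct edge (x, y) is itself the shortest path:

```c
#define MAXV 100

/* hypothetical parent matrix: P[x][y] is the last intermediate vertex
   on the shortest x-to-y path, or 0 if the direct edge is shortest */
int P[MAXV+1][MAXV+1];

/* append the intermediate vertices of the shortest x-to-y path to out[],
   in path order, returning the new count of entries in out[] */
int collect_intermediates(int x, int y, int out[], int n) {
    int k = P[x][y];
    if (k == 0) return n;               /* no intermediate vertex used */
    n = collect_intermediates(x, k, out, n);
    out[n++] = k;
    return collect_intermediates(k, y, out, n);
}
```

Concatenating x, the collected intermediates, and y gives the full shortest path, exactly as the recursive argument in the text suggests.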

Floyd’s algorithm has another important application, that of computing *transitive closure*. In analyzing a directed graph, we are often interested in which vertices are reachable from a given node.

The vertices reachable from any single node can be computed using breadth-first or depth-first searches. But the whole batch can be computed using an all-pairs shortest-path. If the shortest path from i to j remains `MAXINT` after running Floyd’s algorithm, you can be sure no directed path exists from i to j.
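A minimal sketch of reading the transitive closure off the distance matrix, assuming `floyd(g)` has already been run and non-edges were initialized to `MAXINT` (`INT_MAX` here):

```c
#include <limits.h>

#define MAXV 100
#define MAXINT INT_MAX

typedef struct {
    int weight[MAXV+1][MAXV+1];  /* adjacency/weight info */
    int nvertices;               /* number of vertices in graph */
} adjacency_matrix;

/* after floyd(g), record which ordered pairs (i,j) are connected by
   some directed path: reachable[i][j] is 1 iff a path exists */
void transitive_closure(adjacency_matrix *g,
                        int reachable[MAXV+1][MAXV+1]) {
    int i, j;
    for (i = 1; i <= g->nvertices; i++)
        for (j = 1; j <= g->nvertices; j++)
            reachable[i][j] = (g->weight[i][j] < MAXINT);
}
```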

Edge-weighted graphs can be interpreted as a network of pipes, where the weight of edge (i,j) determines the *capacity* of the pipe. Capacities can be thought of as a function of the cross-sectional area of the pipe. The *network flow problem* asks for the maximum amount of flow that can be sent from vertex s to vertex t in a given weighted graph G while respecting the maximum capacities of each pipe.

While the network flow problem is of independent interest, its primary importance lies in solving other important graph problems. A classic example is bipartite matching. A matching in a graph G = (V, E) is a subset of edges E’ \subset E such that no two edges of E’ share a vertex.

Graph G is bipartite or two-colorable if the vertices can be divided into two sets, L and R, such that all edges in G have one vertex in L and one vertex in R. Many naturally defined graphs are bipartite. Matchings in these graphs have natural interpretations as job assignments or as marriages.

The largest bipartite matching can be readily found using network flow. Create a source node s that is connected to every vertex in L by an edge of weight 1. Create a sink node t and connect it to every vertex in R by an edge of weight 1. Finally, assign each edge in the bipartite graph G a weight of 1. Now, the maximum possible flow from s to t defines the largest matching in G.
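The construction above can be sketched as a small hypothetical helper. Here the L vertices are numbered 1..nl, the R vertices nl+1..nl+nr, the source s is vertex 0, and the sink t is vertex nl+nr+1 (all of this numbering is an assumption for illustration). The helper prints each unit-capacity edge and returns how many were created:

```c
#include <stdio.h>

/* build (here: print) the flow network for bipartite matching;
   edges[][2] lists the bipartite edges (l, r) with l in L, r in R */
int build_matching_network(int nl, int nr, int edges[][2], int nedges) {
    int i, t = nl + nr + 1, count = 0;

    for (i = 1; i <= nl; i++) {           /* source to each left vertex */
        printf("0 -> %d cap 1\n", i);
        count++;
    }
    for (i = 0; i < nedges; i++) {        /* the original bipartite edges */
        printf("%d -> %d cap 1\n", edges[i][0], edges[i][1]);
        count++;
    }
    for (i = nl + 1; i <= nl + nr; i++) { /* each right vertex to the sink */
        printf("%d -> %d cap 1\n", i, t);
        count++;
    }
    return count;  /* nl + nedges + nr directed edges in total */
}
```

Because every source and sink edge has capacity 1, each unit of s-to-t flow must use a distinct L vertex and a distinct R vertex, which is exactly why the maximum flow equals the maximum matching.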

Traditional network flow algorithms are based on the idea of augmenting paths, and repeatedly finding a path of positive capacity from s to t and adding it to the flow.

The key structure is the **residual flow graph**, denoted as R(G, f), where G is the input graph and f is the current flow through G. This directed, edge-weighted R(G, f) contains the same vertices as G. For each edge (i,j) in G with capacity c(i,j) and flow f(i,j), R(G, f) may contain two edges:

- An edge (i,j) with weight c(i,j) – f(i,j), if c(i,j) – f(i,j) > 0 and
- an edge (j, i) with weight f(i,j), if f(i,j) > 0.

A set of edges whose deletion separates s from t (like the two edges incident to t) is called an s-t cut. Clearly, no s to t flow can exceed the weight of the minimum such cut. In fact, a flow equal to the minimum cut is always possible.

```c
typedef struct edgenode {
    int v;                  /* neighboring vertex */
    int capacity;           /* capacity of edge */
    int flow;               /* flow through edge */
    int residual;           /* residual capacity of edge */
    struct edgenode *next;  /* next edge in list */
} edgenode;
```

We use a breadth-first search to look for any path from source to sink that increases the total flow, and use it to augment the total. We terminate with the optimal flow when no such augmenting path exists.

```c
netflow(flow_graph *g, int source, int sink) {
    int volume;  /* weight of the augmenting path */

    add_residual_edges(g);

    initialize_search(g);
    bfs(g, source);
    volume = path_volume(g, source, sink, parent);

    while (volume > 0) {
        augment_path(g, source, sink, parent, volume);
        initialize_search(g);
        bfs(g, source);
        volume = path_volume(g, source, sink, parent);
    }
}
```

Any augmenting path from source to sink increases the flow, so we can use `bfs` to find such a path in the appropriate graph. We only consider network edges that have remaining capacity, or in other words, positive residual flow. The predicate below helps `bfs` distinguish between saturated and unsaturated edges:

```c
bool valid_edge(edgenode *e) {
    if (e->residual > 0) return (TRUE);
    else return (FALSE);
}
```

Augmenting a path transfers the maximum possible volume from the residual capacity into positive flow. This amount is limited by the path-edge with the smallest amount of residual capacity, just as the rate at which traffic can flow is limited by the most congested point.

```c
int path_volume(flow_graph *g, int start, int end, int parents[]) {
    edgenode *e;            /* edge in question */
    edgenode *find_edge();  /* forward declaration */

    if (parents[end] == -1) return (0);

    e = find_edge(g, parents[end], end);

    if (start == parents[end])
        return (e->residual);
    else
        return (min(path_volume(g, start, parents[end], parents),
                    e->residual));
}

edgenode *find_edge(flow_graph *g, int x, int y) {
    edgenode *p;  /* temporary pointer */

    p = g->edges[x];
    while (p != NULL) {
        if (p->v == y) return (p);
        p = p->next;
    }
    return (NULL);
}
```

Sending an additional unit of flow along directed edge (i,j) reduces the residual capacity of edge (i,j) but increases the residual capacity of edge (j, i). Thus, the act of augmenting a path requires modifying both forward and reverse edges for each link on the path.

```c
augment_path(flow_graph *g, int start, int end, int parents[], int volume) {
    edgenode *e;            /* edge in question */
    edgenode *find_edge();  /* forward declaration */

    if (start == end) return;

    e = find_edge(g, parents[end], end);
    e->flow += volume;
    e->residual -= volume;

    e = find_edge(g, end, parents[end]);
    e->residual += volume;

    augment_path(g, start, parents[end], parents, volume);
}
```

Initializing the flow graph requires creating directed flow edges (i,j) and (j,i) for each network edge e = (i,j). Initial flows are all set to 0. The initial residual flow of (i,j) is set to the capacity of e, while the initial residual flow of (j, i) is set to 0.

The augmenting path algorithm above eventually converges on the optimal solution.

Edmonds and Karp proved that always selecting a shortest unweighted augmenting path guarantees that O(n^3) augmentations suffice for optimization. In fact, the Edmonds-Karp algorithm is what is implemented above, since a breadth-first search from the source is used to find the next augmenting path.

Proper modeling is the key to making effective use of graph algorithms. We have defined several graph properties, and developed algorithms for computing them. All told, about two dozen different graph problems are presented in the catalog. These classical graph problems provide a framework for modeling most applications.

The secret is learning to design graphs, not algorithms. We have already seen a few instances of this idea:

- The maximum spanning tree can be found by negating the edge weights and running a minimum spanning tree algorithm on the result.

- To solve bipartite matching, we constructed a special network flow graph such that the maximum flow corresponds to a maximum cardinality matching.

In this paper, the authors proposed a *long-term feature bank* that stores a rich, time-indexed representation of the entire movie. Intuitively, the long-term feature bank stores features that encode information about past and (if available) future scenes, objects, and actions. This information provides a supportive context that allows a video model, such as a 3D convolutional network, to better infer what is happening in the present.

The authors describe how their method can be used for the task of *spatial-temporal action localization*, where the goal is to detect all actors in a video and classify their actions. Most state-of-the-art methods combine a ‘backbone’ 3D CNN with a region-based person detector. To process a video, it is split into *short* clips of 2-5 seconds, which are *independently* forwarded through the 3D CNN to compute a feature map, which is then used with region proposals and region of interest (RoI) pooling to compute RoI features for each candidate actor. This approach captures only short-term information.

The central idea in this method is to extend this approach with two new concepts:

- a **long-term feature bank** that intuitively acts as a ‘memory’ of what happened during the entire video – the authors compute this as RoI features from detections at regularly sampled time steps; and
- a **feature bank operator** (FBO) that computes interactions between the short-term RoI features (describing what actors are doing now) and the long-term features. The interactions may be computed through an attentional mechanism, such as a non-local block, or by feature pooling and concatenation.

The goal of the long-term feature bank, L, is to provide relevant contextual information to aid recognition at the current time step. For the task of spatial-temporal action localization, we run a person detector over the entire video to generate a set of detections for each frame.
