In this work, the authors make several surprising observations that contradict common beliefs. Their results have several implications: 1) training a large, over-parameterized model is not necessary to obtain an efficient final model; 2) the learned “important” weights of the large model are not necessarily useful for the small pruned model; 3) the pruned architecture itself, rather than a set of inherited “important” weights, is what leads to the efficiency benefit in the final model, which suggests that some pruning algorithms could be seen as performing network architecture search.
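
To make implications (2) and (3) concrete, the following is a minimal PyTorch sketch, not taken from the paper: it performs simple L1-norm channel pruning on a single convolution and contrasts a pruned layer that inherits the surviving weights with an identically shaped layer that is re-initialized for training from scratch. The helper names (`l1_channel_scores`, `prune_conv`) and the keep ratio are illustrative assumptions, not the paper's protocol.

```python
import torch
import torch.nn as nn

def l1_channel_scores(conv: nn.Conv2d) -> torch.Tensor:
    """Score each output channel by the L1 norm of its filter weights."""
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def prune_conv(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Build a smaller Conv2d keeping the highest-scoring output channels,
    copying ("inheriting") the surviving weights from the large model."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.topk(l1_channel_scores(conv), n_keep).indices
    small = nn.Conv2d(conv.in_channels, n_keep,
                      conv.kernel_size, conv.stride, conv.padding,
                      bias=conv.bias is not None)
    with torch.no_grad():
        small.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            small.bias.copy_(conv.bias[keep])
    return small

# A large, over-parameterized layer trained beforehand (stand-in for a full model).
big = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# (a) Pruned layer that inherits the "important" weights, to be fine-tuned.
inherited = prune_conv(big, keep_ratio=0.25)

# (b) The same pruned *architecture*, re-initialized and trained from scratch.
from_scratch = nn.Conv2d(3, inherited.out_channels, kernel_size=3, padding=1)

# The paper's observation: after training both with a standard schedule,
# (b) matches or beats (a), suggesting the architecture, not the inherited
# weights, carries the benefit.
```

Note that in a full network, removing output channels of one layer also requires removing the corresponding input channels of the following layer; the sketch prunes a single layer to keep the contrast between (a) and (b) readable.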