The talk will focus on selected challenges in modern large-scale machine learning in two settings: i) the large-model (deep learning) setting and ii) the large-data setting. Despite the success of convex methods, deep learning, where the objective is inherently highly non-convex, has enjoyed a resurgence of interest in the last few years and achieves state-of-the-art performance on a number of tasks. In the first part of the talk we present recent results on deep learning models. We provide empirical evidence that for deep networks i) most local minima recovered by SGD-type techniques are equivalent and yield similar performance on both the training and test sets, and ii) the local minima recovered by these techniques are flat and generalize well. We present an algorithm, Entropy-SGD, that is tailored to explore the flat, well-generalizing parts of the energy landscape using local entropy regularization. We also show how the flatness of the landscape can be exploited in the parallel optimization setting. The resulting algorithms empirically outperform state-of-the-art tools in terms of both running time and generalization ability.

As part of the general theme of the talk (accelerating deep network models), we next consider the case where the learning algorithm needs to scale to large data. We address the multi-class classification problem where the number of classes (k) is extremely large, with the goal of obtaining training and test time complexity logarithmic in the number of classes. We discuss a reduction of this problem to a set of binary classification problems organized in a tree structure, and demonstrate a top-down online approach, based on a new objective function, for constructing logarithmic-depth trees.
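To make the local entropy idea concrete, here is a toy one-dimensional sketch in the spirit of Entropy-SGD: an inner Langevin (SGLD) loop samples around the current iterate and averages the samples, and the outer update follows the gradient of the smoothed, "local entropy" objective, which favors wide valleys over sharp ones. The toy objective, step sizes, and loop lengths are illustrative assumptions, not the talk's actual algorithm or hyperparameters.

```python
import math
import random

def f(x):
    # Toy non-convex objective (illustrative only): oscillations plus a bowl.
    return math.sin(5 * x) * math.exp(-x) + 0.1 * (x - 3) ** 2

def grad_f(x, h=1e-5):
    # Central finite-difference gradient of the toy objective.
    return (f(x + h) - f(x - h)) / (2 * h)

def entropy_sgd_step(x, gamma=1.0, eta_inner=0.05, eps=1e-3, L=20, rng=random):
    # Inner SGLD loop: sample x' approximately from the Gibbs measure
    # proportional to exp(-f(x') - gamma/2 * (x' - x)^2), and keep an
    # exponential moving average mu of the samples.
    xp, mu = x, x
    for _ in range(L):
        xp = xp - eta_inner * (grad_f(xp) + gamma * (xp - x)) \
                + math.sqrt(eta_inner) * eps * rng.gauss(0.0, 1.0)
        mu = 0.75 * mu + 0.25 * xp
    # The gradient of the negative local entropy is gamma * (x - mu):
    # it points toward the mean of the nearby low-energy region, so wide,
    # flat valleys attract the iterate more strongly than sharp minima.
    return gamma * (x - mu)

def entropy_sgd(x0, steps=200, eta=0.1, **kw):
    # Outer loop: plain gradient descent on the smoothed objective.
    x = x0
    for _ in range(steps):
        x = x - eta * entropy_sgd_step(x, **kw)
    return x
```

The parameter `gamma` controls the width of the smoothing: small `gamma` lets the inner loop wander far and smooths the landscape heavily, while large `gamma` keeps the update close to ordinary SGD.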
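The tree reduction can be sketched as follows: each internal node holds a binary router, each leaf holds a class label, and prediction walks one root-to-leaf path, so a balanced tree over k classes needs only about log2(k) binary decisions. In this minimal sketch the routers are placeholder functions; in the talk's approach they would be binary classifiers trained top-down and online under the new objective function, which this sketch does not implement.

```python
class Node:
    """A node of the prediction tree: internal nodes route, leaves label."""
    def __init__(self, router=None, left=None, right=None, label=None):
        self.router, self.left, self.right, self.label = router, left, right, label

def build_balanced_tree(labels, make_router):
    # Recursively split the label set in half; make_router(left, right)
    # returns a binary decision function for the split (a placeholder here,
    # a learned binary classifier in the actual approach).
    if len(labels) == 1:
        return Node(label=labels[0])
    mid = len(labels) // 2
    return Node(router=make_router(labels[:mid], labels[mid:]),
                left=build_balanced_tree(labels[:mid], make_router),
                right=build_balanced_tree(labels[mid:], make_router))

def predict(tree, x, count=None):
    # Walk from root to a leaf: one binary decision per level,
    # i.e. O(log k) router evaluations for a balanced tree.
    node = tree
    while node.label is None:
        if count is not None:
            count[0] += 1
        node = node.right if node.router(x) else node.left
    return node.label
```

With k = 8 classes the walk makes exactly 3 binary decisions, versus the k classifier evaluations of a one-vs-all reduction.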
The approach extends easily to the deep learning setting and, together with the efficient optimization tools presented in the first part of the talk, can be used to develop efficient large-scale deep learning systems. Finally, the talk will also mention some of the most recent work of Anna Choromanska and her lab, in particular exploring the orthogonality of deep-network parametrizations in the context of domain adaptation and network compression, as well as deep learning architectures for source separation problems.
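As a small illustration of what an orthogonality constraint on a parametrization can look like, here is one common soft penalty, the squared Frobenius distance of the Gram matrix from the identity, ||W^T W - I||_F^2; this is a generic regularizer chosen for illustration, and the specific formulation used in the lab's work may differ.

```python
def orthogonality_penalty(W):
    # ||W^T W - I||_F^2 for a weight matrix W given as a list of rows.
    # The penalty is zero exactly when the columns of W are orthonormal.
    cols = list(zip(*W))
    k = len(cols)
    penalty = 0.0
    for i in range(k):
        for j in range(k):
            # Entry (i, j) of the Gram matrix W^T W.
            g = sum(a * b for a, b in zip(cols[i], cols[j]))
            target = 1.0 if i == j else 0.0
            penalty += (g - target) ** 2
    return penalty
```

Added to a training loss with a small weight, such a term nudges the columns of a layer's weight matrix toward an orthonormal set without enforcing it as a hard constraint.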