Tomer Galanti
Research
Using both theory and experiments, my research develops realistic models of contemporary learning settings to guide practice in deep learning, LLMs, and AI. Some of my recent and past work explores:
- What types of representations are learned by neural networks? In [1], we showed for the first time that the rank of the weight matrices in modern neural networks (with residual connections, self-attention, convolutional layers, etc.) is controlled by the training hyperparameters. In [2, 3], we demonstrated how neural collapse propagates into intermediate layers of trained classifiers, and in [4], we showed that self-supervised learning algorithms produce representations that cluster by semantic attributes. Recently, in [5], we introduced a framework unifying key deep learning phenomena, including neural collapse and the neural Ansatz, by explaining how latent representations, weights, and neuron gradients align during training.
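To make the clustering phenomenon concrete, here is a minimal sketch of one standard proxy for neural collapse: the ratio of within-class to between-class feature variability, which approaches zero as class features collapse to their means. The function name and the toy Gaussian features are illustrative assumptions, not code from the cited papers.

```python
# Sketch: within-class variability collapse (an NC1-style metric) on
# penultimate-layer features. Values near 0 indicate that each class's
# features have collapsed tightly around the class mean.
import numpy as np

def nc1_ratio(features, labels):
    """Ratio of within-class to between-class sum of squares."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += ((class_feats - class_mean) ** 2).sum()
        between += len(class_feats) * ((class_mean - global_mean) ** 2).sum()
    return within / between

# Toy example: two tight, well-separated clusters give a ratio near 0.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.01, (50, 8)), rng.normal(5, 0.01, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
print(nc1_ratio(feats, labels))  # tiny value: the clusters are collapsed
```

In practice this metric is tracked over training epochs; collapse corresponds to the ratio decaying toward zero in the terminal phase of training.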
- How do various properties of latent representations affect a model's compressibility and its ability to generalize and adapt to different tasks? For example, in [6, 7, 8, 9], we developed theory and algorithms linking clustering properties, such as neural collapse, to the ability of pre-trained classifiers to adapt to downstream tasks with minimal data. On compressibility, the results in [1] show how the training hyperparameters control the network's compressibility. Furthermore, in [10], we showed how hard-coded architectural sparsity, such as in convolutional networks, enhances generalization.
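The link between clustering and few-shot adaptation can be illustrated with a nearest-class-mean classifier: when pre-trained features cluster tightly by class, the mean of a handful of support examples already locates each class. This is a hedged toy sketch of that intuition, not the algorithms from [6, 7, 8, 9]; the data and function names are invented for illustration.

```python
# Sketch: if a feature map clusters inputs by class, a downstream task can
# be learned from a few labeled examples by storing class means and
# classifying queries by nearest mean in feature space.
import numpy as np

def fit_class_means(features, labels):
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(features, means):
    classes = sorted(means)
    dists = np.stack([np.linalg.norm(features - means[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

# Toy downstream task with tight clusters: 5 examples per class suffice.
rng = np.random.default_rng(1)
support = np.vstack([rng.normal(0, 0.1, (5, 16)), rng.normal(3, 0.1, (5, 16))])
support_y = np.array([0] * 5 + [1] * 5)
query = np.vstack([rng.normal(0, 0.1, (20, 16)), rng.normal(3, 0.1, (20, 16))])
query_y = np.array([0] * 20 + [1] * 20)

means = fit_class_means(support, support_y)
acc = (predict(query, means) == query_y).mean()
print(acc)  # near-perfect accuracy from only 5 examples per class
```

The same construction fails when clusters overlap, which is why the degree of collapse in the pre-trained features matters for downstream sample efficiency.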
- I also develop theory and algorithms for effectively training and accelerating LLMs. In [11], we introduced the Distributed Speculative Inference (DSI) algorithm, which accelerates LLM inference using multiple GPUs. DSI introduces a novel form of task parallelism, Speculation Parallelism (SP), which balances computational resources against latency by overlapping target and drafter instances. In [12], we demonstrated how to effectively train Autoregressive Decision Trees as coherent and grammatically correct language models. Finally, in [13], we proposed the “Fair Language Model Dilemma,” showing that increasing weight decay leads models to neglect low-frequency tokens, which is detrimental to performance since low-frequency tokens constitute the majority of the vocabulary.
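For context, here is a minimal greedy sketch of the single-device speculative decoding idea that DSI generalizes: a cheap drafter proposes a block of tokens, and the target model keeps the longest prefix it agrees with, guaranteeing at least one new token per step. The toy integer "models" and the greedy acceptance rule are illustrative assumptions; DSI's contribution is running drafter and target instances in parallel across GPUs, which this sketch does not implement.

```python
# Sketch of speculative decoding with greedy verification: the drafter
# proposes k tokens, the target keeps the longest agreeing prefix and
# then emits one token of its own, so each step yields 1..k+1 tokens.
def speculative_step(target_next, drafter_next, prefix, k):
    # Drafting phase: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = drafter_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # Verification phase: keep draft tokens while the target agrees.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # On mismatch (or after a fully accepted draft), emit one target token.
    accepted.append(target_next(ctx))
    return accepted

# Toy deterministic "models" over integer tokens.
target = lambda ctx: len(ctx) % 3
perfect_drafter = target              # always agrees: k+1 tokens per step
bad_drafter = lambda ctx: 9           # never agrees: 1 token per step

out_good = speculative_step(target, perfect_drafter, [7], k=4)
out_bad = speculative_step(target, bad_drafter, [7], k=4)
print(out_good)  # 5 tokens accepted in one step
print(out_bad)   # falls back to a single target token
```

The speedup comes from the verification pass being a single batched target evaluation in real implementations; the drafter's accuracy determines how many tokens each step accepts.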