A few weeks ago, Elias Frantar and Dan Alistarh published a paper called SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. Here’s the abstract:
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
In layman’s terms, pruning is a way to make language models more compact—think of a flower garden where you prune your roses and get rid of the dead leaves, except imagine those dead leaves were taking up lots of space in the garden. Pruning isn’t a new concept, but previous methods degraded model quality. SparseGPT has apparently worked well even for models with hundreds of billions of parameters (and that’s the norm for LLMs now).
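To make the idea concrete, here’s a minimal sketch of the *simplest* form of pruning—magnitude pruning, which just zeroes out the smallest weights. This is not SparseGPT’s method (SparseGPT solves a layer-wise reconstruction problem to pick which weights to drop), and the function name and example values are made up for illustration:

```python
# Illustrative magnitude pruning -- NOT the SparseGPT algorithm.
# Zeroes out the smallest-magnitude fraction of a weight list.
def prune_to_sparsity(weights, sparsity=0.5):
    """Return a copy of `weights` with the smallest |w| values set to 0."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    # indices of the k smallest weights by absolute value
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(prune_to_sparsity(weights))  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

At 50% sparsity, half the weights become zero and can be skipped at inference time—that’s where the “more than 100 billion weights can be ignored” claim in the abstract comes from, though SparseGPT chooses its zeros far more carefully than this sketch does.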