A recent paper by Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, and Jimmy Lin has proposed a way of performing text classification using gzip, a world-renowned data compression program. Here’s the abstract:
Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.
via ACL Anthology
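The core idea is charmingly simple: if two texts are similar, compressing them together yields a smaller file than compressing two dissimilar texts together. The paper formalises this with the normalized compression distance (NCD) and feeds it to a k-nearest-neighbor vote. Here's a minimal sketch of that recipe using Python's standard `gzip` module; the function names and the tiny training set are my own illustration, not the authors' code:

```python
import gzip


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the length of the compressed form of s.
    """
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text: str, training_set: list[tuple[str, str]], k: int = 3) -> str:
    """Label `text` by majority vote among its k nearest training examples,
    where "nearest" means lowest NCD."""
    ranked = sorted(training_set, key=lambda pair: ncd(text, pair[0]))
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)


# Toy example (made up for illustration):
train = [
    ("the cat sat on the mat", "animal"),
    ("dogs bark loudly at night", "animal"),
    ("stocks rose in early trading", "finance"),
    ("the market fell sharply today", "finance"),
]
print(classify("the cat sat quietly on a mat", train, k=1))
```

No model weights, no GPU: the "learning" is whatever redundancy gzip's DEFLATE algorithm can exploit between the test text and each training example. Note that gzip's fixed header overhead means NCD values are noisy on very short strings, which is why the paper works at the level of full documents.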
There’s a lot of talk about democratising AI and making it as open source as possible, not just to compete with the power of commercial models but to let people host this tech on their own machines. This approach would be a promising way to trial such an endeavour, although I wouldn’t recommend using gzip at a production level; commercial models are still the best solution for that (besides, you know, hiring a human for generative purposes).
You can download the paper as a PDF to read at your leisure.
Filed under: data, machine learning, natural language processing, neural networks