An interesting approach to pruning large language models

an image comparing magnitude pruning of an LLM with Wanda pruning of an LLM

Large language models (LLM) are notoriously huge and expensive to work with. An LLM requires a lot of specialized hardware to train and manipulate. We’ve seen efforts to transform and quantize the models that result in smaller footprints and models that run more readily on commodity software but at the cost of performance.  Now we’re seeing efforts to make the models smaller but still perform as well as the full model.

This paper, A Simple and Effective Pruning Approach for Large Language Models, introduces us to Wanda (Pruning by Weights and activations). Here’s the synopsis:

As their size increases, Large Languages Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prune weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight update. Code is available at this https URL.

As noted the code behind that paper is readily available on Github at for everyone to try.

I think these advances in working with large language models are going to make it more economical for us to host our models and incorporate various NLP and deep learning techniques into our work.

Colarusso: Sample Notebook for Extracting Data from OCRed PDFs Using Regex and LLMs

One can use this notebook to build a pipeline to parse and extract data from OCRed PDF files. Warning: When using LLMs for entity extraction, be sure to perform extensive quality control. They are very susceptible to distracting language (latching on to text that sound “kind of like” what you’re looking for) and missing language (making up content to fill any holes), and importantly, they do NOT provide any hints to when they may be erroring. You need to make sure random audits are part of your workflow! Below we’ve worked out a workflow using regular expressions and LLMs to parse data from zoning board orders, but the process is generalizable.

Collect a set of PDFs
Place OCRed PDFs into the data folder
Write regexes to pull out data
Write LLM prompts to pull out data

A Jupyter notebook to extract data from PDFs. Useful stuff

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository.

Emerging Architectures for LLM Applications | Andreessen Horowitz

Large language models are a powerful new primitive for building software. But since they are so new—and behave so differently from normal computing resources—it’s not always obvious how to use them.

In this post, we’re sharing a reference architecture for the emerging LLM app stack. It shows the most common systems, tools, and design patterns we’ve seen used by AI startups and sophisticated tech companies. This stack is still very early and may change substantially as the underlying technology advances, but we hope it will be a useful reference for developers working with LLMs now.

Github :: free-response-scoring by David Colarusso

This repository shares code used to implement the methods described in Unsupervised Machine Scoring of Free Response Answers—Validated Against Law School Final Exams, presented at the Computational Legal Studies Conference, March 2022, hosted by the Center for Computational Law at Singapore Management University.

You can find links to all relevant content either in, or linked to from, the notebook titled Score Exams.


Good alternative to LLM text comparison. Note: patent pending Suffolk University

Reddit :: Tutorial – train your own llama.cpp mini-ggml-model from scratch!

Tutorial – train your own llama.cpp mini-ggml-model from scratch!
by u/Evening_Ad6637 in LocalLLaMA

Here I show how to train with llama.cpp your mini ggml model from scratch! these are currently very small models (20 mb when quantized) and I think this is more fore educational reasons (it helped me a lot to understand much more, when “create” an own model from.. nothing before. And it helps to understand the parameters and their effects much better)

Otherwise, these mini models could be good enough to be experts on very specific fields, like: only gives text in the style of someone. Like one model could speak like cartman from southpark, another could be a poem and you could implement these ‘person’ in your general chat or role play coversations as supporting roles or minor roles.. to make “group” chats, brainstormings, etc.

And: the discussions on github seems to be very promissing that we will soon be able to fine tune pre-trained big models like llama or vicuna and so on. espcially creating (q)lora adapters should be possible soon : )

this will be the next game changer i think (imagine your model could be finetuned in real time incrementally on top of its lora adapter and with your current conversation as the dataset – what awesome implications would this mean?)


You maybe need the training-script

Tutorial – train your own llama.cpp mini-ggml-model from scratch!

From Medium :: Run Very Large Language Models on Your Computer | by Benjamin Marie | Towards AI

New large language models are publicly released almost every month. They are getting better and larger.

You may assume that these models can only be run on big clusters or in the cloud.

Fortunately, this is not the case. Recent versions of PyTorch propose several mechanisms that make the use of large language models relatively easy on a standard computer and without much engineering, thanks to the Hugging Face Accelerate package.

Source: Run Very Large Language Models on Your Computer | by Benjamin Marie | Towards AI

This website uses a Hackadelic PlugIn, Hackadelic Sliding Notes 1.6.5.