the graphics processing unit (GPU) was historically the chip in a computer which runs the complex mathematics needed to generate graphics for display
GPUs are now also used preferentially over CPUs for AI as they run many calculations in parallel, which makes them perhaps 10x or more faster and more efficient (depending on the workload) at the vector and matrix mathematics which is the core of AI
NVIDIA is a leading manufacturer of GPUs and its GPUs support CUDA, NVIDIA's parallel computing platform and programming model, which most AI software uses to run this maths on the GPU
Python
the most commonly used programming language for creating AI solutions
neural network
a computer system designed to have many decision point nodes arranged in sequential layers
each node has its own parameter weights and during training, these weights are iteratively optimised until the output generated from a given input has the highest probability of matching the known “correct” output
a complex-valued neural network (CVNN) is a potential new system that uses complex numbers not just real numbers
it may be built upon back-propagation using Riemannian geometry
a neural network layer which has been added to many machine learning models and which is the core of current LLMs and similar AI models
hyperparameters
a parameter which is set by the person running the training of the model rather than learned during training, such as the number of neuron nodes and layers, batch_size, number of epochs to run, and the “learning rate”
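as a rough illustration, hyperparameters are often gathered in one config object before training starts. the sketch below is not from any particular library: the class name, field names and default values are all hypothetical, and it only shows how batch_size and epochs determine the number of training steps.

```python
from dataclasses import dataclass


@dataclass
class Hyperparameters:
    # All names and defaults here are illustrative assumptions.
    hidden_layers: int = 2
    nodes_per_layer: int = 64
    batch_size: int = 32
    epochs: int = 10
    learning_rate: float = 1e-3


def steps_per_epoch(num_examples, batch_size):
    # Ceiling division: the last batch may be smaller than batch_size.
    return -(-num_examples // batch_size)


config = Hyperparameters(batch_size=16, epochs=3)
total_steps = config.epochs * steps_per_epoch(1000, config.batch_size)
print(total_steps)
```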
token
AI models are not usually trained on whole words or single characters but on sub-words or short text segments (often around 2-4 characters - ascertaining the best size is a bit of an art form), each of which is called a token
when training a natural language model on a large amount of text, the text is thus broken up into tokens and each unique token is assigned an ID which is mapped to a vector of values
during training, the model learns the probability of each token following the tokens before it - ie. given the tokens used so far, what is the most likely token that should follow based on the training data?
when using a trained LLM, there is usually a limit on the number of tokens that can be used in the prompt and output by the model as a response (the context window)
text is converted to tokens by a tokenizer
commercial LLMs generally charge for how many tokens are processed in each request
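a toy sketch of the idea, assuming fixed-length 3 character segments. real tokenizers normally use learned sub-word schemes such as byte-pair encoding, so the function names and the chunk size here are illustrative only.

```python
def tokenize(text, size=3):
    # Split the text into consecutive segments of `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_vocab(tokens):
    # Give each unique token an integer ID in order of first appearance.
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


text = "the cat sat on the mat"
tokens = tokenize(text)
vocab = build_vocab(tokens)
token_ids = [vocab[tok] for tok in tokens]
print(tokens)
print(token_ids)
```

the repeated segment "the" gets the same ID both times, which is what lets a model learn statistics about a token across the whole text. the length of token_ids is also the count that token limits and per-token charges are based on.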
forward pass
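the input tokens are passed forward through the model's layers, using the current parameter values, to generate an output such as a prediction of the next token - the parameter values are only re-computed afterwards, once the loss for that output is known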
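a minimal sketch of a forward pass through a tiny two layer network, assuming a sigmoid activation at each node. the weights are made-up numbers and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    # Each node computes a weighted sum of all inputs plus its bias.
    return [sum(w * x for w, x in zip(node_w, inputs)) + b
            for node_w, b in zip(weights, biases)]


def forward(inputs, layers):
    # Feed the activations of each layer into the next layer.
    activations = inputs
    for weights, biases in layers:
        activations = [sigmoid(z) for z in dense(activations, weights, biases)]
    return activations


layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer: 2 nodes
    ([[1.0, -1.0]], [0.0]),                     # output layer: 1 node
]
output = forward([1.0, 1.0], layers)
print(output)
```

note that nothing in `forward` changes the weights: it only uses them to turn an input into an output.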
model parameters
each “neural node” has parameters or weights which the forward pass uses to compute the node's value - for a simple recurrent network these are, in general, the weights A and B, where hidden neural node value = A x previous value + B x token input value
an LLM may have billions of parameters stored
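the A and B relationship above can be sketched for a single node with scalar values. this is a simplification: in a real model A and B are matrices, the inputs are token vectors, and an activation function is normally applied at each step.

```python
def run_recurrence(inputs, A, B, h0=0.0):
    # hidden value = A x previous hidden value + B x current input value
    h = h0
    history = []
    for x in inputs:
        h = A * h + B * x
        history.append(h)
    return history


history = run_recurrence([1.0, 2.0, 3.0], A=0.5, B=2.0)
print(history)  # [2.0, 5.0, 8.5]
```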
model loss
during training each iteration generates an output which is then compared to the known desired output and the difference is the “loss”
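two common ways of turning “the difference” into a single number are sketched below: mean squared error for numeric outputs and cross-entropy for a predicted probability distribution over tokens. which loss a given model uses is an assumption here, as the notes above do not specify one.

```python
import math


def mean_squared_error(predicted, target):
    # Average of the squared differences between output and desired output.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy(probabilities, correct_index):
    # Low when the model gives high probability to the correct token.
    return -math.log(probabilities[correct_index])


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(cross_entropy([0.25, 0.75], correct_index=1))
```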
gradient descent
for a given set of parameters, the loss is calculated, and the gradient is how much the loss changes when a small adjustment is made to each parameter - each parameter is then moved a small step (set by the learning rate) in the direction which lowers the loss, and this is repeated - when the gradient value approaches zero, this generally reflects a minimum of the loss, although this may be a local minimum rather than the optimum parameter values.
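a minimal sketch of gradient descent on a single parameter, assuming a made-up loss of (w - 3)^2 whose lowest point is at w = 3. `numerical_gradient` estimates the gradient by making a small adjustment and re-calculating the loss, while `gradient` uses the exact formula.

```python
def loss(w):
    return (w - 3.0) ** 2


def gradient(w):
    # Exact derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)


def numerical_gradient(f, w, eps=1e-6):
    # Estimate the slope from the loss either side of w.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # Step against the gradient, i.e. downhill on the loss.
        w -= learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, gradient(w_final))
```

real frameworks calculate exact gradients automatically rather than by the small-adjustment estimate, which would be far too slow for billions of parameters.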
back propagation
in order to optimise the loss to its minimum possible value, the model iteratively steps back from the output through the prior neural network layer connections for each node, using the chain rule to calculate the gradient of the loss for each parameter, makes adjustments to the parameter values according to gradient descent and then determines the new output and loss
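a hand-written sketch of back propagation for the smallest possible case: one input, one hidden node with a tanh activation and one output, so y = w2 x tanh(w1 x input), with a squared error loss. the network shape, starting weights and learning rate are all assumptions chosen for illustration.

```python
import math


def forward(x, w1, w2):
    h = math.tanh(w1 * x)  # hidden node value
    y = w2 * h             # output value
    return h, y


def backward(x, target, w1, w2):
    # Chain rule, working back from the loss to each weight.
    h, y = forward(x, w1, w2)
    dloss_dy = 2.0 * (y - target)             # loss = (y - target)^2
    dloss_dw2 = dloss_dy * h                  # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x  # d tanh(z)/dz = 1 - tanh(z)^2
    return dloss_dw1, dloss_dw2


def train(x, target, w1=0.5, w2=0.5, learning_rate=0.1, steps=200):
    losses = []
    for _ in range(steps):
        _, y = forward(x, w1, w2)
        losses.append((y - target) ** 2)
        g1, g2 = backward(x, target, w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2, losses


w1, w2, losses = train(x=1.0, target=0.8)
print(losses[0], losses[-1])
```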
epochs
the number of complete passes the model makes through the training data during training
in general, each epoch run should result in a lower loss although this may plateau
over-fitting
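when a model performs well on the training data but poorly on data it has not seen during training (such as held-out test data or more generalized real-world data), as it has fitted the specifics of the training data rather than learning general patterns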
fine tuning terms
Reinforcement Learning from Human Feedback (RLHF)
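fine tunes a model using reinforcement learning against a separate Reward Model (RM) which has been trained on human rankings of model outputs
costly, complicated and can be unstable in training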
Direct Preference Optimization (DPO)
fine tunes directly from Human Feedback, given as pairs of preferred and rejected responses, without Reinforcement Learning
reported to be simpler, cheaper and more stable than RLHF and doesn't use a separate RM
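a sketch of the DPO loss for a single preference pair, following the published DPO objective. the inputs are the log-probabilities which the model being tuned (the policy) and a frozen reference copy give to the preferred (chosen) and rejected responses. the function and argument names are hypothetical and beta = 0.1 is just an example value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favours the chosen response more than the reference: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

no reward model appears anywhere in the calculation, which is the main practical difference from RLHF.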
trained model terminology
large language model (LLM)
an AI model trained on a very large dataset of text - usually with over a billion parameters stored
hallucination
when an LLM generates text that is erroneous, nonsensical, or detached from reality.
“Unlike databases or search engines, LLMs lack the ability to cite their sources accurately, as they generate text through extrapolation from the provided prompt. This extrapolation can result in hallucinated outputs that do not align with the training data.”
quantization
in order for very large models to function on machines with less video RAM, models can be reduced in size (quantized) by various methods such as converting the weight values from 32 bit floating point values to lower precision values such as 8 bit integers and even 4 or 5 bit values without losing too much inference accuracy
activation-aware weight quantization (AWQ) treats weights differently during quantization based upon their importance, as judged from the activations they produce, to create a more effective quantized model
this can be done in real time in Python apps by installing and using the autoawq and autoawq-kernels libraries
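a minimal sketch of plain symmetric quantization, to show the basic trade-off between bit width and accuracy. this is not AWQ and does not use the autoawq library: it is the simplest scheme, assuming one scale factor for the whole list of weights, and the weight values are made up.

```python
def quantize(weights, bits=8):
    # Map floats onto signed integers in the range [-qmax, qmax].
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid divide by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    # Recover approximate float values from the stored integers.
    return [q * scale for q in quantized]


def max_error(weights, bits):
    quantized, scale = quantize(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q8, scale8 = quantize(weights, bits=8)
q4, scale4 = quantize(weights, bits=4)
print(q8, max_error(weights, 8))
print(q4, max_error(weights, 4))
```

the 4 bit version stores each weight in half the space of the 8 bit version but restores it less accurately.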
embeddings
these are data representations of text, images, sound, or video which AI models can process to perform similarity searches, etc
they are usually in the form of a data store of vector values
there are many factors in how well a given embedding will perform in terms of accuracy, resource use, cost, etc - examples are context size and number of dimensions
an example is OpenAI's text-embedding-ada-002 embedding for text
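a sketch of a similarity search over a tiny in-memory store, assuming cosine similarity as the measure. the 3 dimensional vectors are invented for illustration: real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    # 1.0 means the same direction, 0.0 means unrelated (orthogonal).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, store):
    # Return the other key whose vector is closest to the query's vector.
    others = (key for key in store if key != query)
    return max(others, key=lambda key: cosine_similarity(store[query], store[key]))


# Made-up embedding vectors.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9],
}
print(most_similar("cat", store))
```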
Flash Attention 2 speeds up attention calculations by making better use of the GPU memory hierarchy and avoiding costly reads and writes of the full attention matrix to GPU memory. Uses online softmax, better parallelism and work partitioning, and GPU memory saving tricks
offloading parameters - most deep learning models process neural layers in a fixed order, so offloading can pre-dispatch the next layer's parameters in the background, ahead of time.
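the online softmax mentioned for Flash Attention 2 can be sketched in plain Python. this is not the Flash Attention kernel itself, only the running-maximum trick which lets a softmax-weighted sum be computed in a single pass without first storing every softmax weight. scalar values are used instead of vectors to keep it short, and the function names are hypothetical.

```python
import math


def softmax(scores):
    # Standard two-pass softmax, used here only as a reference.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_weighted_sum(scores, values):
    # One pass: keep a running max, denominator and weighted accumulator.
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        denom = denom * correction + weight
        acc = acc * correction + weight * v
        running_max = new_max
    return acc / denom


scores = [2.0, -1.0, 0.5, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
print(online_softmax_weighted_sum(scores, values))
```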
prompt engineering
the art of creating the best prompts to send to a model to give the best chance of getting a desired output
there is no need to be polite - just get straight to the point
integrate the intended audience, eg. “the audience is an expert in the field”
“Your task is”, “You MUST”, “You will be penalized”, “Answer a question given in a natural, human-like manner”
“I'm going to tip $(random amount) for a better solution!”
break down complex tasks or logical reasoning - “think step by step.”
use affirmative directives such as “do”, and avoid negative language like “don't”.
repeat a specific word or phrase multiple times within a prompt
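a few of the tips above can be combined in a small helper which assembles a prompt string. the function and its arguments are hypothetical and no claim is made that this wording is optimal for any particular model.

```python
def build_prompt(task, audience=None, requirement=None, step_by_step=False):
    # Direct opening with no pleasantries.
    parts = [f"Your task is to {task}."]
    if audience:
        parts.append(f"The audience is {audience}.")
    if requirement:
        # Affirmative directive.
        parts.append(f"You MUST {requirement}.")
    if step_by_step:
        parts.append("Think step by step.")
    return " ".join(parts)


prompt = build_prompt(
    "summarise the attached report",
    audience="an expert in the field",
    requirement="answer in under 100 words",
    step_by_step=True,
)
print(prompt)
```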
further fine tuning
Instruct tuning
this is generally designed to optimise models for following instructions and for conversational use, such as ChatGPT, Mistral-7B-Instruct
safety tuning
adds restrictions to minimise anti-social abuses, etc
domain fine tuning
optimising the model for a certain role such as healthcare
Low-rank Adaptation (LoRA)
a fine tuning method which attempts to avoid the need to re-train all parameters (weights) in the model (which could be billions)
it does so by:
leaving the original weights unchanged and learning only the desired changes to the weights
these changes are tracked in two smaller matrices (low rank matrices - the greater the rank used, ie. the more columns in the 1st matrix and the more rows in the 2nd matrix, the greater the precision of the output matrix when multiplied) which get multiplied together to form a matrix the same size as the original model's weight matrix
these then get added to the model's weight matrix as needed
training the low rank matrices is much faster and more efficient than doing so on the full matrix - for a weight matrix of d x d values and rank r, the two low rank matrices hold only 2 x d x r parameters between them rather than d x d (eg. for d = 4096 and r = 8, that is 65,536 instead of 16,777,216)
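the steps above can be sketched with small lists of lists. the matrices are tiny and their values made up. B and A follow the naming commonly used for the 1st and 2nd low rank matrices, which is an assumption as the notes above do not name them.

```python
def matmul(a, b):
    # Plain matrix multiplication for lists of lists.
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def parameter_counts(d, k, r):
    # (parameters in the full d x k matrix, parameters in the two low rank matrices)
    return d * k, d * r + r * k


d, k, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # original weights
B = [[0.1], [0.2], [0.3], [0.4]]  # 1st matrix: d rows x r columns
A = [[1.0, 0.0, -1.0, 0.5]]       # 2nd matrix: r rows x k columns

delta = matmul(B, A)  # same shape as W
W_adapted = [[W[i][j] + delta[i][j] for j in range(k)] for i in range(d)]

print(parameter_counts(d, k, r))
print(parameter_counts(4096, 4096, 8))
```

only B and A would be trained, and W itself is never modified, so the adaptation can be stored separately and added when needed.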
QLoRA
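a version of LoRA in which the original model's weights are quantized (eg. to 4 bit) and left unchanged while the low rank matrices are trained at higher precision, reducing the memory needed for fine tuning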
text2image terminology
diffusion
an AI software technique for text to image generation
the system is trained on images to which random noise is sequentially added until they essentially become random dots - a neural network is trained to predict, and so remove, the noise added at each of these steps
the diffusion process to create an image from text prompts then starts with a random dot image and applies the trained network repeatedly to remove the noise step by step, guided by the text prompt, to create a new image
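a sketch of a single noising step and its reversal, using the standard formulation in which a noisy image is a weighted mix of the original image and Gaussian noise. here alpha_bar (between 0 and 1) is the fraction of the original signal kept at a given step. in a real system a trained neural network predicts the noise: below, the true noise stands in for that prediction, which is an assumption made to keep the example self-contained. an “image” here is just a short list of pixel values.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    # noisy = sqrt(alpha_bar) * original + sqrt(1 - alpha_bar) * noise
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
             for x, n in zip(pixels, noise)]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    # Invert the mix above, given a prediction of the noise that was added.
    return [(x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
image = [0.0, 0.5, 1.0]
noisy, noise = add_noise(image, alpha_bar=0.5, rng=rng)
recovered = remove_noise(noisy, noise, alpha_bar=0.5)
print(noisy)
print(recovered)
```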
image analysis terms:
Region-of-Interest (ROI)
refers to image analysis where the AI analyses a specific region of the image and ascertains what is in that region
eg. the GPT4RoI vision model was fine tuned to perform this
referring
refers to image analysis where the AI analyses a RoI in an image and then ascertains what the object is and also its function
grounding referring
refers to image analysis where the AI analyses an image to locate a target visual object described by a natural language expression, by understanding multimodal semantic concepts and the relationships between them (eg. in a Captcha image - it can find all the traffic lights)