AI terminology

common AI terms

GPU
- the graphics processing unit (GPU) was historically the chip in a computer which ran the complex mathematics needed to generate visual graphics for display
- GPUs are now also preferred over CPUs for AI work as they process many vector and matrix operations in parallel, which is the core of AI, and are perhaps 10x or more faster and more efficient at this, depending on the workload
- NVIDIA is a leading manufacturer of GPUs and its GPUs support CUDA, a parallel computing platform which lets software run general-purpose calculations such as vector maths on the GPU
Python
- the programming language most commonly used to create AI solutions
- see Python language
machine learning terms
- see AI deep learning
- neural network
  - a computer system built from many nodes (“neurons”) arranged in sequential layers, with each layer passing its outputs on to the next
  - each node has its own parameter weights and, during training, these weights are iteratively optimised until the network has the highest probability of producing the correct output for a given input, as judged against a known “correct” output
  - a complex-valued neural network (CVNN) is a potential new system that uses complex numbers rather than only real numbers
    - it may be built upon back-propagation using Riemannian geometry
  - see AI machine learning
- transformer
  - a neural network architecture built around attention layers, which let the model weigh every token in the input against every other token; it is the core of current LLMs and similar AI models
- hyperparameters
  - settings chosen by the person running the training of the model rather than learned from the data, such as the number of neuron nodes and layers, batch_size, number of epochs to run and the “learning rate”
- token
  - language models are not usually trained on whole words or single characters but on sub-words or short text segments (often around 2-4 characters; ascertaining the best granularity is a bit of an art form), each of which is called a token
  - when training a natural language model on a large amount of text, the text is thus broken up into tokens; each unique token is assigned an integer ID, which the model maps to a vector value
  - during training, the model learns the probability of each possible next token given the tokens before it - ie. if a certain token is being used, what is the most likely token that should follow it based on the training data?
  - when using a trained LLM, there is usually a limit on the number of tokens that can be used in the prompt and output by the model as a response (the context window)
  - text is converted to tokens by a tokenizer
  - commercial LLMs generally charge for how many tokens are processed in each request
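As a minimal sketch of the token idea above, the following toy tokenizer splits text into fixed-length character chunks and gives each unique chunk an integer ID. The function names and the chunk length of 3 are assumptions made for illustration only; real tokenizers (for example byte-pair encoding) learn their sub-word vocabulary from data rather than cutting at fixed lengths.

```python
def chunk_text(text, size=3):
    """Split text into consecutive chunks of `size` characters (toy tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_vocab(tokens):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, size=3):
    """Convert text to a list of token IDs using an existing vocabulary."""
    return [vocab[tok] for tok in chunk_text(text, size)]


text = "the cat sat on the mat"
tokens = chunk_text(text)
vocab = build_vocab(tokens)
ids = encode(text, vocab)
print(tokens)
print(ids)
```

The repeated chunk "the" maps to the same ID both times it appears, which is what lets a model learn statistics about what tends to follow it. Counting `len(ids)` is also, roughly, how token-based limits and charges are measured.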
- forward pass
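  - passing an input through the network, layer by layer, using the current parameter values to compute an output; for a language model this output is a prediction of the next token. The parameter values are not changed during the forward pass; they are updated afterwards as part of training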
- model parameters
  - the values stored at each “neural node” which the forward pass uses and which training adjusts - these are, in general, the parameters or weights A and B, where hidden neural node value = A x previous value + B x token input value
  - an LLM may have billions of parameters stored
- model loss
  - during training, each iteration generates an output which is compared to the known desired output; a loss function measures the difference, and the result is the “loss”
- gradient descent
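  - for a given set of parameters, the loss is calculated; a small adjustment is then made to a parameter and the loss is calculated again, and the two values are compared to ascertain the gradient, ie. the direction and rate at which the loss is moving for this adjustment. Each parameter is then moved a small step in the direction which lowers the loss - when the gradient value approaches zero, this generally reflects the optimum parameter values to get the lowest loss (although it may only be a local minimum)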
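The gradient descent idea can be sketched on a single parameter. This is an illustrative toy, not how a real framework does it: the loss function, the learning rate of 0.1 and the function names are all assumptions, and the gradient is estimated from two nearby loss values, whereas real training obtains gradients by back propagation.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def numerical_gradient(f, w, eps=1e-6):
    """Estimate the slope of f at w from two nearby loss values."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def gradient_descent(f, w, learning_rate=0.1, steps=100):
    """Repeatedly move w a small step against the gradient."""
    for _ in range(steps):
        w -= learning_rate * numerical_gradient(f, w)
    return w


w_best = gradient_descent(loss, w=0.0)
print(w_best, loss(w_best))
```

Starting from `w=0.0`, the parameter moves towards 3, where the gradient and the loss are both close to zero.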
- back propagation
  - in order to reduce the loss towards its minimum possible value, the training process steps backwards from the output through each prior neural network layer, using the chain rule to work out how much each parameter contributed to the loss; the parameters are then adjusted according to gradient descent and the new output and loss are determined
- epochs
  - one epoch is one complete run through the training data; the number of epochs is the number of times the model runs through that data during training
  - in general, each epoch should result in a lower loss although this may plateau
- over-fitting
  - when a model performs well on the training data but poorly on data it has not seen, such as validation data, test data or more generalized real-world data
- fine tuning terms
  - Reinforcement Learning from Human Feedback (RLHF)
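    - fine tunes a model with reinforcement learning, using a separate Reward Model (RM) trained on human preference ratings to score its outputs; costly, complicated and can be unstable in training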
  - Direct Preference Optimization (DPO)
    - fine tunes directly on human preference data without Reinforcement Learning
    - generally simpler, cheaper and more stable than RLHF and doesn't use a RM
trained model terminology
- large language model (LLM)
  - an AI model trained on a very large dataset of text - usually with over a billion parameters stored
  - see AI Large Language Models (LLMs)
- hallucinate
  - when an LLM generates text that is erroneous, nonsensical, or detached from reality
  - “Unlike databases or search engines, LLMs lack the ability to cite their sources accurately, as they generate text through extrapolation from the provided prompt. This extrapolation can result in hallucinated outputs that do not align with the training data.”
  - see https://masterofcode.com/blog/hallucinations-in-llms-what-you-need-to-know-before-integration
- quantization
  - in order for very large models to function on machines with less video RAM, models can be reduced in size (quantized) by various methods such as converting weights from 32-bit floating point values to lower-precision values such as 8-bit integers, and even 4 or 5 bit values, without losing too much inference accuracy
  - activation-aware weight quantization (AWQ) uses activation statistics to identify the weights most important to the output and protects these during quantization to create a more effective quantized model
    - this can be done in Python apps by installing and using the autoawq and autoawq-kernels libraries
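To make the size-versus-accuracy trade-off concrete, here is a minimal sketch of simple symmetric quantization in plain Python. It is not AWQ and does not use the autoawq library; the function names, the example weights and the single shared scale factor are assumptions chosen for illustration.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers of the given bit width (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0                         # all-zero input: any scale works
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.42, -1.27, 0.05, 0.9, -0.33]
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    print(bits, q, max_error(weights, bits))
```

Fewer bits means fewer distinct integer levels, so the 4-bit round trip loses more precision than the 8-bit one; methods such as AWQ aim to keep that loss small where it matters most.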
- embeddings
  - these are numerical representations of text, images, sound, or video which AI models can process, eg. to run similarity searches
  - each embedding is a vector of numbers, and embeddings are usually kept in a data store of vector values
  - many factors affect how well a given embedding model performs in terms of accuracy, resource use, cost, etc; examples are context size and number of dimensions
  - an example is OpenAI's text-embedding-ada-002 embedding for text
  - see also https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
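A similarity search over embeddings can be sketched with cosine similarity, a common way to compare two vectors. The three-dimensional vectors below are made up for illustration and do not come from any real embedding model (real embeddings typically have hundreds or thousands of dimensions); the function names are also assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.3, 0.1],
    "car": [0.1, 0.0, 0.9],
}


def most_similar(query, store, exclude=None):
    """Return the key whose vector is closest in direction to the query."""
    candidates = {k: v for k, v in store.items() if k != exclude}
    return max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))


print(most_similar(store["cat"], store, exclude="cat"))
```

With these invented vectors the nearest neighbour of "cat" is "kitten" rather than "car", because their vectors point in nearly the same direction.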
- inference
  - the process of running live data through a trained AI model to make a prediction or solve a task
- CPU/GPU inference engine
  - a software tool which runs LLM inference more efficiently than usual so that models can be run on lower end machines via their CPUs or a lower end GPU
  - eg. llama.cpp, PowerInfer (uses “activation locality” -see https://www.youtube.com/watch?v=jH9bAbmqi2I)
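  - Flash Attention 2 speeds up attention calculations by exploiting the GPU memory hierarchy to avoid costly reads and writes to the slower main GPU memory, and by reducing operations that are not matrix multiplications. Uses online softmax, better parallelism and work partitioning, and GPU memory saving tricks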
  - offloading parameters - most deep learning models process neural layers in a fixed order, offloading can pre-dispatch the next layer parameters in the background, ahead of time. ¹⁾
- prompt engineering
  - the art of creating the best prompts to send to a model to give the best chance of getting a desired output
  - principled instructions for better prompting - see https://arxiv.org/pdf/2312.16171v1.pdf
    - there is no need to be polite - just get straight to the point
    - integrate the intended audience eg. “the audience is an expert in the field”
    - use phrases such as “Your task is”, “You MUST”, “You will be penalized”, “Answer a question given in a natural, human-like manner”
    - “I'm going to tip $(random amount) for a better solution!”
    - break down complex tasks or logical reasoning - “think step by step.”
    - use affirmative directives such as “do” and avoid negative language like “don't”
    - repeat a specific word or phrase multiple times within a prompt
- further fine tuning
  - Instruct tuning
    - generally designed to optimise a model to follow instructions and for conversational use, eg. ChatGPT, Mistral-7B-Instruct
  - safety tuning
    - adds restrictions to minimise anti-social abuses, etc
  - domain fine tuning
    - optimising the model for a certain role such as healthcare
  - Low-rank Adaptation (LoRA)
    - a fine tuning method which attempts to avoid the need to re-train all parameters (weights) in the model (which could be billions)
    - it does so by:
      - leaving the original weights frozen and learning only the desired changes to them
      - these changes are tracked in two smaller matrices (low rank matrices) which get multiplied together to form a matrix the same size as the original model's weight matrix - the greater the rank used, ie. the more columns in the 1st matrix and the more rows in the 2nd matrix, the greater the precision of the output matrix when multiplied
      - this product then gets added to the model's weight matrix on an as needed basis
      - training the two low rank matrices is much faster and more efficient than training the full matrix: for a d x d weight matrix and rank r they hold only 2 x d x r parameters between them, versus d x d for the full matrix
  - QLoRA
    - LoRA applied on top of a quantized (eg. 4-bit) copy of the base model, which further reduces the memory needed for fine tuning
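The low rank idea behind LoRA can be sketched with tiny matrices in plain Python. This is an illustration under stated assumptions, not a fine tuning implementation: the matrix sizes, the values and the names `a`, `b`, `matmul` and `lora_weights` are invented here, no training takes place, and the `scale` argument is a simplified stand-in for the scaling factor real implementations apply.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weights(w, a, b, scale=1.0):
    """Return W + scale * (A @ B), the adapted weight matrix."""
    delta = matmul(a, b)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d, r = 4, 1
# Frozen d x d base weights (an identity matrix, for simplicity).
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
a = [[0.1], [0.2], [0.3], [0.4]]        # d x r matrix, the part that would be trained
b = [[1.0, 0.0, -1.0, 0.5]]             # r x d matrix, the part that would be trained

adapted = lora_weights(w, a, b)
full_params = d * d
lora_params = d * r + r * d
print(adapted[0])
print(full_params, lora_params)
```

At this toy size the saving is only 16 versus 8 parameters, but it grows quickly: for a 4096 x 4096 weight matrix and rank 8, the full matrix has 16,777,216 parameters while the two low rank matrices hold 65,536 between them.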
text2image terminology
- diffusion
  - an AI software technique for text to image generation
  - the system is trained on images to which random noise is added step by step until they essentially become random dots - a neural network learns to predict the noise added at each step so that it can be removed again
  - the diffusion process to create an image from a text prompt then starts with a random dot image and applies the learned denoising step repeatedly, guided by the prompt, to create a new image
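The noising half of diffusion can be sketched in a few lines. This covers only how a clean image is progressively mixed with Gaussian noise during training; the learned network that reverses the process is omitted. The linear schedule, its start and end values, the 1000 steps and the function names are assumptions for illustration, and the four-number "image" is a stand-in for real pixel data.

```python
import math
import random


def make_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; returns the cumulative signal fraction per step."""
    alpha_bar, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(pixels, alpha_bar_t, rng):
    """Blend the clean pixels with Gaussian noise for one chosen step."""
    keep = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
alpha_bar = make_schedule()
image = [1.0, -1.0, 0.5, 0.0]          # stand-in for pixel values
for t in (0, 499, 999):
    print(t, alpha_bar[t], add_noise(image, alpha_bar[t], rng))
```

At the first step almost all of the original signal is kept, and by the last step almost none is, which is why generation can start from pure random noise.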
image analysis terms:
- Region-of-Interest (ROI)
  - refers to image analysis where the AI analyses a specific region of the image and ascertains what is in that region
  - eg. the GPT4RoI vision model was fine tuned to perform this
- referring
  - refers to image analysis where the AI analyses a RoI in an image and then ascertains what the object is and also its function
- grounding referring
  - refers to image analysis where the AI analyses an image to locate a target visual object by understanding multimodal semantic concepts as well as relationships between referring natural language expressions (eg. in a Captcha image - it can find all the traffic lights)

¹⁾ https://arxiv.org/abs/2312.17238