Understanding GenAI Concepts
Artificial Intelligence Overview
An overview of basic concepts and terminology.
The term “artificial intelligence” is an imprecise one, with many definitions extant. Broadly speaking, the term is used to classify machines that simulate human learning, reasoning, and decision-making.
AI can be classified as “weak” or “strong.” Weak AI, also known as “Artificial Narrow Intelligence” (ANI), refers to systems designed to perform a specific task or set of tasks. Chatbots, voice assistants such as Apple’s Siri or Amazon’s Alexa, facial recognition systems, and autonomous vehicles are some examples of ANI. Much research is aimed at achieving Strong AI, or "Artificial General Intelligence" (AGI). These systems do not yet exist but would possess the ability to accomplish a wide array of reasoning and tasks on par with a human. Even more theoretical (and currently the realm of science fiction) is the notion of "Artificial Super Intelligence" (ASI), in which machines surpass human intelligence and capabilities, perhaps even achieving a form of consciousness.
Machine Learning is a branch of artificial intelligence that builds algorithms from data sets to enable systems to imitate human learning and decision-making. It accomplishes this by extracting patterns from data, emphasizing feature engineering to develop decision algorithms. Such systems develop and update internal models based on “learning” – ingesting structured labeled or unlabeled data, making predictions, and adjusting their internal models based upon whether decision outcomes are correct or incorrect.
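To make this concrete, the following sketch shows the basic machine learning workflow using the scikit-learn library: hand-engineered features and human-supplied labels are fed to a simple decision algorithm, which can then predict labels for new examples. The feature choices and measurements below are invented purely for illustration.

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical engineered features: [body_weight_kg, ear_length_cm]
    X = [[4.0, 6.5], [5.2, 7.0], [30.0, 12.0], [25.0, 10.5]]
    y = ["cat", "cat", "dog", "dog"]      # labels supplied by humans

    model = DecisionTreeClassifier()
    model.fit(X, y)                       # "learning": extract decision rules from the data

    print(model.predict([[6.0, 7.2]]))    # classify a new, unseen example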
An artificial neural network (ANN) is an extension of machine learning inspired by the basic organizing principle of biological neurons within the brain. ANNs are composed of layers of nodes, beginning with an input layer, followed by one or more hidden layers, and ending with an output layer.
Nodes (essentially artificial neurons) connect to the nodes in adjacent layers. Each node combines its weighted inputs and applies a threshold (or bias) that determines when the neuron “activates,” passing data on to the next layer in the network. Layer by layer, these weighted summations propagate through the network until they reach the output layer, at which point the output nodes can be evaluated and a final decision made (e.g. is an input image a cat or a dog?).
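The following sketch traces a single input through a tiny network of this kind using NumPy. The layer sizes, weights, and input values are arbitrary placeholders rather than a trained model; the point is simply to show data passing from the input layer, through a hidden layer, to the output layer.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))       # activation: determines how strongly a node "fires"

    x = np.array([0.8, 0.1, 0.4])             # input layer: three features of some input

    W1 = np.random.randn(3, 4)                # weights connecting input -> hidden (4 hidden nodes)
    b1 = np.zeros(4)                          # biases (thresholds) for the hidden nodes
    hidden = sigmoid(x @ W1 + b1)             # hidden layer activations

    W2 = np.random.randn(4, 2)                # weights connecting hidden -> output (2 output nodes)
    b2 = np.zeros(2)
    output = sigmoid(hidden @ W2 + b2)        # e.g. scores for ["cat", "dog"]

    print(output)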
ANNs learn by adjusting the weights on the connections between nodes based on information fed into the network. This process is known as model training. The larger the number of weights – referred to as the model’s parameters – the more complex and refined the model’s capabilities. The more data the system can absorb, the better the parameter estimates become and the more accurate the overall model.
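A minimal sketch of this training loop, for a single artificial neuron on an invented data set, might look like the following: the model predicts, the error is measured, and the weights are nudged in the direction that reduces the error.

    import numpy as np

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])  # toy inputs
    y = np.array([1.0, 1.0, 1.0, 0.0])                              # desired outputs (an OR pattern)

    w = np.zeros(2)      # the parameters (weights) to be learned
    b = 0.0
    lr = 0.5             # learning rate: how large each adjustment is

    for _ in range(1000):
        pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # make predictions with current weights
        error = pred - y                            # compare against the correct answers
        w -= lr * X.T @ error / len(y)              # adjust the weights to reduce the error
        b -= lr * error.mean()

    print(np.round(1.0 / (1.0 + np.exp(-(X @ w + b)))))  # approximately [1. 1. 1. 0.]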
ANNs that consist of three or more layers are referred to as deep learning models (IBM).
Foundation models are neural network models trained on extremely large, broad-based data sets that enable them to be used for a wide variety of tasks across multiple domains. They can be leveraged as a base to develop more complex, specialized models, greatly accelerating the AI development lifecycle. Google’s BERT and Gemini, OpenAI’s GPT and DALL-E, and Meta’s Llama are examples of well-known foundation models.
AI models are trained by feeding them large amounts of data and allowing them to learn from patterns and relationships within that data. This data can come from many sources: web scraping, direct user input (e.g. recording prompt interactions in AI chat tools), capturing user data from other sources (e.g. audio, video, screenshots, data files, etc.), or even synthetic data – artificial data generated by other AI systems specifically for the purpose of model training. The models undergo an iterative process of observing inputs and outputs, adjusting the parameters within their neural networks to improve output quality.
There are many approaches to AI model training; broadly, however, they can be classified as supervised, unsupervised, or reinforcement learning.
Supervised learning uses data sets that have been manually labeled or pre-classified by humans, allowing the model to iteratively make predictions on the data and then adjust toward the correct answers. Unsupervised learning relies on the model to discover common patterns within unlabeled data (e.g. by grouping data into related clusters or identifying relationships between variables within a data set).
Unsupervised learning allows for greater automation, although some human intervention is still required to validate outputs and ensure accurate results. Reinforcement learning allows the model to attempt to solve a task in a trial-and-error fashion through interaction with its environment, guided by a reward function that reinforces successful outcomes. Because it can be difficult to create an effective reward function, the reward model is often trained via direct human feedback – in this case, the technique is referred to as Reinforcement Learning from Human Feedback (RLHF).
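The sketch below contrasts the first two approaches using the scikit-learn library and invented data; reinforcement learning is omitted because it requires an interactive environment and a reward signal.

    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Supervised: the data arrives with human-provided labels to learn from.
    X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]
    labels = ["spam", "spam", "not spam", "not spam"]
    classifier = LogisticRegression().fit(X, labels)
    print(classifier.predict([[5.0, 5.0]]))          # -> ['not spam']

    # Unsupervised: no labels; the model groups the data into clusters on its own.
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print(clusters)                                  # e.g. [0 0 1 1] -- cluster ids, not labels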
Generative Artificial Intelligence (GenAI) and Large Language Models
Explanation of GenAI and what makes it unique with respect to AI more broadly.
Generative Artificial Intelligence relies on deep learning models (artificial neural networks consisting of three or more layers) to identify and encode patterns and contextual relationships in large data sets. These models use this information to understand user requests and generate responses, which can be text, images, audio, video, or software code. GenAI systems often rely on transformer-based large language models (LLMs) but can use other model approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), or diffusion models.
Large language models (LLMs) are a specialized form of multi-layered neural network (specifically, one using an architecture known as a transformer) trained on large collections of natural language text. This data is typically collected from the internet, user input to GenAI tools, or other available sources. LLMs are a type of foundation model: models trained on extremely large data sets to provide foundational capabilities applicable to a wide variety of use cases, as opposed to more specialized, domain-specific models.
LLMs are designed to understand and generate text, including inferring semantic context and generating relevant responses. They do this by leveraging billions of parameters within an artificial neural network to map contextual relationships in language. These techniques can be applied to non-linguistic tasks as well, such as image or audio processing.
A transformer is a neural network architecture first proposed in a famous 2017 paper by researchers from Google and the University of Toronto titled “Attention Is All You Need.” Transformer models process entire input sequences in parallel and use an attention mechanism to selectively weight the parts of the input most relevant to each token. This allows the model to better account for contextual relationships, speeds up training, and improves efficiency by distributing computations over multiple processing units (typically graphics processing units, or GPUs for short).
OpenAI’s GPT (Generative Pre-trained Transformer) and Google’s BERT (Bidirectional Encoder Representations from Transformers) were among the first major applications of the transformer approach. While originally developed for natural language processing (particularly machine translation), transformer models have since been adapted to a wide variety of domains, including image, video, and audio processing.
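At the core of the transformer is the attention mechanism referenced in the paper’s title. The NumPy sketch below shows the basic scaled dot-product attention computation with arbitrary example values: each token’s query is compared against every other token’s key, and the resulting relevance weights determine how much of each token’s value flows into the output.

    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # relevance of every token to every other token
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: normalize the relevance weights
        return weights @ V                            # blend token information according to relevance

    # 4 tokens, each represented by an 8-dimensional vector (sizes chosen for illustration)
    Q = np.random.randn(4, 8)   # queries
    K = np.random.randn(4, 8)   # keys
    V = np.random.randn(4, 8)   # values
    print(attention(Q, K, V).shape)   # (4, 8): one context-aware vector per token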
Because computers cannot understand language directly, text must be converted to numbers
that machines can act upon computationally (primarily using vector and matrix mathematics).
GenAI systems first break words into smaller components or subwords that can be represented
as numbers. These are referred to as tokens. Tokens are then compared with other tokens to create a vector representation of semantic relationships – essentially an ordered series of numbers that captures how closely related a given token is to others. These vectors are known as embeddings. Similar tokens have similar embeddings, allowing the system to determine context by applying statistical techniques to input token sequences and then calculating relevant responses.
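The toy example below illustrates the idea with tiny, hand-written four-dimensional embeddings; real models learn embeddings with hundreds or thousands of dimensions, and the vectors and similarity scores here are purely illustrative.

    import numpy as np

    # A real tokenizer might first split a word such as "unbelievable" into
    # subword tokens like ["un", "believ", "able"], each mapped to an integer id.

    embeddings = {
        "cat": np.array([0.9, 0.1, 0.8, 0.0]),
        "dog": np.array([0.8, 0.2, 0.7, 0.1]),
        "car": np.array([0.1, 0.9, 0.0, 0.8]),
    }

    def cosine_similarity(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related tokens
    print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: unrelated tokens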
Context window refers to the number of tokens an AI model can consider at once (note that a token is typically smaller than a word, so the effective word limit is somewhat lower than the listed token limit). It can be thought of as the memory span of the model. A larger context window allows a model to retain more information from an interaction (such as a series of prompt exchanges), providing more coherent, relevant responses. Context windows have steadily increased over time – ChatGPT’s original context window was limited to 4,000 tokens, while GPT-4o’s is 128,000. Google’s latest Gemini 1.5 Pro model has a context window of 2,000,000 tokens, although this is currently limited to select developers.
While larger context windows provide significant advantages, there are nevertheless tradeoffs involved. Larger context windows require more computational resources to process requests, which can increase processing time and user costs (many GenAI services charge per token). In addition, inputting too much information (referred to as prompt stuffing) can bog down the model with extraneous details, causing the system to miss key information, producing less accurate results and increasing the likelihood of AI hallucination.
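In practice, applications often trim older conversation history so that the prompt stays within the model’s context window, as in the rough sketch below. The 4-characters-per-token estimate and the 4,000-token budget are assumptions for illustration; production systems count tokens with the model’s own tokenizer.

    MAX_CONTEXT_TOKENS = 4000

    def estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)          # crude heuristic: roughly 4 characters per token

    def trim_history(messages: list[str]) -> list[str]:
        kept, used = [], 0
        for message in reversed(messages):     # walk backward so the most recent messages are kept
            cost = estimate_tokens(message)
            if used + cost > MAX_CONTEXT_TOKENS:
                break
            kept.append(message)
            used += cost
        return list(reversed(kept))            # restore chronological order

    print(trim_history(["(a long, older exchange)...", "user: please summarize our discussion"]))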
Retrieval Augmented Generation (RAG) is a technique that links LLMs to external knowledge
bases (e.g. peer reviewed papers, company documents, Wikipedia, etc.). This improves
both the efficiency and accuracy of models by allowing the LLM to search for and use
relevant information as needed, rather than relying on extensive input prompts, thus reducing the need for large context windows and prompt stuffing. The approach has the additional benefits of extensibility, durability (the data can be kept current), and improved flexibility, and it provides the ability to attribute responses to data sources and help ensure factual accuracy. However, it does require greater upfront effort and adds the complexity of building and managing the retrieval system.
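A minimal sketch of the RAG pattern is shown below: the documents and the user’s question are embedded, the most similar documents are retrieved, and they are prepended to the prompt. The bag-of-words embed() function is a crude stand-in for a real embedding model, the documents are invented, and the final call to an LLM is omitted.

    import numpy as np

    documents = [
        "The company travel policy allows economy-class flights only.",
        "Expense reports must be submitted within 30 days.",
        "The office is closed on public holidays.",
    ]

    vocab = sorted({word for doc in documents for word in doc.lower().split()})

    def embed(text: str) -> np.ndarray:
        words = text.lower().split()
        return np.array([float(words.count(term)) for term in vocab])  # word-count vector

    doc_vectors = np.array([embed(d) for d in documents])

    def retrieve(question: str, k: int = 2) -> list[str]:
        q = embed(question)
        scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * (np.linalg.norm(q) + 1e-9))
        top = np.argsort(scores)[::-1][:k]     # indices of the k most relevant documents
        return [documents[i] for i in top]

    question = "When must expense reports be submitted?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    print(prompt)   # this augmented prompt would then be sent to the LLM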