
LLM Basics and Terminology

  • Large Language Models (LLMs) are probabilistic engines that predict the next token in a sequence, based on patterns learned from massive datasets.
  • The Transformer architecture, specifically the self-attention mechanism, is the foundational technology enabling parallel processing of language.
  • Training typically involves three stages: Pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF).
  • Understanding parameters, context windows, and tokenization is essential for optimizing model performance and managing computational costs (see the tokenization sketch below).
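
Tokenization is the first practical concept to get hands-on with: every prompt and completion is bounded and billed in tokens, not characters. Below is a minimal sketch, assuming the tiktoken package (pip install tiktoken) as one concrete BPE tokenizer; any equivalent tokenizer illustrates the same point.

Python
import tiktoken

# Load a byte-pair encoding (cl100k_base is the encoding used by several OpenAI models)
enc = tiktoken.get_encoding("cl100k_base")

text = "Large Language Models predict the next token."
token_ids = enc.encode(text)

print("Token count:", len(token_ids))        # this is what consumes the context window
print("Token IDs:", token_ids)
print("Round-trip:", enc.decode(token_ids))  # decoding recovers the original text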

Why It Matters

01
Legal industry

In the legal industry, firms like Harvey AI use LLMs to automate document review and contract analysis. By ingesting thousands of pages of case law and internal filings, the model can identify potential liabilities or inconsistencies in a fraction of the time it would take a human paralegal. This allows lawyers to focus on high-level strategy rather than manual text extraction.

02
Software engineering

In software engineering, GitHub Copilot utilizes LLMs to assist developers by suggesting code completions in real-time. The model is trained on vast repositories of open-source code, allowing it to understand syntax, library usage, and even complex architectural patterns. This significantly boosts developer productivity by reducing the need to search for boilerplate code or documentation.

03
Healthcare sector

In the healthcare sector, companies are exploring the use of LLMs for clinical documentation and patient interaction summarization. By transcribing doctor-patient conversations and converting them into structured SOAP notes (Subjective, Objective, Assessment, Plan), models reduce the administrative burden on physicians. This frees doctors to spend more time engaging with patients and less time navigating electronic health records.

How it Works

The Architecture of Language

At its core, a Large Language Model is a statistical engine designed to predict the next token in a sequence. Imagine you are reading a book and someone covers the next word; your brain uses the context of the previous sentences to guess what comes next. An LLM does exactly this, but on a scale of trillions of tokens. It does not "understand" language in the human sense; rather, it maps linguistic patterns into a high-dimensional vector space where related concepts are clustered together.
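
To make the "statistical engine" idea concrete, here is a toy sketch of next-token prediction. The five-word vocabulary and the logits are invented for illustration; a real model produces logits over tens of thousands of tokens.

Python
import torch
import torch.nn.functional as F

# Toy vocabulary and made-up scores (logits) for the context "The cat sat on the"
vocab = ["mat", "dog", "roof", "moon", "chair"]
logits = torch.tensor([4.0, 0.5, 2.0, -1.0, 1.5])  # higher score = more likely

# Softmax converts raw scores into a probability distribution over next tokens
probs = F.softmax(logits, dim=-1)
for word, p in zip(vocab, probs):
    print(f"{word}: {p.item():.3f}")

# Greedy decoding picks the most likely token; sampling draws from the distribution
print("Greedy next token:", vocab[int(torch.argmax(probs))])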

The shift from older architectures like Recurrent Neural Networks (RNNs) to the Transformer architecture, introduced in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017), changed everything. RNNs processed text sequentially, which was slow and struggled to remember information from the beginning of a long paragraph. Transformers, however, process the entire sequence simultaneously. This parallelization allows them to be trained on massive datasets, such as the entire public internet, leading to the "Large" in LLM.


The Training Pipeline

LLM development is generally divided into three phases. First is Pre-training, where the model learns the statistical structure of language by predicting the next token across billions of sentences. This phase is computationally expensive and requires massive GPU clusters. The output is a "base model" that is excellent at completing text but not necessarily good at following instructions.
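
The pre-training objective itself is simple to state: minimize the cross-entropy between the model's predicted distribution at each position and the token that actually comes next. A minimal sketch, with random tensors standing in for a real model's outputs:

Python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 16
token_ids = torch.randint(0, vocab_size, (seq_len,))  # one training sequence
logits = torch.randn(seq_len - 1, vocab_size)         # stand-in for model outputs

# At position t the model predicts token t+1, so targets are the inputs shifted by one
targets = token_ids[1:]
loss = F.cross_entropy(logits, targets)
print("Next-token cross-entropy:", loss.item())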

Second is Supervised Fine-Tuning (SFT). Here, the base model is trained on a smaller, curated dataset of instruction-response pairs. This teaches the model how to behave like an assistant rather than just a text-completer. Finally, Alignment (often via RLHF) is used to ensure the model’s outputs are helpful, honest, and harmless. By having humans rank different model responses, we can use Reinforcement Learning to nudge the model toward preferred behaviors.
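
Here is a hedged sketch of what a single SFT record might look like; the template below is illustrative rather than any specific model's chat format. The key detail is that the training loss is typically computed only on the response tokens, so the model learns to answer prompts rather than to predict them.

Python
# One illustrative SFT record (the template is hypothetical, not a specific model's format)
example = {
    "instruction": "Summarize the Transformer architecture in one sentence.",
    "response": "The Transformer uses self-attention to process all tokens in parallel.",
}

prompt = f"### Instruction:\n{example['instruction']}\n### Response:\n"
full_text = prompt + example["response"]

# In common PyTorch pipelines, prompt tokens get label -100 so they are ignored
# by the loss, and gradients flow only through the response tokens.
print(full_text)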


Challenges and Edge Cases

One of the most critical aspects of LLM usage is the concept of the "Context Window." While models like GPT-4 or Claude have expanded this window to hundreds of thousands of tokens, the "Lost in the Middle" phenomenon remains a challenge. Research shows that models are often better at retrieving information from the beginning or end of a prompt than from the middle. Furthermore, LLMs exhibit prompt sensitivity: changing a single word in a prompt can drastically alter the output quality.
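
One way to probe "Lost in the Middle" behavior is a needle-in-a-haystack test: plant a known fact at different depths of a long filler document and check whether the model can retrieve it from each position. The sketch below only builds the test prompts; ask_model is a hypothetical placeholder for whatever completion API you use.

Python
def build_needle_prompt(filler_sentences, needle, position):
    """Insert the needle fact at a relative depth (0.0 = start, 1.0 = end)."""
    idx = int(position * len(filler_sentences))
    doc = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(doc) + "\nQuestion: What is the secret code?"

filler = ["The sky was clear that day."] * 200
needle = "The secret code is 7421."

for pos in (0.0, 0.5, 1.0):
    prompt = build_needle_prompt(filler, needle, pos)
    # answer = ask_model(prompt)  # hypothetical call to your LLM of choice
    print(f"Needle at depth {pos:.0%}: prompt is {len(prompt):,} characters")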

Another edge case is "Catastrophic Forgetting." When you fine-tune a model on a new, specific task, it may lose the general knowledge it acquired during pre-training. Practitioners often turn to Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) to update only a small subset of weights, preserving the model's foundational capabilities while adapting it to new domains.
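
The core LoRA idea fits in a few lines of PyTorch: freeze the pretrained weight and train only a low-rank update B·A, with rank r far smaller than the layer dimensions. This is a minimal sketch of the mechanism, not the production peft library.

Python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(A x)."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)      # the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable parameters: {trainable} of {total}")  # only the low-rank factors train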

Common Pitfalls

  • "LLMs are sentient or possess consciousness." Learners often anthropomorphize models because they use natural language. It is crucial to remember that LLMs are purely mathematical functions mapping input distributions to output distributions, lacking any internal state or awareness.
  • "More parameters always equal better performance." While scaling laws suggest that larger models generally perform better, smaller, highly optimized models (like Mistral 7B or Llama-3-8B) can often outperform massive models on specific tasks. Efficiency and data quality are frequently more important than raw parameter counts.
  • "LLMs know facts." People often treat LLMs as databases, but they are lossy compressors of information. They do not have a reliable way to verify facts, which is why they hallucinate: they prioritize the statistical likelihood of a word sequence over factual accuracy.
  • "Fine-tuning is always necessary." Many users jump to fine-tuning when they could achieve better results with Retrieval-Augmented Generation (RAG). Fine-tuning changes the model's behavior or style, while RAG provides the model with external, up-to-date knowledge without altering the weights (see the sketch after this list).
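
To make the RAG contrast concrete, here is a minimal sketch under simplifying assumptions: embed is a hypothetical stand-in for a real embedding model, retrieval is a cosine-similarity lookup over three documents, and the retrieved text is simply prepended to the prompt. Production systems use trained embedding models and vector databases, but the principle is the same: the LLM's weights are never touched.

Python
import torch
import torch.nn.functional as F

def embed(text):
    # Hypothetical stand-in: a real system would call a trained embedding model
    torch.manual_seed(sum(ord(c) for c in text))
    return F.normalize(torch.randn(64), dim=0)

docs = [
    "LoRA trains low-rank adapter matrices.",
    "RAG retrieves documents at query time.",
    "Transformers process tokens in parallel.",
]
query = "How does retrieval-augmented generation work?"

# Score every document against the query (unit vectors, so dot product = cosine similarity)
doc_vecs = torch.stack([embed(d) for d in docs])
scores = doc_vecs @ embed(query)
best = docs[int(torch.argmax(scores))]

# The retrieved context is prepended to the prompt; the model's weights stay unchanged
prompt = f"Context: {best}\n\nQuestion: {query}"
print(prompt)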

Sample Code

Python
import torch
import torch.nn.functional as F

# Minimal single-head scaled dot-product attention:
# softmax(Q K^T / sqrt(d_k)) V. Batching, causal masking, and the learned
# Q/K/V projections of a full Transformer layer are omitted for clarity.
def self_attention(q, k, v):
    # q, k, v are tensors of shape (seq_len, head_dim)
    d_k = q.size(-1)
    # Scaled dot-product scores of shape (seq_len, seq_len); dividing by
    # sqrt(d_k) keeps the softmax from saturating as head_dim grows
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    # Apply softmax to get attention weights
    weights = F.softmax(scores, dim=-1)
    # Return weighted sum of values
    return torch.matmul(weights, v)

# Example usage
seq_len, head_dim = 4, 8
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)

output = self_attention(q, k, v)
print("Attention Output Shape:", output.shape)
# Expected Output: Attention Output Shape: torch.Size([4, 8])

Key Terms