Deep Learning

A comprehensive reference for deep learning fundamentals, architectures, training techniques, and regularization methods.

Neural Network Fundamentals

Neurons and Layers

A neuron computes a weighted sum of inputs, adds a bias, and applies an activation function:

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = W·x + b
output = activation(z)

Layer types:

Input layer: Receives raw features
Hidden layers: Learn intermediate representations
Output layer: Produces final predictions

import torch
import torch.nn as nn

class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()

        layers = []
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.BatchNorm1d(hidden_size),
                nn.ReLU(),
                nn.Dropout(0.3)
            ])
            prev_size = hidden_size
        layers.append(nn.Linear(prev_size, output_size))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

Universal Approximation Theorem

A feedforward network with a single hidden layer, enough neurons, and a non-polynomial activation can approximate any continuous function on a compact domain to arbitrary accuracy. The theorem guarantees that such a network exists; it does not say how many neurons are needed or how to find the weights by training.

Activation Functions

Function	Formula	Range	Properties
Sigmoid	1/(1+e^(-x))	(0, 1)	Saturates → vanishing gradient
Tanh	(eˣ - e^(-x))/(eˣ + e^(-x))	(-1, 1)	Zero-centered, still saturates
ReLU	max(0, x)	[0, ∞)	Fast, no vanishing gradient for x>0
Leaky ReLU	max(0.01x, x)	(-∞, ∞)	Fixes dying ReLU problem
GELU	x·Φ(x)	(-∞, ∞)	Used in Transformers, smooth
SiLU/Swish	x·σ(x)	(-∞, ∞)	Smooth, used in modern architectures
Softmax	eˣⁱ / Σeˣʲ	(0, 1), sums to 1	Multi-class output layer

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

print(torch.sigmoid(x))   # 0.12, 0.27, 0.50, 0.73, 0.88
print(torch.tanh(x))      # -0.96, -0.76, 0.0, 0.76, 0.96
print(F.relu(x))      # 0, 0, 0, 1, 2
print(F.leaky_relu(x, 0.01))  # -0.02, -0.01, 0, 1, 2
print(F.gelu(x))      # -0.046, -0.16, 0, 0.84, 1.95
print(F.softmax(x, dim=0))  # Probabilities summing to 1

Training

Forward Pass

Data flows through the network layer by layer, applying weights, biases, and activations.

Input x → Layer 1 → Layer 2 → ... → Output ŷ
Loss L = loss_function(ŷ, y)

Backpropagation

Computes gradients of the loss with respect to each parameter using the chain rule.

∂L/∂W₁ = ∂L/∂ŷ · ∂ŷ/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂W₁

Chain rule example (2-layer network):
∂L/∂W₁ = (∂L/∂a₂) · (∂a₂/∂z₂) · (∂z₂/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂W₁)
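
The five-factor product above can be checked numerically. The following is a minimal sketch, assuming a scalar two-layer network with sigmoid activations and a squared-error loss (these choices and all function names are illustrative, not from the original). It multiplies the local derivatives by hand and compares the result with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Scalar two-layer network: x -> z1 -> a1 -> z2 -> a2."""
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(w1, w2, x, y):
    a2 = forward(w1, w2, x)[3]
    return 0.5 * (a2 - y) ** 2

def grad_w1(w1, w2, x, y):
    """dL/dW1 as the product of the five local derivatives in the chain rule."""
    z1, a1, z2, a2 = forward(w1, w2, x)
    dL_da2 = a2 - y               # derivative of 0.5 * (a2 - y)^2
    da2_dz2 = a2 * (1.0 - a2)     # sigmoid'(z2)
    dz2_da1 = w2
    da1_dz1 = a1 * (1.0 - a1)     # sigmoid'(z1)
    dz1_dw1 = x
    return dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

w1, w2, x, y = 0.5, -0.3, 1.5, 1.0
analytic = grad_w1(w1, w2, x, y)

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree to many decimal places. This is the same check that gradient-checking utilities perform on larger networks.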

Gradient Descent Variants

Optimizer	Update Rule	Strengths	Weaknesses
SGD	w ← w - η∇L	Simple, memory efficient	Noisy, slow convergence
SGD + Momentum	v ← βv + ∇L; w ← w - ηv	Faster convergence	Learning rate tuning
AdaGrad	w ← w - η/√(G+ε)·∇L	Adaptive per-param LR	LR decreases monotonically
RMSProp	w ← w - η/√(E[g²]+ε)·∇L	Fixes AdaGrad issue	No bias correction
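Adam	Combines momentum + RMSProp, with bias correction	Fast, robust default	Two extra state values per parameter; can generalize worse than SGD
AdamW	Adam + decoupled weight decay	Strong default for deep learning	Same memory cost as Adam; decay coefficient needs tuning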
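
To make the "momentum + RMSProp" description concrete, here is a from-scratch sketch of the Adam update on a one-dimensional quadratic. The test function, hyperparameter values, and the name `adam_minimize` are assumptions chosen for illustration; real training would use `optim.Adam`.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update rule."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment: momentum-style average
        v = beta2 * v + (1 - beta2) * g * g    # second moment: RMSProp-style average
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)
```

Early steps have a size close to `lr` regardless of the gradient magnitude, because `m_hat / sqrt(v_hat)` is roughly the sign of the gradient at the start.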

import torch.optim as optim

# Common optimizers
model = NeuralNetwork(784, [256, 128], 10)

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
adam = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Learning rate schedulers (use one scheduler per optimizer)
cosine = optim.lr_scheduler.CosineAnnealingLR(adamw, T_max=100)
plateau = optim.lr_scheduler.ReduceLROnPlateau(sgd, patience=5, factor=0.5)    # call plateau.step(val_loss)
one_cycle = optim.lr_scheduler.OneCycleLR(adam, max_lr=0.01, total_steps=1000)  # call one_cycle.step() per batch

Regularization Techniques

Dropout

Randomly sets a fraction of neurons to zero during training, preventing co-adaptation.

# During training: zero each activation with probability p
# Classic dropout: scale activations by (1-p) at inference
# Inverted dropout (PyTorch default): scale survivors by 1/(1-p) during training, no change at inference

dropout = nn.Dropout(p=0.5)  # 50% dropout
# Standard: 0.1-0.5 for hidden layers, 0.0-0.1 for output

Why it works: Forces the network to learn redundant representations. Acts like an ensemble of 2^n subnetworks.
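
A from-scratch sketch of inverted dropout shows why no rescaling is needed at inference. The function name and the all-ones input are illustrative assumptions; `nn.Dropout` does the equivalent on tensors.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)   # inference: identity, no scaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10_000

train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False, rng=rng)

# The training-time mean stays close to the input mean because survivors are scaled up
train_mean = sum(train_out) / len(train_out)
print(train_mean)
```

Because the expected activation is the same in both modes, the layers after dropout see inputs on the same scale during training and evaluation.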

Batch Normalization

Normalizes layer inputs to have zero mean and unit variance within each mini-batch.

# Applied between linear layer and activation
nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),   # Normalize across batch dimension
    nn.ReLU()
)

# For images
nn.Conv2d(64, 128, 3, padding=1)
nn.BatchNorm2d(128)   # Normalize across spatial and batch dimensions
nn.ReLU()

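Why it works: Originally motivated as reducing internal covariate shift (later work attributes much of the benefit to a smoother loss landscape). In practice it allows higher learning rates, acts as a regularizer, and reduces sensitivity to initialization.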
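
The training-mode computation can be written out by hand. This is a minimal sketch on plain Python lists, assuming a batch of feature vectors and fixed γ = 1, β = 0. The function name is hypothetical, and running statistics for inference are omitted.

```python
import math

def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a batch of feature vectors (list of lists)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        # Statistics are computed per feature, across the batch
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

# Two features on very different scales
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_norm_1d(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```

After normalization both features land on the same scale, even though the raw values differ by a factor of 100.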

Layer Normalization

Normalizes across the feature dimension (not batch). Used in Transformers.

# Preferred in Transformers and RNNs where batch size varies
nn.LayerNorm(normalized_shape=512)
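
For contrast with batch normalization, the sketch below normalizes each sample on its own. It assumes no learned scale or shift, and the function name is illustrative. The point is that the result for one sample does not depend on what else is in the batch.

```python
import math

def layer_norm(sample, eps=1e-5):
    """Normalize one sample across its features (learned scale/shift omitted)."""
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [(v - mean) / math.sqrt(var + eps) for v in sample]

alone = layer_norm([1.0, 2.0, 3.0, 4.0])

# The same sample inside a batch with a very different neighbor
batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 50.0]]
batch_out = [layer_norm(row) for row in batch]
print(batch_out[0] == alone)
```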

Weight Decay (L2 Regularization)

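# AdamW applies decoupled weight decay: each step shrinks the weights directly
# instead of adding an L2 term to the loss (the two coincide only for plain SGD)
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.01  # Weight decay coefficient
)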

Early Stopping

class EarlyStopping:
    def __init__(self, patience=7, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
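
A usage sketch for the class above, with the class repeated so the snippet runs on its own. The validation losses are made-up numbers for illustration, and `patience=3` is an arbitrary choice.

```python
class EarlyStopping:
    # Same logic as the class above, repeated so this snippet is self-contained
    def __init__(self, patience=7, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# Made-up validation losses: improvement stalls after the third epoch
val_losses = [0.90, 0.70, 0.60, 0.61, 0.60, 0.62, 0.59, 0.63]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper(val_loss):
        stopped_at = epoch   # three consecutive epochs without improvement
        break

print(stopped_at, stopper.best_loss)
```

Training stops before the later 0.59 is ever seen, which is the usual trade-off when choosing `patience`. In practice the model weights from the best epoch are also saved and restored.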

Gradient Clipping

# Prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
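
What norm-based clipping does can be sketched in a few lines of plain Python. The function name is hypothetical and gradients are shown as a flat list. In a training loop the PyTorch call above goes after `loss.backward()` and before `optimizer.step()`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor if their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]   # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Gradients already within the limit pass through unchanged
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, norm_after, small)
```

Every gradient is scaled by the same factor, so the update direction is preserved and only its length changes.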

Architecture Overview

Convolutional Neural Networks (CNNs)

Process spatial data (images, time series) using local filters.

Input Image → [Conv → BN → ReLU → Pool] × N → Flatten → FC → Output

Key layers:
- Conv2d: Learn local features using learned filters
- MaxPool2d: Spatial downsampling, translation invariance
- BatchNorm2d: Stabilize training
- Global Average Pooling: Reduce spatial dims without flatten

Famous architectures: AlexNet, VGG, ResNet, EfficientNet, ConvNeXt

# ResNet-style block with skip connection
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # Skip connection
        return F.relu(out)

Recurrent Neural Networks (RNNs) and LSTMs

Process sequential data with memory of previous steps.

RNN: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b)
LSTM adds: forget gate, input gate, output gate, cell state
GRU: Simpler version of LSTM with update and reset gates

Problems with vanilla RNNs:
- Vanishing gradients over long sequences
- Can't capture long-range dependencies
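
The vanishing-gradient problem can be seen directly in a scalar version of the RNN recurrence above. This sketch assumes one-dimensional weights and a constant input, with illustrative values. It tracks the magnitude of ∂hₜ/∂h₀, which is a product of one local derivative per time step.

```python
import math

def rnn_gradient_through_time(w_x, w_h, b, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh RNN."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        grad *= w_h * (1.0 - h * h)   # local derivative d h_t / d h_{t-1}
        history.append(abs(grad))
    return history

history = rnn_gradient_through_time(w_x=1.0, w_h=0.5, b=0.0, inputs=[1.0] * 50)
print(history[0], history[9], history[-1])
```

Each factor has magnitude below 1 here, so the gradient with respect to early states shrinks geometrically with sequence length. This is why the first inputs of a long sequence contribute almost nothing to the weight update.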

# LSTM for sequence classification
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell) = self.lstm(embedded)
        # Concat final hidden states from both directions
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        return self.fc(self.dropout(hidden))

Transformers

See Transformers Deep Dive for detailed coverage.

BERT (encoder-only): Bidirectional context, great for classification, NER, QA
GPT (decoder-only): Autoregressive generation, text completion
T5/BART (encoder-decoder): Seq2Seq, translation, summarization

Fine-Tuning LLMs (LoRA / QLoRA / PEFT)

See Fine-Tuning Guide for detailed coverage.

LoRA — low-rank adaptation, trains only A·B matrices, <1% of parameters
QLoRA — NF4 quantization + LoRA, fine-tune 7B models on a single GPU
Instruction tuning — teach models to follow natural language instructions
DPO — preference optimization without RL infrastructure

Computer Vision

See Computer Vision Guide for detailed coverage.

CNNs — convolutions, residual blocks, receptive fields
Architectures — ResNet, EfficientNet, MobileNet, ViT
Object detection — Faster R-CNN, YOLO, mAP, NMS, IoU
Segmentation — U-Net, semantic vs instance
Augmentation — CutMix, MixUp, RandAugment

Diffusion Models

Generative models that learn to reverse a gradual noising process.

Forward process: x₀ → x₁ → ... → xₜ (add Gaussian noise)
Reverse process: xₜ → xₜ₋₁ → ... → x₀ (learned denoising)
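
The forward process has a convenient closed form in the DDPM formulation: xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε, where ᾱₜ is the cumulative product of (1 − βₜ). The text above does not specify a noise schedule, so the sketch below assumes a linear β schedule from 1e-4 to 0.02 over 1000 steps. All names are illustrative.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars

def noise_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]

x_early = noise_sample(x0, 10, alpha_bars, rng)    # still close to x0
x_late = noise_sample(x0, 999, alpha_bars, rng)    # almost pure Gaussian noise
print(alpha_bars[10], alpha_bars[999])
```

The signal coefficient √ᾱₜ decays toward zero as t grows, so the final step is nearly independent of x₀. The learned reverse process starts from that noise.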

Used in: Stable Diffusion (images), DALL-E 3, AudioLDM (audio), protein generation.

Transfer Learning and Fine-Tuning

Approaches

Approach	When to Use	Example
Feature extraction	Small dataset, similar domain	ResNet features for medical images
Fine-tuning last layers	Moderate dataset, similar domain	BERT for text classification
Full fine-tuning	Large dataset, sufficient compute	GPT for domain-specific generation
Few-shot prompting	Very small dataset	Prompt engineering with LLMs

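import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

num_classes = 10  # Placeholder: set to the number of classes in the new task

# Load pretrained model (the `weights` argument replaces the deprecated `pretrained=True`)
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Option 1: Feature extraction (freeze all but classifier)
for param in resnet.parameters():
    param.requires_grad = False

# Replace classifier head for new task
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, num_classes)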
# Only resnet.fc parameters will be updated

# Option 2: Fine-tune last few layers
for param in resnet.layer4.parameters():
    param.requires_grad = True
for param in resnet.fc.parameters():
    param.requires_grad = True

# Different learning rates for different layers
optimizer = optim.AdamW([
    {"params": resnet.layer4.parameters(), "lr": 1e-4},
    {"params": resnet.fc.parameters(), "lr": 1e-3}
])

Loss Functions

Loss	Use Case	Formula
MSE	Regression	`(1/n)Σ(y - ŷ)²`
MAE	Regression (outlier robust)	`(1/n)Σ|y - ŷ|`
Binary Cross-Entropy	Binary classification	`-[y·log(ŷ) + (1-y)·log(1-ŷ)]`
Categorical Cross-Entropy	Multi-class classification	`-Σ yᵢ·log(ŷᵢ)`
Hinge Loss	SVM, margin-based	`max(0, 1 - y·ŷ)`
KL Divergence	Distribution matching	`Σ P·log(P/Q)`
Contrastive / Triplet	Embeddings, metric learning	`max(0, d(a,p) - d(a,n) + margin)`

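import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 2.0, 5.0])  # Hypothetical per-class weights for a 3-class task

criterion_mse = nn.MSELoss()
criterion_mae = nn.L1Loss()
criterion_bce = nn.BCEWithLogitsLoss()   # Takes raw logits; more stable than sigmoid + BCELoss
criterion_ce = nn.CrossEntropyLoss()     # Takes raw logits; applies log-softmax internally
criterion_ce_weighted = nn.CrossEntropyLoss(weight=class_weights)  # For imbalance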
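
Cross-entropy on raw logits folds the softmax into the loss, which is also what makes it numerically stable. A plain-Python sketch of the idea, using the log-sum-exp trick (function names and logit values are illustrative):

```python
import math

def cross_entropy_from_logits(logits, target_index):
    """Negative log-softmax of the target class, computed via log-sum-exp."""
    m = max(logits)   # subtracting the max avoids overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]

loss_fused = cross_entropy_from_logits(logits, target_index=0)
loss_two_step = -math.log(softmax(logits)[0])   # softmax first, then log

# Very large logits are handled without overflow
loss_extreme = cross_entropy_from_logits([1000.0, 0.0], target_index=0)
print(loss_fused, loss_two_step, loss_extreme)
```

The fused and two-step versions agree on ordinary inputs. The fused form stays finite when a naive `exp(1000.0)` would overflow, which is why the model should output logits and not probabilities when used with this loss.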

Interview Q&A

Q1: What is the vanishing gradient problem and how do we solve it? 🟡 Intermediate

During backpropagation through deep networks, the gradient is multiplied at every layer by that layer's weights and by the derivative of its activation function. The sigmoid derivative is at most 0.25 and the tanh derivative is at most 1 (and far smaller once the unit saturates), so the product shrinks roughly exponentially with depth and early layers barely learn. Solutions: (1) ReLU activations (gradient = 1 for positive inputs), (2) residual/skip connections (gradient flows directly), (3) batch normalization (normalizes inputs, prevents saturation), (4) careful initialization (Xavier/He), (5) gated architectures such as LSTM/GRU for sequence models. Gradient clipping does not help here; it addresses exploding gradients.

Q2: Explain backpropagation using the chain rule. 🔴 Advanced

Backprop computes ∂L/∂w for all parameters. For a 2-layer network: L = loss(output, y), output = σ(W₂·h), h = σ(W₁·x).

Chain rule: ∂L/∂W₁ = ∂L/∂output · ∂output/∂h · ∂h/∂W₁

Each term is computed locally: ∂L/∂output from the loss, ∂output/∂h = W₂ · σ'(W₂·h), ∂h/∂W₁ = σ'(W₁·x) · xᵀ. PyTorch's autograd engine records operations during the forward pass (computational graph) and traverses it in reverse for backprop.

Q3: What is batch normalization and why does it help? 🟡 Intermediate

Batch normalization normalizes a layer's inputs to zero mean and unit variance within each mini-batch: x̂ = (x - μB) / √(σB² + ε), then scales and shifts with learned parameters γ and β: y = γx̂ + β. Benefits: it was originally motivated as reducing internal covariate shift (the distribution of layer inputs changing during training), though later work attributes much of the gain to a smoother optimization landscape; it allows higher learning rates, acts as a regularizer (noise from batch statistics), and reduces sensitivity to weight initialization. At inference, it uses the running mean/variance computed during training.

Q4: What is the difference between BatchNorm and LayerNorm? 🟡 Intermediate

BatchNorm normalizes across the batch dimension (computes statistics over the batch for each feature). LayerNorm normalizes across the feature dimension (computes statistics over all features for each sample). BatchNorm requires a minimum batch size to compute stable statistics and has issues with very small batches. LayerNorm works per sample, so it's batch-size independent and preferred for Transformers (variable-length sequences), RNNs, and online learning.

Q5: What is dropout and does it help at inference time? 🟢 Beginner

Dropout randomly sets a fraction p of neurons to zero during training. This prevents neurons from co-adapting and forces learning of redundant representations. At inference, dropout is disabled (model.eval() in PyTorch) and all neurons are active. To compensate for the increased scale (all neurons active vs fraction during training), activations are scaled by 1/(1-p) during training (inverted dropout, PyTorch default).

Q6: What are residual connections and why are they important? 🟡 Intermediate

Residual connections (skip connections) add the input of a block directly to its output: output = F(x) + x. Key benefits: (1) gradients flow directly through the skip connection, addressing vanishing gradients in very deep networks, (2) the block only needs to learn the residual F(x) (what to add to x), which is often small and easier to learn, (3) enable training of very deep networks (100+ layers). Introduced in ResNet (2015) and fundamental to Transformers (residual around each attention and FFN block).

Q7: Explain the attention mechanism in neural networks. 🔴 Advanced

Attention computes a weighted sum of values where weights are based on similarity between a query and keys.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

1. Q, K, V are linear projections of the input
2. QKᵀ computes similarity scores between queries and keys
3. Divide by √dₖ so that the dot products do not grow with dimension and push the softmax into saturated regions where gradients are tiny
4. Softmax converts to probabilities
5. Multiply by V to get the attended representation

Self-attention: Q, K, V all come from the same sequence (captures relationships within the sequence). Cross-attention: Q from one sequence (decoder), K and V from another (encoder output).
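
The five steps can be sketched without any tensor library. The example below assumes Q, K, and V are already projected and given as small lists of lists. It omits masking, batching, and multiple heads, and the names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])
```

The first query matches the first and third keys equally well, so those two values receive equal weight and the second key receives less.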

Q8: What is the difference between CNNs and Transformers for image tasks? 🔴 Advanced

CNNs use local convolutional filters — effective for capturing local patterns and translation equivariance. They process images hierarchically (local → global). Transformers use self-attention — each patch can attend to every other patch from the start, capturing long-range dependencies. ViT (Vision Transformer) splits images into patches and processes them as a sequence. Transformers need more data than CNNs to learn useful representations but scale better. Modern architectures (ConvNeXt, Swin Transformer) blend both approaches.

Q9: What is the exploding gradient problem and how do you prevent it? 🟡 Intermediate

Exploding gradients occur when gradient values grow exponentially through deep layers, causing extremely large weight updates that destabilize training (NaN losses). Common in RNNs. Solutions: (1) gradient clipping — cap gradients at a maximum norm, (2) weight initialization — use Xavier/He initialization, (3) batch normalization — normalizes activations, preventing extreme values, (4) LSTM/GRU — gating mechanisms limit gradient flow, (5) residual connections — provide gradient highways.

Q10: What is the difference between fine-tuning and training from scratch? 🟢 Beginner

Training from scratch initializes weights randomly and trains on a dataset from nothing — requires large datasets and significant compute. Fine-tuning starts from pretrained weights (learned on a large dataset) and continues training on a task-specific dataset — requires much less data and compute. Fine-tuning leverages transfer learning: knowledge from the pretraining task generalizes to the new task. For most practical NLP and vision tasks, fine-tuning a pretrained model outperforms training from scratch.

Q11: What is the dying ReLU problem? 🟡 Intermediate

A ReLU neuron "dies" when its input is always negative — gradient is 0, so the neuron never updates and permanently outputs 0. Causes: high learning rates causing large negative weight updates, poor initialization, gradient flow issues. Solutions: Leaky ReLU (small negative slope for x<0), ELU, GELU/SiLU (smooth, always non-zero gradient), careful learning rate tuning, good initialization.

Q12: Compare the Adam and SGD optimizers. 🔴 Advanced

SGD with momentum: Updates based on gradient with velocity accumulation. Simple, generalizes well in some domains. Requires careful learning rate tuning. Often finds wider minima (better generalization).

Adam: Maintains adaptive learning rates per parameter using running estimates of the first moment (mean of gradients) and second moment (mean of squared gradients, i.e. uncentered variance). Converges faster, less sensitive to learning rate. Can overfit more easily, may converge to sharp minima with worse generalization. For weight decay, use AdamW, which decouples the decay from the adaptive gradient scaling. In practice: Adam/AdamW for transformers and NLP, SGD+momentum often used in CV tasks.

Q13: What is the universal approximation theorem and what are its limitations? 🔴 Advanced

The theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of Rⁿ to arbitrary precision, given a non-polynomial activation function.

Limitations: (1) "sufficient neurons" can be exponential in input dimensions, (2) says nothing about how to find the weights (optimization), (3) doesn't address generalization (approximating training data ≠ generalizing to new data), (4) approximating a function and actually learning it from finite noisy data are different problems.

Q14: What is gradient checkpointing? 🔴 Advanced

Gradient checkpointing trades compute for memory. Normally, all intermediate activations are stored during the forward pass for use in backpropagation — memory grows linearly with depth. With gradient checkpointing, only a subset of activations are stored (checkpoints); others are recomputed from the nearest checkpoint during backpropagation. This reduces memory from O(n) to O(√n) at the cost of ~30% more compute. Essential for training large models on memory-constrained hardware.
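
The O(√n) figure follows from a simple count. Under a simplified model (an assumption, not a measurement), checkpointing every k layers keeps about n/k checkpoints for the whole pass plus the k activations of the one segment currently being recomputed. The sketch below searches for the best k.

```python
import math

def peak_activations(num_layers, segment_size):
    """Checkpoints kept for the whole pass plus one segment recomputed at a time."""
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size

num_layers = 100
no_checkpointing = num_layers   # every activation is stored

best_k = min(range(1, num_layers + 1), key=lambda k: peak_activations(num_layers, k))
best_peak = peak_activations(num_layers, best_k)
print(best_k, best_peak, no_checkpointing)
```

For 100 layers the minimum is at a segment size of 10, which is √100, giving a peak of 20 stored activations against 100 without checkpointing.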

Q15: What are the key differences between encoder-only, decoder-only, and encoder-decoder transformer architectures? 🟡 Intermediate

Architecture	Examples	Self-Attention	Use Case
Encoder-only	BERT, RoBERTa	Bidirectional (all tokens attend to all)	Classification, NER, QA
Decoder-only	GPT, Llama, Gemma	Causal (tokens only attend to past)	Text generation, chat
Encoder-Decoder	T5, BART, Whisper	Encoder: bidirectional; Decoder: causal + cross-attention	Translation, summarization, speech

Q16: What is knowledge distillation? 🔴 Advanced

Knowledge distillation trains a small "student" model to mimic a large "teacher" model. The student learns from soft probability distributions (temperature-scaled logits) output by the teacher rather than just hard labels. Soft targets carry richer information about class relationships (e.g., "30% likely class A, 20% likely class B"). This allows small models to achieve performance close to large models, enabling deployment on constrained hardware. Example: DistilBERT (about 40% fewer parameters than BERT-base while retaining about 97% of its language-understanding performance).
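
The soft-target part of the loss can be sketched as a KL divergence between temperature-softened distributions. The logits and temperature below are made-up values and the function names are illustrative. The T² factor is the usual scaling that keeps gradient magnitudes comparable across temperatures.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return kl * temperature ** 2

teacher_logits = [5.0, 2.0, 0.5]

# A higher temperature spreads probability mass onto the non-argmax classes
hard = softmax_with_temperature(teacher_logits, 1.0)
soft = softmax_with_temperature(teacher_logits, 4.0)

loss_same = distillation_loss(teacher_logits, teacher_logits)
loss_diff = distillation_loss([0.5, 2.0, 5.0], teacher_logits)
print(hard, soft, loss_same, loss_diff)
```

In a full setup this term is typically combined with ordinary cross-entropy on the hard labels, weighted by a mixing coefficient.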

Q17: What is a hyperparameter and how do you tune them? 🟢 Beginner

Hyperparameters are configuration values set before training that control the learning process (not learned from data). Examples: learning rate, batch size, number of layers, dropout rate, weight decay. Tuning methods: (1) grid search — exhaustive search over specified values, (2) random search — randomly sample combinations, often more efficient, (3) Bayesian optimization — use probabilistic model to select promising hyperparameters, (4) population-based training — evolve hyperparameters during training. Tools: Optuna, W&B Sweeps, Ray Tune.

Q18: What is a variational autoencoder (VAE)? 🔴 Advanced

A VAE is a generative model whose encoder outputs a distribution over latent codes rather than a single deterministic code. The encoder maps the input to the mean μ and standard deviation σ (in practice usually the log-variance log σ²) of a Gaussian. A latent z ~ N(μ, σ²) is sampled and the decoder reconstructs the input from it. Loss = reconstruction loss + KL divergence (pushes latent distribution toward N(0, I)). The reparameterization trick (z = μ + σ·ε, ε ~ N(0, I)) enables gradient flow through the sampling step. VAEs enable smooth latent space interpolation and new sample generation.
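
Two pieces of the VAE loss have short closed forms: the reparameterized sample and the KL term against a standard normal. The sketch below assumes a diagonal Gaussian and an encoder that outputs the log-variance (a common convention, assumed here). Names and values are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1); all randomness sits in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]   # sigma = 1.0 and sigma = 0.5

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
print(z, kl, kl_at_prior)
```

Because z is a deterministic function of μ, log σ², and the external noise ε, gradients can pass through the sample to the encoder outputs. The KL term is zero exactly when the encoder outputs the prior.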

Q19: What is the difference between discriminative and generative models? 🟡 Intermediate

Discriminative models learn the conditional probability P(Y|X) — the boundary between classes. Examples: logistic regression, SVM, BERT for classification. Generally better at classification when trained on sufficient labeled data.

Generative models learn the joint distribution P(X, Y) or just P(X). Can generate new samples from the learned distribution. Examples: Naive Bayes, VAE, GAN, diffusion models. Useful when you need to generate data, with limited labeled data (using P(X) as prior), or for understanding data structure.

Q20: What is gradient accumulation and when would you use it? 🟡 Intermediate

Gradient accumulation delays optimizer updates for multiple forward/backward passes, effectively increasing the batch size without increasing memory. Instead of updating after each batch, gradients are accumulated over N mini-batches, then an optimizer step is taken. Use when: memory constraints prevent large batch training, but the optimization benefits from large batches (stability, better generalization). Common in LLM fine-tuning: per_device_batch_size=4, gradient_accumulation_steps=8 → effective batch size 32.
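
The equivalence with a larger batch can be checked on a toy problem. This sketch assumes a one-parameter linear model with a mean-squared-error loss and made-up data. It compares the full-batch gradient with the sum of scaled micro-batch gradients.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full_batch_grad = mse_grad(w, xs, ys)

accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    # Dividing by the number of steps makes the sum match the full-batch mean
    accumulated += mse_grad(w, micro_x, micro_y) / accumulation_steps
# An optimizer step would use `accumulated` here and then reset it to zero

print(full_batch_grad, accumulated)
```

The match is exact only for equal-sized micro-batches and for models without batch-dependent layers. BatchNorm, for example, still computes its statistics per micro-batch.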
