How to train a neural net
- Review of gradient descent, SGD
- Computation graphs
- Backprop through chains
- Backprop through MLPs
- Backprop through DAGs
- Differentiable programming
Gradient Descent
The core Gradient Descent update rule is:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

📌 Meaning of Each Symbol
- $\theta_t$: parameter vector at step $t$
- $\eta$: learning rate (step size)
- $\nabla_\theta J(\theta_t)$: gradient of the cost function
- $J(\theta)$: objective (loss) function
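As a quick sanity check, here is one update step worked by hand for the objective used later in this post, $J(\theta) = (\theta - 3)^2$, starting from $\theta_0 = 0$ with $\eta = 0.1$ (the same values as the Rust code below):

$$\nabla_\theta J(\theta_0) = 2(\theta_0 - 3) = 2(0 - 3) = -6, \qquad \theta_1 = \theta_0 - \eta \cdot (-6) = 0 + 0.6 = 0.6$$

The negative gradient pushes $\theta$ up toward the minimum at $\theta = 3$, matching the first iteration printed by the program below.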
CLI Visualization
- Imagine the curve:

    Cost
      ^
      | *
      |   *
      |     *
      |       *
      |         *
      |           *
      +-----------------------> theta
        0    1    2    3

- Each iteration moves right toward the minimum.
📌 Intuition
- Gradient Descent moves parameters in the opposite direction of the gradient, because:
- The gradient points toward the steepest increase
- We want to go toward the minimum
- So we subtract it
🔎 1D Case (Single Variable)
- If the function is:

  $$J(\theta) = (\theta - 3)^2$$

- Then the derivative is $\frac{dJ}{d\theta} = 2(\theta - 3)$, and the update becomes:

  $$\theta \leftarrow \theta - \eta \cdot 2(\theta - 3)$$
🔎 Multi-Dimensional Case
- If the objective has several parameters:

  $$J(\theta_1, \theta_2, \dots, \theta_n)$$

- Then each one follows the same rule:

  $$\theta_j \leftarrow \theta_j - \eta \frac{\partial J}{\partial \theta_j} \quad \text{for } j = 1, \dots, n$$

- Each parameter updates independently, using only its own partial derivative (see the two-parameter sketch below).
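To make this concrete, here is a small sketch (not part of the original example) that applies the same rule to a made-up two-parameter objective $J(\theta_1, \theta_2) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$; the objective, starting point, and learning rate are all illustrative choices:

```rust
fn main() {
    // Hypothetical objective: J(t1, t2) = (t1 - 3)^2 + (t2 + 1)^2
    // Minimum at t1 = 3, t2 = -1.
    let mut theta = [0.0_f64, 0.0_f64]; // initial guess for (theta_1, theta_2)
    let learning_rate = 0.1;

    for i in 0..50 {
        // Partial derivatives: dJ/dt1 = 2(t1 - 3), dJ/dt2 = 2(t2 + 1)
        let grad = [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)];

        // Each parameter applies the same update rule independently.
        theta[0] -= learning_rate * grad[0];
        theta[1] -= learning_rate * grad[1];

        if i % 10 == 0 {
            let cost = (theta[0] - 3.0).powi(2) + (theta[1] + 1.0).powi(2);
            println!("iter {:02} | theta = {:?} | cost = {:.6}", i, theta, cost);
        }
    }
    println!("Final theta ≈ {:?}", theta); // approaches [3.0, -1.0]
}
```

Each coordinate converges on its own because it only ever sees its own partial derivative.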
Rust Implementation
fn main() {
    let mut theta: f64 = 0.0; // initial guess
    let learning_rate: f64 = 0.1; // η
    let iterations = 50;

    for i in 0..iterations {
        // derivative of (theta - 3)^2
        let gradient = 2.0 * (theta - 3.0);

        // update rule
        theta = theta - learning_rate * gradient;

        println!(
            "iter {:02} | theta = {:.6} | cost = {:.6}",
            i,
            theta,
            (theta - 3.0).powi(2)
        );
    }

    println!("\nFinal theta ≈ {}", theta);
}

Why This Works
Because:
- If θ < 3 → gradient is negative → subtracting negative increases θ
- If θ > 3 → gradient is positive → subtracting decreases θ
Either way, θ automatically moves toward the minimum at θ = 3.
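A minimal sketch of both cases, using illustrative starting points on either side of the minimum (the helper function is mine, not from the post):

```rust
fn descend(mut theta: f64) -> f64 {
    let learning_rate = 0.1;
    for _ in 0..50 {
        let gradient = 2.0 * (theta - 3.0); // derivative of (theta - 3)^2
        theta -= learning_rate * gradient;  // subtract the gradient
    }
    theta
}

fn main() {
    // Starting below the minimum: gradient is negative, so theta increases.
    println!("from 0.0 -> {:.6}", descend(0.0));
    // Starting above the minimum: gradient is positive, so theta decreases.
    println!("from 5.0 -> {:.6}", descend(5.0));
    // Both runs approach theta = 3.
}
```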
📌 What It Means in Gradient Descent
$\mathcal{L}(\theta)$ represents the loss function.
So in ML:
- $\theta$ → model parameters
- $\mathcal{L}(\theta)$ → how bad the model is
Gradient Descent minimizes $\mathcal{L}(\theta)$.
Update rule:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)$$
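As a sketch of this view (the loss closure, learning rate, and helper below are hypothetical, not from the post), the same update rule can be driven by any loss function, here with the gradient approximated by central finite differences:

```rust
// Approximate the gradient of an arbitrary loss with central differences,
// then apply theta <- theta - eta * grad L(theta).
fn numerical_gradient<F: Fn(&[f64]) -> f64>(loss: &F, theta: &[f64]) -> Vec<f64> {
    let h = 1e-6;
    (0..theta.len())
        .map(|j| {
            let mut plus = theta.to_vec();
            let mut minus = theta.to_vec();
            plus[j] += h;
            minus[j] -= h;
            (loss(&plus) - loss(&minus)) / (2.0 * h)
        })
        .collect()
}

fn main() {
    // Hypothetical loss: L(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
    let loss = |theta: &[f64]| (theta[0] - 3.0).powi(2) + (theta[1] + 1.0).powi(2);
    let mut theta = vec![0.0, 0.0];
    let eta = 0.1;

    for _ in 0..200 {
        let grad = numerical_gradient(&loss, &theta);
        for (t, g) in theta.iter_mut().zip(grad) {
            *t -= eta * g; // move opposite to the gradient
        }
    }
    println!("theta ≈ {:?}", theta); // approaches [3.0, -1.0]
}
```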
📌 Why Use Script L?
In machine learning:
- $J(\theta)$ → often used in textbooks
- $\mathcal{L}(\theta)$ → common in research papers
They usually mean the same thing: objective / loss function.
📌 Other Similar L-like Symbols
| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathcal{L}$ | `\mathcal{L}` | Loss function |
| $L$ | `L` | Normal letter |
| $\ell$ | `\ell` | Lowercase script l |
| $\lambda$ | `\lambda` | Lambda |