260211_How_to_Train_a_Neural_Net_SGD001
How to train a neural net
Review of gradient descent, SGD
Computation graphs
Backprop through chains
Backprop through MLPs
Backprop through DAGs
Differentiable programming
Gradient Descent
The core Gradient Descent update rule is:

$$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)$$

📌 Meaning of Each Symbol
- $\theta_t$: parameter vector at step $t$
- $\eta$: learning rate (step size)
- $\nabla J(\theta)$: gradient of the cost function
- $J(\theta)$: objective (loss) function
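The update rule can be sketched as a tiny Rust function; the names (`gd_step`) and the example objective $J(\theta) = \theta^2$ are illustrative, not part of the original post:

```rust
// One generic gradient-descent step: theta <- theta - eta * grad(theta).
// The gradient is passed in as a closure so any differentiable J works.
fn gd_step(theta: f64, eta: f64, grad: impl Fn(f64) -> f64) -> f64 {
    theta - eta * grad(theta)
}

fn main() {
    // Example objective J(theta) = theta^2, so grad J(theta) = 2*theta.
    let grad = |t: f64| 2.0 * t;
    let mut theta = 4.0;
    for _ in 0..10 {
        theta = gd_step(theta, 0.1, grad);
    }
    // Each step multiplies theta by (1 - 0.1 * 2) = 0.8, so theta shrinks toward 0.
    println!("theta after 10 steps: {:.4}", theta); // prints 0.4295
}
```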
CLI Visualization
- Imagine the curve:

```
Cost
 ^
 |        *
 |       *
 |      *
 |     *
 |    *
 |   *
 +-----------------------> theta
    0    1    2    3
```

- Each iteration moves right toward the minimum.
📌 Intuition
- Gradient Descent moves parameters in the opposite direction of the gradient, because:
- The gradient points toward the steepest increase
- We want to go toward the minimum
- So we subtract it
🔎 1D Case (Single Variable)
- If the function is: $J(\theta) = (\theta - 3)^2$
- Then the update becomes: $\theta \leftarrow \theta - \eta \cdot 2(\theta - 3)$
🔎 Multi-Dimensional Case
- If: $\theta = (\theta_1, \theta_2, \dots, \theta_n)$
- Then: $\nabla J(\theta) = \left(\frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_n}\right)$
- Each parameter updates independently: $\theta_j \leftarrow \theta_j - \eta \frac{\partial J}{\partial \theta_j}$
Rust Implementation

```rust
fn main() {
    let mut theta: f64 = 0.0; // initial guess
    let learning_rate: f64 = 0.1; // η
    let iterations = 50;
    for i in 0..iterations {
        // derivative of (theta - 3)^2
        let gradient = 2.0 * (theta - 3.0);
        // update rule
        theta = theta - learning_rate * gradient;
        println!(
            "iter {:02} | theta = {:.6} | cost = {:.6}",
            i,
            theta,
            (theta - 3.0).powi(2)
        );
    }
    println!("\nFinal theta ≈ {}", theta);
}
```

Why This Works
Because:
- If θ < 3 → gradient is negative → subtracting negative increases θ
- If θ > 3 → gradient is positive → subtracting decreases θ
Either way, θ automatically moves toward the minimum at θ = 3.
📌 What It Means in Gradient Descent
$\mathcal{L}$ represents the Loss function.
So in ML:
- $\theta$ → model parameters
- $\mathcal{L}(\theta)$ → how bad the model is
- Gradient Descent minimizes $\mathcal{L}(\theta)$
- Update rule: $\theta \leftarrow \theta - \eta \nabla \mathcal{L}(\theta)$
📌 Why Use Script L?
In machine learning:
- $L$ → often used in textbooks
- $\mathcal{L}$ → common in research papers
They usually mean the same thing: the objective / loss function.
📌 Other Similar L-like Symbols
| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathcal{L}$ | `\mathcal{L}` | Loss function |
| $L$ | `L` | Normal letter |
| $\ell$ | `\ell` | Lowercase script l |
| $\lambda$ | `\lambda` | Lambda |
https://younghakim7.github.io/blog/posts/260211_how_to_train_a_neural_net_sgd001/