andersch.dev

<2024-10-26 Sat>
[ ai math ]

Gradient Descent

Gradient descent is an optimization algorithm used to minimize a cost function by iteratively adjusting its parameters in the direction of the steepest descent, as indicated by the negative gradient of the function.

After picking initial parameters and computing the gradient (the rate of change of the loss), the parameters are nudged slightly in the direction that reduces the loss. Repeating this process iteratively lets the loss function converge to a local minimum.

Gradient descent is widely used in machine learning, deep learning, and statistics.

Learning Rate

The learning rate is a hyperparameter that scales the gradient when updating the model's parameters during training (i.e. when performing gradient descent).

It determines the size of the steps taken towards the minimum of the cost function during the optimization process (i.e. how much to update the weights).

Methods for Getting the Gradient

Algebraic approach

  • Symbolically manipulate cost function to get derivative (using rules)
  • ✅ Perfect accuracy
  • ❌ Not feasible with long cost functions in practice

Finite Differences (easy, numerical approach)

  • gradient = (cost(x + epsilon) - cost(x))/epsilon
  • ✅ Internals of cost function don't need to be considered
  • ❌ Very slow to evaluate at many locations
  • ❌ Only approximates derivatives and requires tuning an epsilon

Dual Numbers

  • ✅ Exact derivatives
  • ❌ y = f(x) needs to be templated to take dual numbers
  • ❌ Also very slow, and takes a lot of memory: one extra float per derivative

Backpropagation (hard)

  • ✅ Very fast, exact derivatives
  • ❌ Needs knowledge of y = f(x) to apply chain rule at each step

Methods for Performing Gradient Descent

  • Stochastic Gradient Descent (SGD): Updates parameters using the gradient computed from a single or a few training examples, which introduces noise and can help escape local minima.
  • RMSprop (Root Mean Square Propagation): Modifies SGD by adapting the learning rate for each parameter based on the average of recent magnitudes of the gradients, which helps to stabilize the updates and improve convergence.
  • Adam (Adaptive Moment Estimation): Combines ideas from both RMSprop and momentum by maintaining a moving average of both the gradients and the squared gradients, allowing for adaptive learning rates and faster convergence.

Code Examples

#include <stdio.h>

double mean_squared_error(double *x, double *y, double m, double b, int n) {
    double mse = 0.0;
    for (int i = 0; i < n; i++) {
        double prediction = m * x[i] + b; // y = mx + b
        mse += (y[i] - prediction) * (y[i] - prediction);
    }
    return mse / n; // Return average
}

void gradientDescent(double *x, double *y, double *m, double *b, int n, double learningRate, int iterations) {
    for (int iter = 0; iter < iterations; iter++) {
        double m_gradient = 0.0;
        double b_gradient = 0.0;

        // Calculate gradients
        for (int i = 0; i < n; i++) {
            double prediction = (*m) * x[i] + (*b);
            m_gradient += -2 * x[i] * (y[i] - prediction);
            b_gradient += -2 * (y[i] - prediction);
        }

        // Update parameters
        *m -= (m_gradient / n) * learningRate;
        *b -= (b_gradient / n) * learningRate;

        // Optionally print the MSE every 100 iterations
        if (iter % 100 == 0) {
            printf("Iteration %d: MSE = %f\n", iter, mean_squared_error(x, y, *m, *b, n));
        }
    }
}

int main() {
    // Sample data points (x, y)
    double x[] = {1, 2, 3, 4, 5};  // input
    double y[] = {2, 3, 5, 7, 11}; // target to predict
    int n = sizeof(x) / sizeof(x[0]); // Number of data points

    // Initial parameters (y = mx + b)
    double m = 0.0; // Slope
    double b = 0.0; // Intercept
    double learningRate = 0.01;
    int iterations = 1000;

    gradientDescent(x, y, &m, &b, n, learningRate, iterations);
    printf("Final parameters: m = %f, b = %f\n", m, b);
    return 0;
}