ML Specialization

Regularization in Linear and Logistic Regression

Overfitting and Underfitting

Definitions

Overfitting (High Variance) This occurs when the algorithm tries to fit the training data too well and fails to generalize.

  • Characteristics: Uses too many features.
  • Result: Error is almost zero on training data, but the model performs poorly on new, unseen data.

Underfitting (High Bias) This occurs when the algorithm does not fit the training data well and fails to make accurate predictions.

  • Characteristics: Uses too few features.
  • Result: Error is high on training data, and the model performs poorly overall.


When applying these concepts to classification with Logistic Regression, the idea is quite similar:

(Figure: overfitting and underfitting in logistic regression.)


Concept Quiz

Question: Our goal when creating a model is to be able to use the model to predict outcomes correctly for new examples. A model which does this is said to generalize well. When a model fits the training data well but does not work well with new examples that are not in the training set, this is an example of:

A. None of the above
B. Underfitting (high bias)
C. Overfitting (high variance)
D. A model that generalizes well (neither high variance nor high bias)

Answer: C. Overfitting.
Explanation: This is exactly when the model does not generalize well to new examples despite low training error.


Addressing Overfitting

There are three main ways to address overfitting:

1. Collect More Training Data

With a larger training set, the learning algorithm will learn to fit a function that is less "wiggly" and more generalized.


2. Feature Selection

If we have many features and limited training data, the model is more likely to overfit. In such cases, it is advised to choose only a subset of the most useful features.

  • Example: Keep size, bedrooms, age. Discard distance to coffee shop (see the slicing sketch after this list).
  • Disadvantage: Useful information about the data might be lost.
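
A minimal NumPy sketch of this idea (the column ordering and values here are hypothetical, purely for illustration):

import numpy as np

# Hypothetical design matrix: columns = [size, bedrooms, age, distance_to_coffee_shop]
X = np.array([[2104., 5., 45., 1.2],
              [1416., 3., 40., 3.5],
              [ 852., 2., 35., 0.8]])

keep = [0, 1, 2]            # indices of the features we decide to keep
X_subset = X[:, keep]       # drops the "distance to coffee shop" column
print(X_subset.shape)       # (3, 3)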


3. Regularization

In an overfit model, we often notice that the parameters $w_j$ are relatively large. Regularization works by encouraging the learning algorithm to shrink the values of the parameters without demanding they be set to exactly zero (unlike feature elimination).

  • Benefit: It allows us to keep all features but prevents them from having an overly large impact.
  • Convention: We generally only regularize the parameters $w_j$ and leave $b$ alone, as regularizing $b$ typically makes little difference.


Recap of Addressing Overfitting

Question: Applying regularization, increasing the number of training examples, or selecting a subset of the most relevant features are methods for...

Answer: A. Addressing overfitting (high variance).


Cost Function with Regularization

The Intuition

Given a function

$$f(x) = w_1x + w_2x^2 + w_3x^3 + w_4x^4 + b$$

If we want to penalize specific terms (e.g., $w_3$ and $w_4$) to prevent them from becoming too large, we can modify the cost function:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2 + 1000\,w_3^2 + 1000\,w_4^2$$

By adding a large penalty coefficient (1000), the model is forced to keep $w_3$ and $w_4$ small to minimize the overall cost.


Generalized Regularized Cost Function

Since we don't know in advance which features are "bad", we penalize all of the $w_j$ parameters; a short vectorized sketch of this cost appears after the list below.

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

  • $\lambda$ (Lambda): The regularization parameter ($\lambda > 0$).
  • The term $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$ balances the trade-off between fitting the data (MSE) and keeping parameters small (Regularization).

  • Minimizing MSE Term: Reduces error between prediction and true value.
  • Minimizing Regularization Term: Keeps $w_j$ small to reduce overfitting.
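
For illustration (the lab's loop-based implementation appears later in this document), a vectorized NumPy sketch of this regularized cost might look like the following; the function name and example values are made up:

import numpy as np

def regularized_linear_cost(X, y, w, b, lambda_):
    """Sketch: (1/2m)*sum(squared errors) + (lambda/2m)*sum(w_j^2); b is not regularized."""
    m = X.shape[0]
    err = X @ w + b - y                              # prediction errors for all m examples
    mse_term = (err @ err) / (2 * m)                 # squared-error part of the cost
    reg_term = (lambda_ / (2 * m)) * np.sum(w ** 2)  # regularization part
    return mse_term + reg_term

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 6.0])
w = np.array([0.5, -0.5])
print(regularized_linear_cost(X, y, w, b=1.0, lambda_=1.0))   # 12.75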

Regularization Parameter

Choosing Lambda ($\lambda$)

  • If $\lambda \approx 0$:
    No regularization. The model minimizes the MSE term only, leading to Overfitting.
  • If $\lambda$ is extremely large (e.g., $10^{10}$):
    The model forces $w_j \approx 0$, so the function becomes $f(x) = b$ (a flat line). This leads to Underfitting. A small numerical sketch of this trade-off follows the list.
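
To make the effect of $\lambda$ concrete, here is a small numerical sketch using the closed-form ridge-regression solution rather than gradient descent (chosen only for compactness); the data and $\lambda$ values are invented for illustration:

import numpy as np

np.random.seed(0)
m, n = 20, 5
X = np.random.randn(m, n)
true_w = np.array([3.0, -2.0, 1.5, 0.5, -1.0])
y = X @ true_w + 0.1 * np.random.randn(m)

for lam in [0.0, 1.0, 100.0]:
    # Closed-form ridge solution (no bias term): w = (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w|| = {np.linalg.norm(w):.3f}")

# Larger lambda shrinks the parameters toward 0; in the extreme, the model underfits.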

Regularization Parameter Intuition

Question: For a model that includes the regularization parameter $\lambda$, increasing $\lambda$ will tend to:

Answer: A. Decrease the size of the parameters $w_1, w_2, \ldots, w_n$.


Regularized Linear Regression

Cost Function

$$\min_{\vec{w},b} J(\vec{w},b) = \min_{\vec{w},b} \left[ \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2 \right]$$

Gradient Descent Update Rule

The update rule applies simultaneously to all $j$:

$$\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\qquad w_j = w_j - \alpha \frac{\partial J(\vec{w},b)}{\partial w_j} \\
&\qquad b = b - \alpha \frac{\partial J(\vec{w},b)}{\partial b} \\
&\}
\end{aligned}$$

Substituting the derivatives:

$$w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

$$b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right) \right]$$

We can rearrange the update for $w_j$ to see the impact clearly:

$$w_j = w_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

  • The term $\left(1 - \alpha \frac{\lambda}{m}\right)$ is usually a number slightly less than 1 (e.g., 0.9998; a worked example follows this list).
  • This shrinks $w_j$ slightly on every iteration before the standard gradient update is applied.
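
For instance, with the illustrative values $\alpha = 0.01$, $\lambda = 1$, and $m = 50$ (numbers chosen only for this example):

$$1 - \alpha \frac{\lambda}{m} = 1 - (0.01)\frac{1}{50} = 1 - 0.0002 = 0.9998$$

so every iteration multiplies $w_j$ by 0.9998 before the usual gradient step is subtracted.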

(Figures: regularized linear regression and the derivation of the regularized update.)


Regularized Logistic Regression

When training Logistic Regression with many features (e.g., high-degree polynomials), the risk of overfitting increases.

Cost Function

The cost function is the standard Log Loss plus the regularization term:

$$J(\vec{w},b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1-y^{(i)}\right) \log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

Gradient Descent

The update rules look identical to Linear Regression, but remember that for Logistic Regression,

$$f_{\vec{w},b}(\vec{x}) = \text{sigmoid}(\vec{w} \cdot \vec{x} + b)$$

$$w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

$$b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right) \right]$$

(Figure: gradient descent for regularized logistic regression.)


Lab: Implementation in Python

The following sections implement regularized Linear and Logistic regression using numpy.

1. Sigmoid Function

Implementation of the helper function

$$g(z) = \frac{1}{1+e^{-z}}$$

import numpy as np
import math

def sigmoid(z):
    """
    Compute the sigmoid of z
    Args:
        z (ndarray): A scalar, numpy array of any size.
    Returns:
        g (ndarray): sigmoid(z), with the same shape as z
    """
    g = 1 / (1 + np.exp(-z))
    return g

Output Verification:

sigmoid([ -1, 0, 1, 2]) = [0.26894142 0.5        0.73105858 0.88079708]
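
A side note that is not part of the lab: for very negative z, np.exp(-z) overflows and NumPy emits a runtime warning (the result is still the correct 0 after the division). A sketch of a numerically stabler variant, assuming you want to silence that warning, is shown below; the function name is hypothetical:

import numpy as np

def sigmoid_stable(z):
    """Sketch of a sigmoid that avoids overflow in np.exp for large |z|."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))     # safe: the exponent is <= 0
    ez = np.exp(z[~pos])                         # safe: z < 0, so exp(z) <= 1
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid_stable(np.array([-1000., -1., 0., 1., 1000.])))
# Values approach 0 and 1 at the extremes with no overflow warning.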

2. Cost Functions with Regularization

Linear Regression Cost

Includes the term:

$$\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m  = X.shape[0]
    n  = len(w)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost = cost + (f_wb_i - y[i])**2             
    cost = cost / (2 * m)
 
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)
    reg_cost = (lambda_/(2*m)) * reg_cost
    
    total_cost = cost + reg_cost
    return total_cost

Output Verification:

Regularized cost: 0.07917239320214275

Logistic Regression Cost

Implementation of Equation:

$$J(\vec{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right)\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\vec{w},b}\left(\vec{x}^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples with regularization
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m,n  = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        f_wb_i = sigmoid(z_i)
        # Prevent log(0) errors by clipping or handling edge cases in production
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)
             
    cost = cost/m

    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)
    reg_cost = (lambda_/(2*m)) * reg_cost
    
    total_cost = cost + reg_cost
    return total_cost

Output Verification:

Regularized cost: 0.6850849138741673
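
For comparison only (the loop above follows the lab), a vectorized sketch of the same computation; the function name is made up:

import numpy as np

def compute_cost_logistic_reg_vec(X, y, w, b, lambda_=1):
    """Sketch: vectorized equivalent of the loop-based cost above."""
    m = X.shape[0]
    f_wb = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid of all m linear scores at once
    cost = -np.mean(y * np.log(f_wb) + (1 - y) * np.log(1 - f_wb))
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return cost + reg_cost

With the same X, y, w, b, and lambda_ as the lab cell, this should match the regularized cost shown above up to floating-point round-off.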

3. Gradient Descent with Regularization

The gradient calculation adds the term $\frac{\lambda}{m} w_j$ to the partial derivative with respect to $w_j$. The bias $b$ is not regularized.

Linear Regression Gradient

def compute_gradient_linear_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for linear regression 
    """
    m,n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]                 
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]               
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m   
    
    # Add Regularization term
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]

    return dj_db, dj_dw

Output Verification:

dj_db: 0.6648774569425726
Regularized dj_dw:
 [0.29653214748822276, 0.4911679625918033, 0.21645877535865857]

Logistic Regression Gradient

def compute_gradient_logistic_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for logistic regression 
    Args:
      X (ndarray (m,n)): Data
      y (ndarray (m,)): target values
      w (ndarray (n,)): parameters  
      b (scalar)      : parameter
      lambda_ (scalar): Regularization constant
    Returns
      dj_dw (ndarray Shape (n,)): Gradient of cost w.r.t. parameters w. 
      dj_db (scalar)            : Gradient of cost w.r.t. parameter b. 
    """
    m,n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0

    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i],w) + b)
        err_i  = f_wb_i  - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i,j]
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m
    dj_db = dj_db/m

    # Add Regularization term
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]

    return dj_db, dj_dw

Output Verification:

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
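
As with the cost, a vectorized sketch of the same gradient (the function name is made up):

import numpy as np

def compute_gradient_logistic_reg_vec(X, y, w, b, lambda_):
    """Sketch: vectorized equivalent of the loop-based gradient above."""
    m = X.shape[0]
    err = 1.0 / (1.0 + np.exp(-(X @ w + b))) - y   # f_wb - y for every example, shape (m,)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w    # regularization term added to each w_j
    dj_db = np.mean(err)                           # b is not regularized
    return dj_db, dj_dw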

4. Gradient Descent Algorithm

The loop that applies the updates. This function works for both linear and logistic regression provided the correct cost_function and gradient_function are passed.

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters, lambda_): 
    """
    Performs batch gradient descent to learn w and b.
    """
    # number of training examples
    m = len(X)
    
    # Arrays to store cost J and w's at each iteration
    J_history = []
    w_history = []
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w_in, b_in, lambda_)   

        # Update Parameters using w, b, alpha and gradient
        w_in = w_in - alpha * dj_dw               
        b_in = b_in - alpha * dj_db              
       
        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion 
            cost =  cost_function(X, y, w_in, b_in, lambda_)
            J_history.append(cost)

        # Print cost at ~10 evenly spaced intervals (and on the final iteration)
        if i % math.ceil(num_iters/10) == 0 or i == (num_iters-1):
            w_history.append(w_in)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")
        
    return w_in, b_in, J_history, w_history

Expected Output (Regularized Logistic Regression):

Iteration    0: Cost     0.72   
Iteration 1000: Cost     0.59   
Iteration 2000: Cost     0.56   
Iteration 3000: Cost     0.53   
Iteration 4000: Cost     0.51   
Iteration 5000: Cost     0.50   
Iteration 6000: Cost     0.48   
Iteration 7000: Cost     0.47   
Iteration 8000: Cost     0.46   
Iteration 9000: Cost     0.45   
Iteration 9999: Cost     0.45 
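
The expected output above comes from the lab's own training set, which is not reproduced in this document. Purely as a hypothetical usage sketch (made-up data; it assumes sigmoid, compute_cost_logistic_reg, compute_gradient_logistic_reg, and gradient_descent from the cells above are in scope), a call could look like:

import numpy as np

np.random.seed(1)
X_tmp = np.random.rand(20, 3)                               # invented features
y_tmp = (X_tmp[:, 0] + X_tmp[:, 1] > 1.0).astype(float)     # invented binary labels
w_init = np.zeros(X_tmp.shape[1])
b_init = 0.0

w_out, b_out, J_hist, _ = gradient_descent(
    X_tmp, y_tmp, w_init, b_init,
    compute_cost_logistic_reg, compute_gradient_logistic_reg,
    alpha=0.1, num_iters=1000, lambda_=0.1)
print("final regularized cost:", J_hist[-1])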

5. Prediction

To interpret the output of Logistic Regression as a class label (0 or 1), we apply a threshold (usually 0.5).

def predict(X, w, b): 
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w and b
    """
    m, n = X.shape   
    p = np.zeros(m)
   
    for i in range(m):   
        z_wb = np.dot(X[i], w) + b
        f_wb = sigmoid(z_wb)

        # Apply the threshold
        p[i] = f_wb >= 0.5
        
    return p

Output Verification:

Output of predict: shape (4,), value [0. 1. 1. 1.]
Train Accuracy: 82.203390
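
The accuracy figure above is simply the percentage of training examples whose predicted label matches the true label. A minimal sketch, assuming the lab's X_train, y_train and the learned w, b are in scope (with numpy imported as above):

p = predict(X_train, w, b)                                    # predicted labels, shape (m,)
print('Train Accuracy: %f' % (np.mean(p == y_train) * 100))   # fraction of matches, as a percent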

Visualizing Regularization Effects

In the lab exercises, mapping features to higher degrees (e.g., polynomial mapping) allows decision boundaries to be non-linear. However, this often leads to overfitting.

  • No Regularization ($\lambda = 0$):
    The boundary is very "wiggly" and overfits the training set.
  • High Regularization ($\lambda = 1$):
    The boundary smooths out, sacrificing some training accuracy for better generalization.

(Plots: decision boundary without regularization vs. decision boundary with regularization.)