ML Specialization

Gradient Descent in Practice

Advanced techniques for linear regression, including feature scaling, feature engineering, and polynomial regression, along with practical lab exercises.

Advanced Regression Techniques

Scaling Features for Better Performance

When features in a dataset have vastly different ranges, it can slow down the convergence of gradient descent. Feature scaling is a technique to bring all features into a similar range, which helps gradient descent find the optimal solution more quickly.

For example, consider predicting house prices with two features:

  • Size (sq. ft.): Ranges from 300 to 5000.
  • Number of Bedrooms: Ranges from 1 to 5.

The parameter w_1 associated with size will be very small, while the parameter w_2 associated with bedrooms will be much larger.

The Impact on the Cost Function

This disparity in feature scales leads to a cost function J(w,b) with elongated, skinny contours.

  • A small change in w_1 (for size) causes a large change in the cost.
  • A large change in w_2 (for bedrooms) is needed to affect the cost similarly.

As a result, the contour plot looks like a set of tall, narrow ellipses. Gradient descent can struggle with such a surface, often bouncing back and forth inefficiently before reaching the minimum.

By scaling the features (e.g., transforming both to a range of 0 to 1), the contours of the cost function become more circular. This allows gradient descent to take a more direct path to the global minimum.

Goal of Feature Scaling

The main goal is to transform features so they have comparable ranges. This ensures that each feature contributes more equally to the model's learning process and helps gradient descent converge faster.

Methods for Feature Scaling

Here are three common methods for feature scaling:

  1. Dividing by the Maximum

    • Formula:
    x_1 = \frac{x_1}{\text{max}(x_1)}
    • This scales the feature to a range between 0 and 1 (assuming non-negative values), and is a simple, effective choice for strictly positive features.
  2. Mean Normalization

    • Formula:
    x_1 = \frac{x_1 - \mu_1}{\text{max}(x_1) - \text{min}(x_1)}
    • This method centers the data around 0 and scales it by the range. The resulting features will generally be in the range of -1 to 1.
  3. Z-Score Normalization

    • Formula:
    x_1 = \frac{x_1 - \mu_1}{\sigma_1}

    (where μ_1 is the mean and σ_1 is the standard deviation of feature x_1)

    • This is a very common and effective method. It rescales features to have a mean of 0 and a standard deviation of 1.

Note: When in doubt, applying feature scaling is generally a good idea and rarely hurts the model's performance.
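
To make the three methods concrete, here is a minimal NumPy sketch (illustrative only, not one of the course's utility functions; scale_features is a hypothetical helper):

import numpy as np

def scale_features(X, method="zscore"):
    """Apply one of the three scaling methods above to each column of X (shape (m, n))."""
    if method == "max":       # 1. divide each feature by its maximum
        return X / X.max(axis=0)
    if method == "mean":      # 2. mean normalization: center, then divide by the range
        return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    if method == "zscore":    # 3. z-score normalization: center, then divide by the std
        return (X - X.mean(axis=0)) / X.std(axis=0)
    raise ValueError(f"unknown method: {method}")

# Example: size in sq. ft. and number of bedrooms have very different ranges
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [ 852.0, 2.0]])
print(scale_features(X, "max"))      # each column now lies in (0, 1]
print(scale_features(X, "zscore"))   # each column now has mean 0, std 1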

Knowledge Check

Question: Which of the following is a valid step used during feature scaling?

A. Multiply each value by the maximum value for that feature.
B. Divide each value by the maximum value for that feature.

Answer:

B. By dividing all values by the maximum, the new range of the rescaled features will have a maximum value of 1.


Monitoring Gradient Descent

It's crucial to monitor gradient descent to ensure it's converging correctly.

Checking for Convergence

A learning curve is a plot of the cost function J(w,b) versus the number of iterations.

  • If gradient descent is working correctly, the cost J(w,b) should decrease after every iteration.
  • The curve should eventually flatten out, indicating that the algorithm has converged to a minimum.
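
One way to automate this check (a minimal sketch, assuming the cost history is kept in a Python list as in the labs below) is to declare convergence once the cost stops decreasing by more than a small threshold ε:

def has_converged(J_history, epsilon=1e-3):
    """Return True once the most recent drop in cost is smaller than epsilon."""
    if len(J_history) < 2:
        return False
    return (J_history[-2] - J_history[-1]) < epsilon

# Example: a cost history that has flattened out
print(has_converged([10.0, 4.0, 3.5, 3.4999]))   # True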

Choosing the Right Learning Rate (α)

The learning rate α is a critical hyperparameter.

  • If α is too large: The cost may increase or oscillate wildly. The algorithm might "overshoot" the minimum and diverge.
  • If α is too small: Gradient descent will be very slow to converge.

A good approach is to try a range of α values (e.g., 0.001, 0.01, 0.1, 1.0) and plot their learning curves to find a value that causes the cost to decrease quickly and consistently.
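
As a sketch of that workflow (it reuses gradient_descent, compute_cost, compute_gradient, x_train, and y_train exactly as they are defined in the from-scratch lab at the end of this page, so treat it as illustrative rather than a standalone cell):

import matplotlib.pyplot as plt

# Compare learning curves for several candidate learning rates.
for alpha in [0.001, 0.01, 0.1, 1.0]:
    _, _, J_history, _ = gradient_descent(x_train, y_train, 0., 0.,
                                          compute_cost, compute_gradient,
                                          alpha, 50)
    plt.plot(J_history, label=f"alpha = {alpha}")   # diverging curves shoot upward
plt.xlabel("iteration"); plt.ylabel("cost J(w,b)")
plt.legend(); plt.show()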

Knowledge Check

Question: You run gradient descent for 15 iterations with α = 0.3 and compute J(w) after each iteration. You find that the value of J(w) increases over time. How do you think you should adjust the learning rate α?

A. Try a larger value of α = 1.0
B. Keep running it for additional iterations
C. Try a smaller value of α = 0.1
D. Try running it for only 10 iterations so J(w) doesn't increase as much.

Answer:

C. Since the cost function is increasing, we know that gradient descent is diverging. This indicates that the learning rate is too high, so we should try a smaller value.


Lab: Feature Scaling and Learning Rate in Practice

Goals

  • Run Gradient Descent on a dataset with multiple features.
  • Explore the impact of the learning rate α on convergence.
  • Improve performance by applying Z-score normalization.

Problem Statement

You will use a housing dataset with four features to predict the price of a house.

Feature       Description
size(sqft)    Size of the house in square feet.
bedrooms      Number of bedrooms.
floors        Number of floors.
age           Age of the house in years.

Setup

import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import  load_house_data, run_gradient_descent
from lab_utils_multi import  norm_plot, plt_equal_scale, plot_cost_i_w
from lab_utils_common import dlc
np.set_printoptions(precision=2)
plt.style.use('./deeplearning.mplstyle')

Load and Visualize the Data

# load the dataset
X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']

# Plot each feature vs. the target, price
fig,ax=plt.subplots(1, 4, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("Price (1000's)")
plt.show()

The plots show that size and age have a stronger correlation with price than bedrooms or floors.

Gradient Descent for Multiple Variables

The update rules for gradient descent remain the same, but are now applied to each parameter w_j and to b.

w_j := w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j}

b := b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}
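
For intuition, here is what one such update looks like in vectorized NumPy (an illustrative sketch; the lab's run_gradient_descent helper performs the equivalent steps internally):

import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One gradient-descent update for multi-feature linear regression f(x) = X @ w + b."""
    m = X.shape[0]
    err = X @ w + b - y          # (m,) prediction errors
    dj_dw = X.T @ err / m        # (n,) partial derivatives with respect to each w_j
    dj_db = err.sum() / m        # scalar partial derivative with respect to b
    return w - alpha * dj_dw, b - alpha * dj_db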

The Impact of the Learning Rate (α)

Let's run gradient descent with the raw (unscaled) data and observe the effect of different learning rates.

1. α = 9.9e-7 (Too Large)

#set alpha to 9.9e-7
_, _, hist = run_gradient_descent(X_train, y_train, 10, alpha = 9.9e-7)  

The cost function increases with each iteration, a clear sign that the learning rate is too high and the algorithm is diverging.

Output:

Iteration Cost          w0       w1       w2       w3       b       djdw0    djdw1    djdw2    djdw3    djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
        0 9.55884e+04  5.5e-01  1.0e-03  5.1e-04  1.2e-02  3.6e-04 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
        1 1.28213e+05 -8.8e-02 -1.7e-04 -1.0e-04 -3.4e-03 -4.8e-05  6.4e+05  1.2e+03  6.2e+02  1.6e+04  4.1e+02
        2 1.72159e+05  6.5e-01  1.2e-03  5.9e-04  1.3e-02  4.3e-04 -7.4e+05 -1.4e+03 -7.0e+02 -1.7e+04 -4.9e+02
        3 2.31358e+05 -2.1e-01 -4.0e-04 -2.3e-04 -7.5e-03 -1.2e-04  8.6e+05  1.6e+03  8.3e+02  2.1e+04  5.6e+02
        4 3.11100e+05  7.9e-01  1.4e-03  7.1e-04  1.5e-02  5.3e-04 -1.0e+06 -1.8e+03 -9.5e+02 -2.3e+04 -6.6e+02
        5 4.18517e+05 -3.7e-01 -7.1e-04 -4.0e-04 -1.3e-02 -2.1e-04  1.2e+06  2.1e+03  1.1e+03  2.8e+04  7.5e+02
        6 5.63212e+05  9.7e-01  1.7e-03  8.7e-04  1.8e-02  6.6e-04 -1.3e+06 -2.5e+03 -1.3e+03 -3.1e+04 -8.8e+02
        7 7.58122e+05 -5.8e-01 -1.1e-03 -6.2e-04 -1.9e-02 -3.4e-04  1.6e+06  2.9e+03  1.5e+03  3.8e+04  1.0e+03
        8 1.02068e+06  1.2e+00  2.2e-03  1.1e-03  2.3e-02  8.3e-04 -1.8e+06 -3.3e+03 -1.7e+03 -4.2e+04 -1.2e+03
        9 1.37435e+06 -8.7e-01 -1.7e-03 -9.1e-04 -2.7e-02 -5.2e-04  2.1e+06  3.9e+03  2.0e+03  5.1e+04  1.4e+03
w,b found by gradient descent: w: [-0.87 -0.   -0.   -0.03], b: -0.00

plot_cost_i_w(X_train, y_train, hist)

2. α = 9e-7 (Moderate)

#set alpha to 9e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 9e-7)

The cost is now decreasing, but the parameter w_0 is oscillating around the optimal value, indicating the learning rate is still a bit high, causing it to "jump over" the minimum. It will eventually converge, but slowly.

Output:

Iteration Cost          w0       w1       w2       w3       b       djdw0    djdw1    djdw2    djdw3    djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
        0 6.64616e+04  5.0e-01  9.1e-04  4.7e-04  1.1e-02  3.3e-04 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
        1 6.18990e+04  1.8e-02  2.1e-05  2.0e-06 -7.9e-04  1.9e-05  5.3e+05  9.8e+02  5.2e+02  1.3e+04  3.4e+02
        2 5.76572e+04  4.8e-01  8.6e-04  4.4e-04  9.5e-03  3.2e-04 -5.1e+05 -9.3e+02 -4.8e+02 -1.1e+04 -3.4e+02
        3 5.37137e+04  3.4e-02  3.9e-05  2.8e-06 -1.6e-03  3.8e-05  4.9e+05  9.1e+02  4.8e+02  1.2e+04  3.2e+02
        4 5.00474e+04  4.6e-01  8.2e-04  4.1e-04  8.0e-03  3.2e-04 -4.8e+05 -8.7e+02 -4.5e+02 -1.1e+04 -3.1e+02
        5 4.66388e+04  5.0e-02  5.6e-05  2.5e-06 -2.4e-03  5.6e-05  4.6e+05  8.5e+02  4.5e+02  1.2e+04  2.9e+02
        6 4.34700e+04  4.5e-01  7.8e-04  3.8e-04  6.4e-03  3.2e-04 -4.4e+05 -8.1e+02 -4.2e+02 -9.8e+03 -2.9e+02
        7 4.05239e+04  6.4e-02  7.0e-05  1.2e-06 -3.3e-03  7.3e-05  4.3e+05  7.9e+02  4.2e+02  1.1e+04  2.7e+02
        8 3.77849e+04  4.4e-01  7.5e-04  3.5e-04  4.9e-03  3.2e-04 -4.1e+05 -7.5e+02 -3.9e+02 -9.1e+03 -2.7e+02
        9 3.52385e+04  7.7e-02  8.3e-05 -1.1e-06 -4.2e-03  8.9e-05  4.0e+05  7.4e+02  3.9e+02  1.0e+04  2.5e+02
w,b found by gradient descent: w: [ 7.74e-02  8.27e-05 -1.06e-06 -4.20e-03], b: 0.00

plot_cost_i_w(X_train, y_train, hist)

3. α = 1e-7 (Too Small)

#set alpha to 1e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 1e-7)

The cost decreases steadily, but very slowly. This would require many more iterations to converge.

Output:

Iteration Cost          w0       w1       w2       w3       b       djdw0    djdw1    djdw2    djdw3    djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
        0 4.42313e+04  5.5e-02  1.0e-04  5.2e-05  1.2e-03  3.6e-05 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
        1 2.76461e+04  9.8e-02  1.8e-04  9.2e-05  2.2e-03  6.5e-05 -4.3e+05 -7.9e+02 -4.0e+02 -9.5e+03 -2.8e+02
        2 1.75102e+04  1.3e-01  2.4e-04  1.2e-04  2.9e-03  8.7e-05 -3.4e+05 -6.1e+02 -3.1e+02 -7.3e+03 -2.2e+02
        3 1.13157e+04  1.6e-01  2.9e-04  1.5e-04  3.5e-03  1.0e-04 -2.6e+05 -4.8e+02 -2.4e+02 -5.6e+03 -1.8e+02
        4 7.53002e+03  1.8e-01  3.3e-04  1.7e-04  3.9e-03  1.2e-04 -2.1e+05 -3.7e+02 -1.9e+02 -4.2e+03 -1.4e+02
        5 5.21639e+03  2.0e-01  3.5e-04  1.8e-04  4.2e-03  1.3e-04 -1.6e+05 -2.9e+02 -1.5e+02 -3.1e+03 -1.1e+02
        6 3.80242e+03  2.1e-01  3.8e-04  1.9e-04  4.5e-03  1.4e-04 -1.3e+05 -2.2e+02 -1.1e+02 -2.3e+03 -8.6e+01
        7 2.93826e+03  2.2e-01  3.9e-04  2.0e-04  4.6e-03  1.4e-04 -9.8e+04 -1.7e+02 -8.6e+01 -1.7e+03 -6.8e+01
        8 2.41013e+03  2.3e-01  4.1e-04  2.1e-04  4.7e-03  1.5e-04 -7.7e+04 -1.3e+02 -6.5e+01 -1.2e+03 -5.4e+01
        9 2.08734e+03  2.3e-01  4.2e-04  2.1e-04  4.8e-03  1.5e-04 -6.0e+04 -1.0e+02 -4.9e+01 -7.5e+02 -4.3e+01
w,b found by gradient descent: w: [2.31e-01 4.18e-04 2.12e-04 4.81e-03], b: 0.00

This process highlights the difficulty of finding a good learning rate when features have very different scales.

Feature Scaling in Action

Let's apply Z-score normalization to solve this problem.

x_j = \frac{x_j - \mu_j}{\sigma_j}

Implementation

def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu     = np.mean(X, axis=0)
    # find the standard deviation of each column/feature
    sigma  = np.std(X, axis=0)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma      

    return (X_norm, mu, sigma)

Visualizing the Transformation

The normalization process centers each feature around zero and gives it a standard deviation of one.

  • Left (Unnormalized): The scale of size(sqft) is vastly different from age.
  • Middle (Mean Subtracted): The features are centered around zero.
  • Right (Z-score Normalized): Both features are now centered at zero and have a similar scale.

Applying Normalization to the Data

# normalize the original features
X_norm, X_mu, X_sigma = zscore_normalize_features(X_train)
print(f"X_mu = {X_mu}, \nX_sigma = {X_sigma}")
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X_train,axis=0)}")   
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")

Output:

X_mu = [1.42e+03 2.72e+00 1.38e+00 3.84e+01], 
X_sigma = [411.62   0.65   0.49  25.78]
Peak to Peak range by column in Raw        X:[2.41e+03 4.00e+00 1.00e+00 9.50e+01]
Peak to Peak range by column in Normalized X:[5.85 6.14 2.06 3.69]

The peak-to-peak range is now much more consistent across features.

Rerunning Gradient Descent with Normalized Data

With scaled features, we can use a much larger learning rate, α = 1.0e-1, which drastically speeds up convergence.

w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 1000, 1.0e-1)

Output:

Iteration Cost          w0       w1       w2       w3       b       djdw0    djdw1    djdw2    djdw3    djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
        0 5.76170e+04  8.9e+00  3.0e+00  3.3e+00 -6.0e+00  3.6e+01 -8.9e+01 -3.0e+01 -3.3e+01  6.0e+01 -3.6e+02
      100 2.21086e+02  1.1e+02 -2.0e+01 -3.1e+01 -3.8e+01  3.6e+02 -9.2e-01  4.5e-01  5.3e-01 -1.7e-01 -9.6e-03
      200 2.19209e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -3.0e-02  1.5e-02  1.7e-02 -6.0e-03 -2.6e-07
      300 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -1.0e-03  5.1e-04  5.7e-04 -2.0e-04 -6.9e-12
      400 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -3.4e-05  1.7e-05  1.9e-05 -6.6e-06 -2.7e-13
      500 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -1.1e-06  5.6e-07  6.2e-07 -2.2e-07 -2.6e-13
      600 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -3.7e-08  1.9e-08  2.1e-08 -7.3e-09 -2.6e-13
      700 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -1.2e-09  6.2e-10  6.9e-10 -2.4e-10 -2.6e-13
      800 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -4.1e-11  2.1e-11  2.3e-11 -8.1e-12 -2.7e-13
      900 2.19207e+02  1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01  3.6e+02 -1.4e-12  7.0e-13  7.6e-13 -2.7e-13 -2.6e-13
w,b found by gradient descent: w: [110.56 -21.27 -32.71 -37.97], b: 363.16

The model converges very quickly to a low cost. The scaled features allow for much faster and more stable training.

Predictions vs. Target Values

#predict target using normalized features
m = X_norm.shape[0]
yp = np.zeros(m)
for i in range(m):
    yp[i] = np.dot(X_norm[i], w_norm) + b_norm

# plot predictions and targets versus original features    
fig,ax=plt.subplots(1,4,figsize=(12, 3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train, label = 'target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:,i],yp,color=dlc["dlorange"], label = 'predict')
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()

The model provides good predictions across all features.

Predicting on New Data

To predict the price of a new house, you must normalize its features using the same μ and σ calculated from the training set.

# Predict the price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old.
x_house = np.array([1200, 3, 1, 40])
x_house_norm = (x_house - X_mu) / X_sigma
print(x_house_norm)
x_house_predict = np.dot(x_house_norm, w_norm) + b_norm
print(f" predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = ${x_house_predict*1000:0.0f}")

Output:

[-0.53  0.43 -0.79  0.06]
 predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = $318709

Cost Contours Comparison

These plots visually confirm why feature scaling works. The contours for the unscaled data are elongated, while the normalized data yields circular contours, making the path to the minimum much more direct.


Feature Engineering & Polynomial Regression

Linear regression can be extended to model non-linear relationships through feature engineering, which is the process of creating new features by transforming or combining existing ones.

Feature Engineering

The choice of features significantly impacts a model's performance. By using domain knowledge and intuition, we can create features that better capture the underlying patterns in the data.
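
As a small, made-up example (the grocery-store scenario from the practice quiz at the end of this page), the product of two raw columns can itself become a new, more informative feature:

import numpy as np

# Toy data: items sold per week and price per item (values invented for illustration)
data = np.array([[120, 2.5],
                 [ 80, 4.0],
                 [200, 1.5]])
revenue = data[:, 0] * data[:, 1]   # engineered feature: items_sold * price_per_item
X_eng = np.c_[data, revenue]        # append it as a third column
print(X_eng)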

Knowledge Check

Question: If you have measurements for the dimensions of a swimming pool (length, width, height), which of the following two would be a more useful engineered feature?

A. length × width × height
B. length + width + height

Answer:

A. The volume (length × width × height) of the swimming pool is likely a more useful feature for many prediction tasks (e.g., predicting the cost to fill it) than the sum of its dimensions.

Polynomial Regression

What if your data doesn't follow a straight line? You can still use linear regression by engineering new polynomial features.

For example, if your data seems to follow a quadratic curve:

Instead of the model f(x) = w_1 x_1 + b, you can create a new feature x_1^2 and fit the model: f(x) = w_1 x_1 + w_2 x_1^2 + b

Even though the function is a curve with respect to x, it is a linear function of the features x_1 and x_1^2. This allows us to use the same linear regression algorithm.
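
As an aside, scikit-learn (used in a later lab on this page) can generate such polynomial features automatically; a brief sketch, assuming the standard PolynomialFeatures API:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(0, 6, 1).reshape(-1, 1)            # single input feature, shape (m, 1)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)                   # columns are [x_1, x_1^2]
print(X_poly)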


Lab: Implementing Polynomial Regression

Goals

  • Explore how to use feature engineering to fit non-linear data.
  • Understand how linear regression can model complex functions through polynomial features.

Setup

import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays


Linear Model on Non-linear Data

Let's try to fit a linear model to quadratic data, y = 1 + x^2.

# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2
X = x.reshape(-1, 1)

model_w,model_b = run_gradient_descent_feng(X,y,iterations=1000, alpha = 1e-2)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("no feature engineering")
plt.plot(x,X@model_w + model_b, label="Predicted Value");  plt.xlabel("X"); plt.ylabel("y"); plt.legend(); plt.show()

Output:

w,b found by gradient descent: w: [18.7], b: -52.0834

As expected, a straight line is a poor fit for this data.

Adding a Polynomial Feature (x²)

Now, let's engineer a new feature, x^2, and train the model again.

# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2

# Engineer features
X = x**2      #<-- added engineered feature
X = X.reshape(-1, 1)  #X should be a 2-D Matrix

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Added x**2 feature")
plt.plot(x, np.dot(X,model_w) + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Output:

w,b found by gradient descent: w: [1.], b: 0.0490

The fit is now nearly perfect! The learned parameters w ≈ [1.0] and b ≈ 0.05 are very close to the true model y = 1·x^2 + 1.

Selecting Features

What if we aren't sure which polynomial terms are needed? We can add several and let gradient descent figure it out. Let's try fitting y = x^2 with features x, x^2, and x^3.

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features
X = np.c_[x, x**2, x**3]

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Output:

w,b found by gradient descent: w: [0.08 0.54 0.03], b: 0.0106

The learned model is y ≈ 0.08x + 0.54x^2 + 0.03x^3 + 0.01. Gradient descent has assigned the largest weight (w_1 = 0.54) to the x^2 feature, correctly identifying it as the most important one. The weights for x and x^3 are much smaller.

An Alternate View

The best features for linear regression are those that have a linear relationship with the target y. Plotting our engineered features against y confirms this.

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features
X = np.c_[x, x**2, x**3]
X_features = ['x','x^2','x^3']

fig,ax=plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i],y)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("y")
plt.show()

Clearly, x^2 has a linear relationship with y, making it the perfect feature for a linear regression model.

Scaling Features

When creating polynomial features like x, x^2, and x^3, their scales will be vastly different. Feature scaling is essential here to speed up gradient descent.

# create target data
x = np.arange(0,20,1)
y = x**2

X = np.c_[x, x**2, x**3]
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X,axis=0)}")

# add z-score normalization
X = zscore_normalize_features(X)     
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X,axis=0)}")

Output:

Peak to Peak range by column in Raw        X:[  19  361 6859]
Peak to Peak range by column in Normalized X:[3.3  3.18 3.28]

Now, with scaled features, we can use a much larger learning rate and converge faster.

model_w, model_b = run_gradient_descent_feng(X, y, iterations=100000, alpha=1e-1)
# ... plotting code ...

Output:

w,b found by gradient descent: w: [5.27e-05 1.13e+02 8.43e-05], b: 123.5000

After normalization, gradient descent gives a much larger weight to the x^2 term and almost zero weight to the others, resulting in a very accurate model.

Modeling Complex Functions

With enough polynomial features, we can model even highly complex functions, like a cosine wave.

x = np.arange(0,20,1)
y = np.cos(x/2)

# Engineer features up to x^13
X = np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13]
# Normalize them
X = zscore_normalize_features(X) 

model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha = 1e-1)
# ... plotting code ...


Lab: Linear Regression with Scikit-Learn

Instead of implementing algorithms from scratch, we can use powerful, open-source libraries like scikit-learn.

Goals

  • Utilize scikit-learn to implement linear regression using Gradient Descent.

Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from lab_utils_multi import  load_house_data
from lab_utils_common import dlc
np.set_printoptions(precision=2)
plt.style.use('./deeplearning.mplstyle')

1. Load the Data

X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']

2. Scale/Normalize the Data

Scikit-learn's StandardScaler performs Z-score normalization.

scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X_train,axis=0)}")   
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")

Output:

Peak to Peak range by column in Raw        X:[2.41e+03 4.00e+00 1.00e+00 9.50e+01]
Peak to Peak range by column in Normalized X:[5.85 6.14 2.06 3.69]

3. Create and Fit the Model

SGDRegressor implements linear regression using Stochastic Gradient Descent.

# Create an instance of the model
sgdr = SGDRegressor(max_iter=1000)
# Fit the model to the normalized data
sgdr.fit(X_norm, y_train)
print(sgdr)
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")

Output:

SGDRegressor()
number of iterations completed: 106, number of weight updates: 10495.0

4. View Parameters

The learned parameters are stored in sgdr.coef_ (for w) and sgdr.intercept_ (for b).

b_norm = sgdr.intercept_
w_norm = sgdr.coef_
print(f"model parameters:                   w: {w_norm}, b:{b_norm}")
print( "model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16")

Output:

model parameters:                   w: [109.79 -20.87 -32.26 -38.1 ], b:[363.16]
model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16

The results are very close to our manual implementation!

5. Make Predictions and Plot Results

# make a prediction using sgdr.predict()
y_pred_sgd = sgdr.predict(X_norm)

# make a prediction using w,b. 
y_pred = np.dot(X_norm, w_norm) + b_norm  
print(f"prediction using np.dot() and sgdr.predict match: {(y_pred == y_pred_sgd).all()}")

print(f"Prediction on training set:\n{y_pred[:4]}" )
print(f"Target values \n{y_train[:4]}")

Output:

prediction using np.dot() and sgdr.predict match: True
Prediction on training set:
[295.19 485.84 389.68 492.  ]
Target values 
[300.  509.8 394.  540. ]

# plot predictions and targets vs original features
fig,ax=plt.subplots(1,4,figsize=(12,3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train, label = 'target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:,i],y_pred,color=dlc["dlorange"], label = 'predict')
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()


Lab: Building Linear Regression from Scratch

This lab walks through the implementation of linear regression for a single variable from the ground up.

1 - Packages

import numpy as np
import matplotlib.pyplot as plt
from utils import *
import copy
import math
%matplotlib inline

2 - Problem Statement

You are the CEO of a restaurant franchise. You have data on profits and populations from cities where your restaurants are located. You want to use this data to predict profits for new candidate cities.

3 - Dataset

  • x_train: Population of a city (in 10,000s).
  • y_train: Profit of a restaurant in that city (in $10,000s).

# load the dataset
x_train, y_train = load_data()

View and Visualize Data

print ('The shape of x_train is:', x_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(x_train))

# Create a scatter plot of the data
plt.scatter(x_train, y_train, marker='x', c='r') 
plt.title("Profits vs. Population per city")
plt.ylabel('Profit in $10,000')
plt.xlabel('Population of City in 10,000s')
plt.show()

Output:

The shape of x_train is: (97,)
The shape of y_train is:  (97,)
Number of training examples (m): 97

4 - Linear Regression Refresher

For a single feature, the model is f_{w,b}(x) = w x + b; gradient descent will learn the parameters w and b that make this line fit the training data well.

5 - Compute Cost Function J(w,b)

J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^2

Exercise 1: compute_cost

Implement the cost function.

def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    """
    m = x.shape[0] 
    total_cost = 0
    
    for i in range(m):
        f_wb_i = w * x[i] + b
        cost_i = (f_wb_i - y[i]) ** 2
        total_cost += cost_i
    
    total_cost = total_cost / (2 * m)
    return total_cost
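
For reference, an equivalent vectorized version (a sketch, not required by the exercise) computes the same cost without the explicit loop:

def compute_cost_vectorized(x, y, w, b):
    """Same cost as compute_cost above, using NumPy vector operations."""
    err = w * x + b - y
    return np.sum(err ** 2) / (2 * x.shape[0])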

Test compute_cost

initial_w = 2
initial_b = 1

cost = compute_cost(x_train, y_train, initial_w, initial_b)
print(f'Cost at initial w: {cost:.3f}')

Output:

Cost at initial w: 75.203
All tests passed!

6 - Gradient Descent

Exercise 2: compute_gradient

Implement the function to compute the gradients ∂J/∂w and ∂J/∂b.

\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)}

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    """
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):
        f_wb_i = w * x[i] + b
        err_i = f_wb_i - y[i]
        dj_dw += err_i * x[i]
        dj_db += err_i
    
    dj_dw  = dj_dw / m
    dj_db = dj_db / m
        
    return dj_dw, dj_db

Hint: here is one way to structure the implementation of this function, with comments on each step:

def compute_gradient(x, y, w, b): 
   """
   Computes the gradient for linear regression 
   """
   # Number of training examples
   m = x.shape[0]

   # You need to return the following variables correctly
   dj_dw = 0
   dj_db = 0

   # Loop over examples
   for i in range(m):  
       # Get prediction f_wb for the ith example
       f_wb = w * x[i] + b 

       # Get the error for the ith example
       err = f_wb - y[i]
       
       # Get the gradient for w from the ith example 
       dj_dw_i = err * x[i]

       # Get the gradient for b from the ith example 
       dj_db_i = err

       # Update dj_db : In Python, a += 1  is the same as a = a + 1
       dj_db += dj_db_i

       # Update dj_dw
       dj_dw += dj_dw_i

   # Divide both dj_dw and dj_db by m
   dj_dw = dj_dw / m
   dj_db = dj_db / m

   return dj_dw, dj_db
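
For comparison, a vectorized sketch (not part of the graded exercise) computes the same gradients without the loop:

def compute_gradient_vectorized(x, y, w, b):
    """Same gradients as compute_gradient above, using NumPy vector operations."""
    err = w * x + b - y
    dj_dw = np.mean(err * x)
    dj_db = np.mean(err)
    return dj_dw, dj_db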

Test compute_gradient

# Compute and display gradient with w initialized to zeroes
initial_w = 0
initial_b = 0

tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, initial_w, initial_b)
print('Gradient at initial w, b (zeros):', tmp_dj_dw, tmp_dj_db)

# Compute and display cost and gradient with non-zero w
test_w = 0.2
test_b = 0.2
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, test_w, test_b)
print('Gradient at test w, b:', tmp_dj_dw, tmp_dj_db)

Output:

Gradient at initial w, b (zeros): -65.32884974555672 -5.83913505154639
Gradient at test w, b: -47.41610118114435 -4.007175051546391
All tests passed!

Learning Parameters with Batch Gradient Descent

Now we combine the cost and gradient functions to implement gradient descent.

def gradient_descent(x, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn theta. Updates theta by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x :    (ndarray): Shape (m,)
      y :    (ndarray): Shape (m,)
      w_in, b_in : (scalar) Initial values of parameters of the model
      cost_function: function to compute cost
      gradient_function: function to compute the gradient
      alpha : (float) Learning rate
      num_iters : (int) number of iterations to run gradient descent
    Returns
      w : (ndarray): Shape (1,) Updated values of parameters of the model after
          running gradient descent
      b : (scalar)                Updated value of parameter of the model after
          running gradient descent
    """
    
    # number of training examples
    m = len(x)
    
    # An array to store cost J and w's at each iteration — primarily for graphing later
    J_history = []
    w_history = []
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_dw, dj_db = gradient_function(x, y, w, b )  

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw               
        b = b - alpha * dj_db               

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            cost =  cost_function(x, y, w, b)
            J_history.append(cost)

        # Print cost at 10 evenly spaced intervals (or every iteration if num_iters < 10)
        if i% math.ceil(num_iters/10) == 0:
            w_history.append(w)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")
        
    return w, b, J_history, w_history #return w and J,w history for graphing

Run Gradient Descent

# initialize fitting parameters
initial_w = 0.
initial_b = 0.

# some gradient descent settings
iterations = 1500
alpha = 0.01

w,b,_,_ = gradient_descent(x_train ,y_train, initial_w, initial_b, 
                     compute_cost, compute_gradient, alpha, iterations)
print("w,b found by gradient descent:", w, b)

Output:

Iteration    0: Cost     6.74   
Iteration  150: Cost     5.31   
Iteration  300: Cost     4.96   
Iteration  450: Cost     4.76   
Iteration  600: Cost     4.64   
Iteration  750: Cost     4.57   
Iteration  900: Cost     4.53   
Iteration 1050: Cost     4.51   
Iteration 1200: Cost     4.50   
Iteration 1350: Cost     4.49   
w,b found by gradient descent: 1.166362350335582 -3.63029143940436

Plot the Linear Fit

m = x_train.shape[0]
predicted = np.zeros(m)

for i in range(m):
    predicted[i] = w * x_train[i] + b
    
# Plot the linear fit
plt.plot(x_train, predicted, c = "b")
# Create a scatter plot of the data. 
plt.scatter(x_train, y_train, marker='x', c='r') 
plt.title("Profits vs. Population per city")
plt.ylabel('Profit in $10,000')
plt.xlabel('Population of City in 10,000s')

Make Predictions

predict1 = 3.5 * w + b
print('For population = 35,000, we predict a profit of $%.2f' % (predict1*10000))

predict2 = 7.0 * w + b
print('For population = 70,000, we predict a profit of $%.2f' % (predict2*10000))

Output:

For population = 35,000, we predict a profit of $4519.77
For population = 70,000, we predict a profit of $45342.45

Practice Quiz

1. Which of the following is a valid step used during feature scaling?

A. Subtract the mean (average) from each value and then divide by (max - min). This method is called Mean Normalization.

2. Suppose a friend ran gradient descent three separate times with three choices of the learning rate α and plotted the learning curves. For which case, A or B, was the learning rate α likely too large?

B. Case B only. The cost is increasing, which indicates that gradient descent is diverging, a classic sign of a learning rate that is too large.

3. Of the circumstances below, for which one is feature scaling particularly helpful?

A. Feature scaling is helpful when one feature is much larger (or smaller) than another feature. For example, "house size" in square feet (e.g., ~2000) is much larger than "number of bedrooms" (e.g., ~1-5). Scaling helps balance their influence during training.

4. You are helping a grocery store predict its revenue and have data on its items sold per week and price per item. What could be a useful engineered feature?

A. For each product, calculate the number of items sold times price per item.
This new feature directly represents the revenue for each product, which is likely a very strong predictor for the store's total revenue.

5. True/False? With polynomial regression, the predicted values f_{w,b}(x) do not necessarily have to be a straight line (or linear) function of the input feature x.

B. True
By creating polynomial features (like x^2, x^3, etc.), we can model non-linear relationships, resulting in a curved prediction line.
