Gradient Descent in Practice
Advanced techniques for linear regression, including feature scaling, feature engineering, and polynomial regression, along with practical lab exercises.
Advanced Regression Techniques
Scaling Features for Better Performance
When features in a dataset have vastly different ranges, it can slow down the convergence of gradient descent. Feature scaling is a technique to bring all features into a similar range, which helps gradient descent find the optimal solution more quickly.
For example, consider predicting house prices with two features:
- Size (sq. ft.): Ranges from 300 to 5000.
- Number of Bedrooms: Ranges from 1 to 5.
Because size takes on large values, a good model's parameter for size tends to be very small, while the parameter for bedrooms tends to be much larger.

The Impact on the Cost Function
This disparity in feature scales leads to a cost function with elongated, skinny contours.
- A small change in $w_1$ (for size) causes a large change in the cost.
- A large change in $w_2$ (for bedrooms) is needed to affect the cost similarly.

As a result, the contour plot looks like a set of tall, narrow ellipses. Gradient descent can struggle with such a surface, often bouncing back and forth inefficiently before reaching the minimum.
By scaling the features (e.g., transforming both to a range of 0 to 1), the contours of the cost function become more circular. This allows gradient descent to take a more direct path to the global minimum.
Goal of Feature Scaling
The main goal is to transform features so they have comparable ranges. This ensures that each feature contributes more equally to the model's learning process and helps gradient descent converge faster.
Methods for Feature Scaling
Here are three common methods for feature scaling:
- Dividing by the Maximum
  - Formula: $x_{j,\text{scaled}} = \dfrac{x_j}{\max(x_j)}$
  - This scales the feature to a range between 0 and 1 (or -1 and 1 if there are negative values). It's simple and effective for features that are strictly positive.

- Mean Normalization
  - Formula: $x_{j,\text{scaled}} = \dfrac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$
  - This method centers the data around 0 and scales it by the range. The resulting features will generally be in the range of -1 to 1.

- Z-Score Normalization
  - Formula: $x_{j,\text{scaled}} = \dfrac{x_j - \mu_j}{\sigma_j}$ (where $\mu_j$ is the mean and $\sigma_j$ is the standard deviation of feature $j$)
  - This is a very common and effective method. It rescales features to have a mean of 0 and a standard deviation of 1.

Note: When in doubt, applying feature scaling is generally a good idea and rarely hurts the model's performance.
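For illustration, here is a minimal NumPy sketch of the three methods applied column-wise to a small, made-up feature matrix X (the array and its values are assumptions, not lab code):
import numpy as np

# Hypothetical feature matrix: column 0 = size (sq. ft.), column 1 = bedrooms
X = np.array([[2104, 5],
              [1416, 3],
              [ 852, 2]], dtype=float)

# 1. Divide by the maximum of each column
X_max_scaled = X / X.max(axis=0)

# 2. Mean normalization: (x - mean) / (max - min)
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 3. Z-score normalization: (x - mean) / standard deviation
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_max_scaled, X_mean_norm, X_zscore, sep="\n")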

Knowledge Check
Question: Which of the following is a valid step used during feature scaling?

A. Multiply each value by the maximum value for that feature.
B. Divide each value by the maximum value for that feature.
Answer:
B. By dividing all values by the maximum, the new range of the rescaled features will have a maximum value of 1.
Monitoring Gradient Descent
It's crucial to monitor gradient descent to ensure it's converging correctly.
Checking for Convergence
A learning curve is a plot of the cost function over the number of iterations.
- If gradient descent is working correctly, the cost should decrease after every iteration.
- The curve should eventually flatten out, indicating that the algorithm has converged to a minimum.
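As a rough sketch (not part of the labs), one common automatic convergence test stops when the cost decreases by less than a small threshold ε between iterations; the J_history values and ε below are made up for illustration:
# Hypothetical cost values recorded at each iteration of gradient descent
J_history = [1000.0, 400.0, 150.0, 60.0, 59.999, 59.9989]

epsilon = 1e-3  # convergence threshold (assumed value)
for i in range(1, len(J_history)):
    if J_history[i - 1] - J_history[i] < epsilon:
        print(f"Converged at iteration {i}")
        break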

Choosing the Right Learning Rate (α)
The learning rate α is a critical hyperparameter.
- If α is too large: The cost may increase or oscillate wildly. The algorithm might "overshoot" the minimum and diverge.
- If α is too small: Gradient descent will be very slow to converge.

A good approach is to try a range of values (e.g., 0.001, 0.01, 0.1, 1.0) and plot their learning curves to find a value that causes the cost to decrease quickly and consistently.
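The sketch below illustrates that approach on a tiny made-up one-feature dataset, using a hand-rolled gradient descent rather than the lab utilities; only the shape of each learning curve matters here:
import numpy as np
import matplotlib.pyplot as plt

# Tiny toy dataset: y is roughly 2*x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 7.2, 8.9])

def cost(w, b):
    return np.mean((w * x + b - y) ** 2) / 2

def run_gd(alpha, iters=50):
    # Run gradient descent and record the cost after every iteration
    w, b = 0.0, 0.0
    history = []
    for _ in range(iters):
        err = w * x + b - y
        w -= alpha * np.mean(err * x)   # dJ/dw
        b -= alpha * np.mean(err)       # dJ/db
        history.append(cost(w, b))
    return history

# One learning curve per candidate learning rate
for alpha in [0.001, 0.01, 0.1]:
    plt.plot(run_gd(alpha), label=f"alpha = {alpha}")
plt.xlabel("iteration"); plt.ylabel("cost"); plt.legend(); plt.show()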

Knowledge Check
Question: You run gradient descent for 15 iterations with α = 0.3 and compute the cost J(w,b) after each iteration. You find that the value of J(w,b) increases over time. How do you think you should adjust the learning rate α?

A. Try a larger value of α = 1.0
B. Keep running it for additional iterations
C. Try a smaller value of α = 0.1
D. Try running it for only 10 iterations so J(w,b) doesn't increase as much.
Answer:
C. Since the cost function is increasing, we know that gradient descent is diverging. This indicates that the learning rate is too high, so we should try a smaller value.
Lab: Feature Scaling and Learning Rate in Practice
Goals
- Run Gradient Descent on a dataset with multiple features.
- Explore the impact of the learning rate on convergence.
- Improve performance by applying Z-score normalization.
Problem Statement
You will use a housing dataset with four features to predict the price of a house.
| Feature | Description |
|---|---|
| size(sqft) | Size of the house in square feet. |
| bedrooms | Number of bedrooms. |
| floors | Number of floors. |
| age | Age of the house in years. |

Setup
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import load_house_data, run_gradient_descent
from lab_utils_multi import norm_plot, plt_equal_scale, plot_cost_i_w
from lab_utils_common import dlc
np.set_printoptions(precision=2)
plt.style.use('./deeplearning.mplstyle')
Load and Visualize the Data
# load the dataset
X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']
# Plot each feature vs. the target, price
fig,ax=plt.subplots(1, 4, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("Price (1000's)")
plt.show()
The plots show that size and age have a stronger correlation with price than bedrooms or floors.
Gradient Descent for Multiple Variables
The update rules for gradient descent remain the same, but are now applied to each parameter $w_j$ and to $b$.
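For reference, with $n$ features the update performed at each step is the simultaneous pair:

$$w_j := w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{for } j = 1,\dots,n$$

$$b := b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}\right)$$

where $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$.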
The Impact of the Learning Rate (α)
Let's run gradient descent with the raw (unscaled) data and observe the effect of different learning rates.
1. α = 9.9e-7 (Too Large)
#set alpha to 9.9e-7
_, _, hist = run_gradient_descent(X_train, y_train, 10, alpha = 9.9e-7)
The cost function increases with each iteration, a clear sign that the learning rate is too high and the algorithm is diverging.
Click to see full output
Iteration Cost w0 w1 w2 w3 b djdw0 djdw1 djdw2 djdw3 djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 9.55884e+04 5.5e-01 1.0e-03 5.1e-04 1.2e-02 3.6e-04 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
1 1.28213e+05 -8.8e-02 -1.7e-04 -1.0e-04 -3.4e-03 -4.8e-05 6.4e+05 1.2e+03 6.2e+02 1.6e+04 4.1e+02
2 1.72159e+05 6.5e-01 1.2e-03 5.9e-04 1.3e-02 4.3e-04 -7.4e+05 -1.4e+03 -7.0e+02 -1.7e+04 -4.9e+02
3 2.31358e+05 -2.1e-01 -4.0e-04 -2.3e-04 -7.5e-03 -1.2e-04 8.6e+05 1.6e+03 8.3e+02 2.1e+04 5.6e+02
4 3.11100e+05 7.9e-01 1.4e-03 7.1e-04 1.5e-02 5.3e-04 -1.0e+06 -1.8e+03 -9.5e+02 -2.3e+04 -6.6e+02
5 4.18517e+05 -3.7e-01 -7.1e-04 -4.0e-04 -1.3e-02 -2.1e-04 1.2e+06 2.1e+03 1.1e+03 2.8e+04 7.5e+02
6 5.63212e+05 9.7e-01 1.7e-03 8.7e-04 1.8e-02 6.6e-04 -1.3e+06 -2.5e+03 -1.3e+03 -3.1e+04 -8.8e+02
7 7.58122e+05 -5.8e-01 -1.1e-03 -6.2e-04 -1.9e-02 -3.4e-04 1.6e+06 2.9e+03 1.5e+03 3.8e+04 1.0e+03
8 1.02068e+06 1.2e+00 2.2e-03 1.1e-03 2.3e-02 8.3e-04 -1.8e+06 -3.3e+03 -1.7e+03 -4.2e+04 -1.2e+03
9 1.37435e+06 -8.7e-01 -1.7e-03 -9.1e-04 -2.7e-02 -5.2e-04 2.1e+06 3.9e+03 2.0e+03 5.1e+04 1.4e+03
w,b found by gradient descent: w: [-0.87 -0. -0. -0.03], b: -0.00
plot_cost_i_w(X_train, y_train, hist)
2. α = 9e-7 (Moderate)
#set alpha to 9e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 9e-7)
The cost is now decreasing, but the parameter w0 is oscillating around its optimal value, indicating the learning rate is still a bit high, causing it to "jump over" the minimum. It will eventually converge, but slowly.
Click to see full output
Iteration Cost w0 w1 w2 w3 b djdw0 djdw1 djdw2 djdw3 djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 6.64616e+04 5.0e-01 9.1e-04 4.7e-04 1.1e-02 3.3e-04 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
1 6.18990e+04 1.8e-02 2.1e-05 2.0e-06 -7.9e-04 1.9e-05 5.3e+05 9.8e+02 5.2e+02 1.3e+04 3.4e+02
2 5.76572e+04 4.8e-01 8.6e-04 4.4e-04 9.5e-03 3.2e-04 -5.1e+05 -9.3e+02 -4.8e+02 -1.1e+04 -3.4e+02
3 5.37137e+04 3.4e-02 3.9e-05 2.8e-06 -1.6e-03 3.8e-05 4.9e+05 9.1e+02 4.8e+02 1.2e+04 3.2e+02
4 5.00474e+04 4.6e-01 8.2e-04 4.1e-04 8.0e-03 3.2e-04 -4.8e+05 -8.7e+02 -4.5e+02 -1.1e+04 -3.1e+02
5 4.66388e+04 5.0e-02 5.6e-05 2.5e-06 -2.4e-03 5.6e-05 4.6e+05 8.5e+02 4.5e+02 1.2e+04 2.9e+02
6 4.34700e+04 4.5e-01 7.8e-04 3.8e-04 6.4e-03 3.2e-04 -4.4e+05 -8.1e+02 -4.2e+02 -9.8e+03 -2.9e+02
7 4.05239e+04 6.4e-02 7.0e-05 1.2e-06 -3.3e-03 7.3e-05 4.3e+05 7.9e+02 4.2e+02 1.1e+04 2.7e+02
8 3.77849e+04 4.4e-01 7.5e-04 3.5e-04 4.9e-03 3.2e-04 -4.1e+05 -7.5e+02 -3.9e+02 -9.1e+03 -2.7e+02
9 3.52385e+04 7.7e-02 8.3e-05 -1.1e-06 -4.2e-03 8.9e-05 4.0e+05 7.4e+02 3.9e+02 1.0e+04 2.5e+02
w,b found by gradient descent: w: [ 7.74e-02 8.27e-05 -1.06e-06 -4.20e-03], b: 0.00
plot_cost_i_w(X_train, y_train, hist)
3. α = 1e-7 (Too Small)
#set alpha to 1e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 1e-7)
The cost decreases steadily, but very slowly. This would require many more iterations to converge.
Click to see full output
Iteration Cost w0 w1 w2 w3 b djdw0 djdw1 djdw2 djdw3 djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 4.42313e+04 5.5e-02 1.0e-04 5.2e-05 1.2e-03 3.6e-05 -5.5e+05 -1.0e+03 -5.2e+02 -1.2e+04 -3.6e+02
1 2.76461e+04 9.8e-02 1.8e-04 9.2e-05 2.2e-03 6.5e-05 -4.3e+05 -7.9e+02 -4.0e+02 -9.5e+03 -2.8e+02
2 1.75102e+04 1.3e-01 2.4e-04 1.2e-04 2.9e-03 8.7e-05 -3.4e+05 -6.1e+02 -3.1e+02 -7.3e+03 -2.2e+02
3 1.13157e+04 1.6e-01 2.9e-04 1.5e-04 3.5e-03 1.0e-04 -2.6e+05 -4.8e+02 -2.4e+02 -5.6e+03 -1.8e+02
4 7.53002e+03 1.8e-01 3.3e-04 1.7e-04 3.9e-03 1.2e-04 -2.1e+05 -3.7e+02 -1.9e+02 -4.2e+03 -1.4e+02
5 5.21639e+03 2.0e-01 3.5e-04 1.8e-04 4.2e-03 1.3e-04 -1.6e+05 -2.9e+02 -1.5e+02 -3.1e+03 -1.1e+02
6 3.80242e+03 2.1e-01 3.8e-04 1.9e-04 4.5e-03 1.4e-04 -1.3e+05 -2.2e+02 -1.1e+02 -2.3e+03 -8.6e+01
7 2.93826e+03 2.2e-01 3.9e-04 2.0e-04 4.6e-03 1.4e-04 -9.8e+04 -1.7e+02 -8.6e+01 -1.7e+03 -6.8e+01
8 2.41013e+03 2.3e-01 4.1e-04 2.1e-04 4.7e-03 1.5e-04 -7.7e+04 -1.3e+02 -6.5e+01 -1.2e+03 -5.4e+01
9 2.08734e+03 2.3e-01 4.2e-04 2.1e-04 4.8e-03 1.5e-04 -6.0e+04 -1.0e+02 -4.9e+01 -7.5e+02 -4.3e+01
w,b found by gradient descent: w: [2.31e-01 4.18e-04 2.12e-04 4.81e-03], b: 0.00
This process highlights the difficulty of finding a good learning rate when features have very different scales.
Feature Scaling in Action
Let's apply Z-score normalization to solve this problem.
Implementation
def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column

    Args:
      X (ndarray (m,n))     : input data, m examples, n features
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma

    return (X_norm, mu, sigma)
Visualizing the Transformation
The normalization process centers each feature around zero and gives it a standard deviation of one.

- Left (Unnormalized): The scale of size(sqft) is vastly different from age.
- Middle (Mean Subtracted): The features are centered around zero.
- Right (Z-score Normalized): Both features are now centered at zero and have a similar scale.
Applying Normalization to the Data
# normalize the original features
X_norm, X_mu, X_sigma = zscore_normalize_features(X_train)
print(f"X_mu = {X_mu}, \nX_sigma = {X_sigma}")
print(f"Peak to Peak range by column in Raw X:{np.ptp(X_train,axis=0)}")
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")Output:
X_mu = [1.42e+03 2.72e+00 1.38e+00 3.84e+01],
X_sigma = [411.62 0.65 0.49 25.78]
Peak to Peak range by column in Raw X:[2.41e+03 4.00e+00 1.00e+00 9.50e+01]
Peak to Peak range by column in Normalized X:[5.85 6.14 2.06 3.69]
The peak-to-peak range is now much more consistent across features.

Rerunning Gradient Descent with Normalized Data
With scaled features, we can use a much larger learning rate, α = 1.0e-1, which drastically speeds up convergence.
w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 1000, 1.0e-1)
Click to see full output
Iteration Cost w0 w1 w2 w3 b djdw0 djdw1 djdw2 djdw3 djdb
---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 5.76170e+04 8.9e+00 3.0e+00 3.3e+00 -6.0e+00 3.6e+01 -8.9e+01 -3.0e+01 -3.3e+01 6.0e+01 -3.6e+02
100 2.21086e+02 1.1e+02 -2.0e+01 -3.1e+01 -3.8e+01 3.6e+02 -9.2e-01 4.5e-01 5.3e-01 -1.7e-01 -9.6e-03
200 2.19209e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -3.0e-02 1.5e-02 1.7e-02 -6.0e-03 -2.6e-07
300 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -1.0e-03 5.1e-04 5.7e-04 -2.0e-04 -6.9e-12
400 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -3.4e-05 1.7e-05 1.9e-05 -6.6e-06 -2.7e-13
500 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -1.1e-06 5.6e-07 6.2e-07 -2.2e-07 -2.6e-13
600 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -3.7e-08 1.9e-08 2.1e-08 -7.3e-09 -2.6e-13
700 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -1.2e-09 6.2e-10 6.9e-10 -2.4e-10 -2.6e-13
800 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -4.1e-11 2.1e-11 2.3e-11 -8.1e-12 -2.7e-13
900 2.19207e+02 1.1e+02 -2.1e+01 -3.3e+01 -3.8e+01 3.6e+02 -1.4e-12 7.0e-13 7.6e-13 -2.7e-13 -2.6e-13
w,b found by gradient descent: w: [110.56 -21.27 -32.71 -37.97], b: 363.16
The model converges very quickly to a low cost. The scaled features allow for much faster and more stable training.
Predictions vs. Target Values
#predict target using normalized features
m = X_norm.shape[0]
yp = np.zeros(m)
for i in range(m):
    yp[i] = np.dot(X_norm[i], w_norm) + b_norm
# plot predictions and targets versus original features
fig,ax=plt.subplots(1,4,figsize=(12, 3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train, label = 'target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:,i],yp,color=dlc["dlorange"], label = 'predict')
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()
The model provides good predictions across all features.
Predicting on New Data
To predict the price of a new house, you must normalize its features using the same $\mu$ and $\sigma$ calculated from the training set.
# Predict the price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old.
x_house = np.array([1200, 3, 1, 40])
x_house_norm = (x_house - X_mu) / X_sigma
print(x_house_norm)
x_house_predict = np.dot(x_house_norm, w_norm) + b_norm
print(f" predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = ${x_house_predict*1000:0.0f}")Output:
[-0.53 0.43 -0.79 0.06]
predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = $318709
Cost Contours Comparison
These plots visually confirm why feature scaling works. The contours for the unscaled data are elongated, while the normalized data yields circular contours, making the path to the minimum much more direct.

Feature Engineering & Polynomial Regression
Linear regression can be extended to model non-linear relationships through feature engineering, which is the process of creating new features by transforming or combining existing ones.
Feature Engineering
The choice of features significantly impacts a model's performance. By using domain knowledge and intuition, we can create features that better capture the underlying patterns in the data.
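For example (a hypothetical illustration, not part of the lab data), if a housing dataset included a lot's frontage and depth, you could engineer an area feature that is likely more predictive of price than either dimension alone:
import numpy as np

# Hypothetical raw features
frontage = np.array([50.0, 40.0, 60.0])    # lot width in feet
depth    = np.array([100.0, 80.0, 120.0])  # lot depth in feet

# Engineered feature: lot area = frontage * depth
area = frontage * depth

# Combine original and engineered features into one training matrix
X = np.c_[frontage, depth, area]
print(X)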

Knowledge Check
Question: If you have measurements for the dimensions of a swimming pool (length, width, height), which of the following two would be a more useful engineered feature?
A. length × width × height
B. length + width + height
Answer:
A. The volume of the swimming pool is likely a more useful feature for many prediction tasks (e.g., predicting the cost to fill it) than the sum of its dimensions.
Polynomial Regression
What if your data doesn't follow a straight line? You can still use linear regression by engineering new polynomial features.
For example, if your data seems to follow a quadratic curve:

Instead of the model $f_{w,b}(x) = wx + b$, you can create a new feature $x^2$ and fit the model $f_{\mathbf{w},b}(x) = w_1 x + w_2 x^2 + b$.
Even though this function is a curve with respect to $x$, it is a linear function with respect to the features $x$ and $x^2$. This allows us to use the same linear regression algorithm.

Lab: Implementing Polynomial Regression
Goals
- Explore how to use feature engineering to fit non-linear data.
- Understand how linear regression can model complex functions through polynomial features.
Setup
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2) # reduced display precision on numpy arrays

Linear Model on Non-linear Data
Let's try to fit a linear model to quadratic data, $y = 1 + x^2$.
# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2
X = x.reshape(-1, 1)
model_w,model_b = run_gradient_descent_feng(X,y,iterations=1000, alpha = 1e-2)
plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("no feature engineering")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("X"); plt.ylabel("y"); plt.legend(); plt.show()
Output:
w,b found by gradient descent: w: [18.7], b: -52.0834
As expected, a straight line is a poor fit for this data.
Adding a Polynomial Feature (x²)
Now, let's engineer a new feature, $x^2$, and train the model again.
# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2
# Engineer features
X = x**2 #<-- added engineered feature
X = X.reshape(-1, 1) #X should be a 2-D Matrix
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)
plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Added x**2 feature")
plt.plot(x, np.dot(X,model_w) + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
Output:
w,b found by gradient descent: w: [1.], b: 0.0490
The fit is now nearly perfect! The learned parameters, $w \approx 1$ and $b \approx 0.05$, are very close to the true model $y = 1 + x^2$.
Selecting Features
What if we aren't sure which polynomial terms are needed? We can add several and let gradient descent figure it out. Let's try fitting $y = x^2$ with features $x$, $x^2$, and $x^3$.
# create target data
x = np.arange(0, 20, 1)
y = x**2
# engineer features
X = np.c_[x, x**2, x**3]
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)
plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
Output:
w,b found by gradient descent: w: [0.08 0.54 0.03], b: 0.0106
The learned model is approximately $0.08x + 0.54x^2 + 0.03x^3 + 0.0106$. Gradient descent has assigned the largest weight (0.54) to the $x^2$ feature, correctly identifying it as the most important one. The weights for $x$ and $x^3$ are much smaller.
An Alternate View
The best features for linear regression are those that have a linear relationship with the target $y$. Plotting our engineered features against $y$ confirms this.
# create target data
x = np.arange(0, 20, 1)
y = x**2
# engineer features
X = np.c_[x, x**2, x**3]
X_features = ['x','x^2','x^3']
fig,ax=plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i],y)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("y")
plt.show()
Clearly, $x^2$ has a linear relationship with $y$, making it the perfect feature for a linear regression model.
Scaling Features
When creating polynomial features like $x$, $x^2$, and $x^3$, their scales will be vastly different. Feature scaling is essential here to speed up gradient descent.
# create target data
x = np.arange(0,20,1)
y = x**2
X = np.c_[x, x**2, x**3]
print(f"Peak to Peak range by column in Raw X:{np.ptp(X,axis=0)}")
# add z-score normalization
X = zscore_normalize_features(X)
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X,axis=0)}")Output:
Peak to Peak range by column in Raw X:[ 19 361 6859]
Peak to Peak range by column in Normalized X:[3.3 3.18 3.28]
Now, with scaled features, we can use a much larger learning rate and converge faster.
model_w, model_b = run_gradient_descent_feng(X, y, iterations=100000, alpha=1e-1)
# ... plotting code ...
Output:
w,b found by gradient descent: w: [5.27e-05 1.13e+02 8.43e-05], b: 123.5000
After normalization, gradient descent gives a much larger weight to the $x^2$ term and almost zero weight to the others, resulting in a very accurate model.
Modeling Complex Functions
With enough polynomial features, we can model even highly complex functions, like a cosine wave.
x = np.arange(0,20,1)
y = np.cos(x/2)
# Engineer features up to x^13
X = np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13]
# Normalize them
X = zscore_normalize_features(X)
model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha = 1e-1)
# ... plotting code ...
Lab: Linear Regression with Scikit-Learn
Instead of implementing algorithms from scratch, we can use powerful, open-source libraries like scikit-learn.
Goals
- Utilize scikit-learn to implement linear regression using Gradient Descent.
Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from lab_utils_multi import load_house_data
from lab_utils_common import dlc
np.set_printoptions(precision=2)
plt.style.use('./deeplearning.mplstyle')
1. Load the Data
X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']
2. Scale/Normalize the Data
Scikit-learn's StandardScaler performs Z-score normalization.
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
print(f"Peak to Peak range by column in Raw X:{np.ptp(X_train,axis=0)}")
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")Output:
Peak to Peak range by column in Raw X:[2.41e+03 4.00e+00 1.00e+00 9.50e+01]
Peak to Peak range by column in Normalized X:[5.85 6.14 2.06 3.69]
3. Create and Fit the Model
SGDRegressor implements linear regression using Stochastic Gradient Descent.
# Create an instance of the model
sgdr = SGDRegressor(max_iter=1000)
# Fit the model to the normalized data
sgdr.fit(X_norm, y_train)
print(sgdr)
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")Output:
SGDRegressor()
number of iterations completed: 106, number of weight updates: 10495.0
4. View Parameters
The learned parameters are stored in sgdr.coef_ (for $\mathbf{w}$) and sgdr.intercept_ (for $b$).
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
print(f"model parameters: w: {w_norm}, b:{b_norm}")
print( "model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16")Output:
model parameters: w: [109.79 -20.87 -32.26 -38.1 ], b:[363.16]
model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16
The results are very close to our manual implementation!
5. Make Predictions and Plot Results
# make a prediction using sgdr.predict()
y_pred_sgd = sgdr.predict(X_norm)
# make a prediction using w,b.
y_pred = np.dot(X_norm, w_norm) + b_norm
print(f"prediction using np.dot() and sgdr.predict match: {(y_pred == y_pred_sgd).all()}")
print(f"Prediction on training set:\n{y_pred[:4]}" )
print(f"Target values \n{y_train[:4]}")Output:
prediction using np.dot() and sgdr.predict match: True
Prediction on training set:
[295.19 485.84 389.68 492. ]
Target values
[300. 509.8 394. 540. ]
# plot predictions and targets vs original features
fig,ax=plt.subplots(1,4,figsize=(12,3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train, label = 'target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:,i],y_pred,color=dlc["dlorange"], label = 'predict')
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()
Lab: Building Linear Regression from Scratch
This lab walks through the implementation of linear regression for a single variable from the ground up.
1 - Packages
import numpy as np
import matplotlib.pyplot as plt
from utils import *
import copy
import math
%matplotlib inline
2 - Problem Statement
You are the CEO of a restaurant franchise. You have data on profits and populations from cities where your restaurants are located. You want to use this data to predict profits for new candidate cities.
3 - Dataset
- x_train: Population of a city (in 10,000s).
- y_train: Profit of a restaurant in that city (in $10,000s).
# load the dataset
x_train, y_train = load_data()
View and Visualize Data
print ('The shape of x_train is:', x_train.shape)
print ('The shape of y_train is: ', y_train.shape)
print ('Number of training examples (m):', len(x_train))
# Create a scatter plot of the data
plt.scatter(x_train, y_train, marker='x', c='r')
plt.title("Profits vs. Population per city")
plt.ylabel('Profit in $10,000')
plt.xlabel('Population of City in 10,000s')
plt.show()
Output:
The shape of x_train is: (97,)
The shape of y_train is: (97,)
Number of training examples (m): 97
4 - Linear Regression Refresher

5 - Compute Cost Function J(w,b)
Exercise 1: compute_cost
Implement the cost function.
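The quantity to compute is the squared-error cost over the $m$ training examples:

$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2, \qquad f_{w,b}(x^{(i)}) = w\,x^{(i)} + b$$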

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.
    """
    m = x.shape[0]
    total_cost = 0
    for i in range(m):
        f_wb_i = w * x[i] + b
        cost_i = (f_wb_i - y[i]) ** 2
        total_cost += cost_i
    total_cost = total_cost / (2 * m)

    return total_cost
Test compute_cost
initial_w = 2
initial_b = 1
cost = compute_cost(x_train, y_train, initial_w, initial_b)
print(f'Cost at initial w: {cost:.3f}')
Output:
Cost at initial w: 75.203
All tests passed!
6 - Gradient Descent

Exercise 2: compute_gradient
Implement the function to compute the gradients $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$.
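Differentiating $J(w,b)$ gives the expressions the code computes:

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}, \qquad \frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$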
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    """
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        f_wb_i = w * x[i] + b
        err_i = f_wb_i - y[i]
        dj_dw += err_i * x[i]
        dj_db += err_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
Click for hints on implementing compute_gradient

Here's how you can structure the overall implementation for this function
def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression
    """
    # Number of training examples
    m = x.shape[0]

    # You need to return the following variables correctly
    dj_dw = 0
    dj_db = 0

    # Loop over examples
    for i in range(m):
        # Get prediction f_wb for the ith example
        f_wb = w * x[i] + b
        # Get the error for the ith example
        err = f_wb - y[i]
        # Get the gradient for w from the ith example
        dj_dw_i = err * x[i]
        # Get the gradient for b from the ith example
        dj_db_i = err
        # Update dj_db : In Python, a += 1 is the same as a = a + 1
        dj_db += dj_db_i
        # Update dj_dw
        dj_dw += dj_dw_i

    # Divide both dj_dw and dj_db by m
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db
Test compute_gradient
# Compute and display gradient with w initialized to zeroes
initial_w = 0
initial_b = 0
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, initial_w, initial_b)
print('Gradient at initial w, b (zeros):', tmp_dj_dw, tmp_dj_db)
# Compute and display cost and gradient with non-zero w
test_w = 0.2
test_b = 0.2
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, test_w, test_b)
print('Gradient at test w, b:', tmp_dj_dw, tmp_dj_db)
Output:
Gradient at initial w, b (zeros): -65.32884974555672 -5.83913505154639
Gradient at test w, b: -47.41610118114435 -4.007175051546391
All tests passed!
Learning Parameters with Batch Gradient Descent
Now we combine the cost and gradient functions to implement gradient descent.
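Each iteration applies the simultaneous updates

$$w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w,b)}{\partial b}$$

repeated for the requested number of iterations.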
def gradient_descent(x, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      x : (ndarray): Shape (m,)
      y : (ndarray): Shape (m,)
      w_in, b_in : (scalar) Initial values of parameters of the model
      cost_function: function to compute cost
      gradient_function: function to compute the gradient
      alpha : (float) Learning rate
      num_iters : (int) number of iterations to run gradient descent
    Returns
      w : (ndarray): Shape (1,) Updated values of parameters of the model after
          running gradient descent
      b : (scalar) Updated value of parameter of the model after
          running gradient descent
    """
    # number of training examples
    m = len(x)

    # Arrays to store cost J and w at each iteration — primarily for graphing later
    J_history = []
    w_history = []
    w = copy.deepcopy(w_in)  # avoid modifying global w within function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient and update the parameters
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # Update parameters using w, b, alpha and gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            cost = cost_function(x, y, w, b)
            J_history.append(cost)

        # Print the cost at intervals (10 times over the run, or every iteration if num_iters < 10)
        if i % math.ceil(num_iters/10) == 0:
            w_history.append(w)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f} ")

    return w, b, J_history, w_history  # return w and J, w history for graphing
Run Gradient Descent
# initialize fitting parameters
initial_w = 0.
initial_b = 0.
# some gradient descent settings
iterations = 1500
alpha = 0.01
w,b,_,_ = gradient_descent(x_train ,y_train, initial_w, initial_b,
compute_cost, compute_gradient, alpha, iterations)
print("w,b found by gradient descent:", w, b)Output:
Iteration 0: Cost 6.74
Iteration 150: Cost 5.31
Iteration 300: Cost 4.96
Iteration 450: Cost 4.76
Iteration 600: Cost 4.64
Iteration 750: Cost 4.57
Iteration 900: Cost 4.53
Iteration 1050: Cost 4.51
Iteration 1200: Cost 4.50
Iteration 1350: Cost 4.49
w,b found by gradient descent: 1.166362350335582 -3.63029143940436
Plot the Linear Fit
m = x_train.shape[0]
predicted = np.zeros(m)
for i in range(m):
    predicted[i] = w * x_train[i] + b
# Plot the linear fit
plt.plot(x_train, predicted, c = "b")
# Create a scatter plot of the data.
plt.scatter(x_train, y_train, marker='x', c='r')
plt.title("Profits vs. Population per city")
plt.ylabel('Profit in $10,000')
plt.xlabel('Population of City in 10,000s')
Make Predictions
predict1 = 3.5 * w + b
print('For population = 35,000, we predict a profit of $%.2f' % (predict1*10000))
predict2 = 7.0 * w + b
print('For population = 70,000, we predict a profit of $%.2f' % (predict2*10000))
Output:
For population = 35,000, we predict a profit of $4519.77
For population = 70,000, we predict a profit of $45342.45
Practice Quiz
1. Which of the following is a valid step used during feature scaling?

A. Subtract the mean (average) from each value and then divide by the (max - min). This method is called Mean Normalization.
2. Suppose a friend ran gradient descent three separate times with three choices of the learning rate and plotted the learning curves. For which case, A or B, was the learning rate likely too large?

B. Case B only. The cost is increasing, which indicates that gradient descent is diverging, a classic sign of a learning rate that is too large.
3. Of the circumstances below, for which one is feature scaling particularly helpful?
A. Feature scaling is helpful when one feature is much larger (or smaller) than another feature. For example, "house size" in square feet (e.g., ~2000) is much larger than "number of bedrooms" (e.g., ~1-5). Scaling helps balance their influence during training.
4. You are helping a grocery store predict its revenue and have data on its items sold per week and price per item. What could be a useful engineered feature?
A. For each product, calculate the number of items sold times price per item.
This new feature directly represents the revenue for each product, which is likely a very strong predictor for the store's total revenue.
5. True/False? With polynomial regression, the predicted values do not necessarily have to be a straight line (or linear) function of the input feature $x$.
B. True
By creating polynomial features (like , , etc.), we can model non-linear relationships, resulting in a curved prediction line.
Multiple Linear Regression
This section covers linear regression with multiple input variables, the concept of vectorization for computational efficiency, a practical guide to the NumPy library, and a hands-on lab implementation.
Classification with Logistic Regression
Classification is a type of supervised learning where the model's goal is to predict which category a new observation belongs to.