Cost Function for Logistic Regression
This section covers the Cost Function for Logistic Regression.
The Cost Function provides a way to measure how well a specific set of parameters fits the training data. It thereby acts as a compass, helping us choose a better set of parameters during optimization.
Why Not Mean Squared Error (MSE)?


In Linear Regression, we used the Mean Squared Error (MSE) function. You might consider using the same for Logistic Regression.
However, if we plot the cost values using the logistic regression output (which includes the non-linear sigmoid function), we get a non-convex cost function.
If we try to use MSE with this non-linear function:
- The cost function will have many local minima (valleys that are not the deepest point).
- Gradient Descent will likely get stuck in these local minima and fail to find the optimal parameters.
Therefore, we cannot use the Mean Squared Error Cost Function for Logistic Regression.
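To see this for yourself, here is a small, self-contained sketch (illustrative only, not the course's lab code) that evaluates the squared error cost of a sigmoid model over a range of w values with b held fixed; instead of a clean bowl, the curve flattens out where the sigmoid saturates:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Tiny binary-classification dataset
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])

b = -8                                   # hold b fixed and sweep w to get a 1-D slice of the cost
w_range = np.linspace(-10, 10, 400)
cost = [np.mean((sigmoid(w * x + b) - y) ** 2) / 2 for w in w_range]

plt.plot(w_range, cost)
plt.xlabel("w"); plt.ylabel("squared error cost")
plt.title("Squared error cost with a sigmoid model (not a simple bowl)")
plt.show()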
Logistic Loss Function
To handle classification (where $y^{(i)}$ is either 0 or 1), we define a specific Loss function for a single training example.

The Logistic Loss is defined piece-wise:

Case 1: Target $y^{(i)} = 1$

When the true label is 1, the loss is:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$$

- As $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \to 1$: The loss $\to 0$.
- As $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \to 0$: The loss $\to \infty$.

Insight: Loss is lowest when the prediction $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is close to the true label $y^{(i)}$.

Case 2: Target $y^{(i)} = 0$

When the true label is 0, the loss is:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$$

- If $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \to 0$: The loss $\to 0$.
- If $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) \to 1$: The loss $\to \infty$.

So, the further the prediction $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is from the true target value, the higher the loss.
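As a quick illustration, here is a minimal Python sketch (a hypothetical helper, not part of the original lab) that evaluates this piece-wise loss for a single example, where f is the model's sigmoid output and y is the true label:
import numpy as np

def logistic_loss_piecewise(f, y):
    """Piece-wise logistic loss for a single example.
    f: model output (sigmoid value in (0, 1)), y: true label (0 or 1)."""
    if y == 1:
        return -np.log(f)        # loss -> 0 as f -> 1, loss -> infinity as f -> 0
    else:
        return -np.log(1 - f)    # loss -> 0 as f -> 0, loss -> infinity as f -> 1

# A confident, correct prediction has low loss; a confident, wrong one is heavily penalized
print(logistic_loss_piecewise(0.9, 1))   # ~0.105
print(logistic_loss_piecewise(0.9, 0))   # ~2.303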
Visualizing the Cost
The final Cost function, $J(\mathbf{w},b)$, is the average of these losses over the training set:

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$$
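Continuing the sketch above (and still assuming the hypothetical logistic_loss_piecewise helper), the cost is simply the mean of the per-example losses:
f_preds = np.array([0.1, 0.2, 0.3, 0.8, 0.85, 0.9])   # example sigmoid outputs for 6 training examples
y_true  = np.array([0,   0,   0,   1,   1,    1])      # corresponding true labels
cost = np.mean([logistic_loss_piecewise(f, y) for f, y in zip(f_preds, y_true)])
print(cost)   # average loss over the training set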

Concept Check
Question: Why is the squared error cost not used in logistic regression?
- A) The non-linear nature of the model results in a “wiggly”, non-convex cost function with many potential local minima.
- B) The mean squared error is used for logistic regression.
Correct Answer: A
Explanation: Using mean squared error for logistic regression creates a "non-convex" cost function, making it difficult for gradient descent to find the optimal parameters $w$ and $b$.
Lab: Logistic Loss Exploration
In this section, we explore why squared error is unsuitable and examine the logistic loss function.
1. The "Soup Bowl" of Linear Regression
Recall that for Linear Regression, we used the squared error cost function:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

This function is convex (shaped like a soup bowl), so gradient descent can always follow the derivative downhill to the single global minimum.
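As a self-contained illustration of that bowl shape (separate from the lab's plotting helpers), the following sketch evaluates the squared error cost of a simple linear model f(x) = w*x over a range of w values:
import numpy as np
import matplotlib.pyplot as plt

# Tiny 1-D regression dataset where y = 2x, and a linear model f(x) = w*x (b held at 0)
x = np.array([0., 1., 2., 3., 4., 5.])
y = 2 * x

w_range = np.linspace(-2., 6., 200)
cost = [np.mean((w * x - y) ** 2) / 2 for w in w_range]   # squared error cost J(w)

plt.plot(w_range, cost)
plt.xlabel("w"); plt.ylabel("J(w)")
plt.title("Squared error cost of a linear model: a single 'soup bowl' minimum")
plt.show()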

2. Trying Squared Error on Logistic Data
If we apply the squared error formula to Logistic Regression (where $f_{w,b}(x)$ is the sigmoid function), the surface changes.
import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
from plt_logistic_loss import plt_logistic_cost, plt_two_logistic_loss_curves, plt_simple_example
from plt_logistic_loss import soup_bowl, plt_logistic_squared_error
plt.style.use('./deeplearning.mplstyle')
# Training Data
x_train = np.array([0., 1, 2, 3, 4, 5], dtype=np.longdouble)
y_train = np.array([0, 0, 0, 1, 1, 1], dtype=np.longdouble)
# Visualize categorical data
plt_simple_example(x_train, y_train)
Now, let's plot the cost surface using the squared error cost:

$$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

where

$$f_{w,b}(x^{(i)}) = \text{sigmoid}(w x^{(i)} + b)$$
plt.close('all')
plt_logistic_squared_error(x_train,y_train)
plt.show()
Observation: The surface is not smooth like the 'soup bowl'! It has plateaus and local minima. This confirms why squared error is bad for logistic regression.
3. The Logistic Loss Function Curves
Logistic Regression requires a loss function suited for categorization (where the target $y$ is 0 or 1).
- Loss: Measure of difference for a single example.
- Cost: Measure of losses over the entire training set.
The logistic loss is defined as:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = \begin{cases} -\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 1 \\ -\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$$
This creates two curves:
plt_two_logistic_loss_curves()
When these per-example losses are aggregated into the Cost function, the result is convex (bowl-shaped), which is ideal for Gradient Descent.
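The plt_two_logistic_loss_curves helper draws these plots for you; as a self-contained alternative, a minimal sketch using plain matplotlib looks like this:
import numpy as np
import matplotlib.pyplot as plt

f = np.linspace(0.001, 0.999, 500)       # model output (sigmoid value), avoiding log(0)

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 3))
ax0.plot(f, -np.log(f))                  # loss when the true label y = 1
ax0.set_title("y = 1: loss = -log(f)")
ax1.plot(f, -np.log(1 - f))              # loss when the true label y = 0
ax1.set_title("y = 0: loss = -log(1 - f)")
for ax in (ax0, ax1):
    ax.set_xlabel("f (prediction)")
    ax.set_ylabel("loss")
plt.tight_layout()
plt.show()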
Simplified Cost Function
Writing code with if/else statements for every data point is inefficient. We can compress the two cases into a single mathematical equation.
The Combined Equation

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$$

Why does this work?

- If $y^{(i)} = 1$: The second term becomes 0. The equation simplifies to $-\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$ (matches the $y^{(i)} = 1$ case).
- If $y^{(i)} = 0$: The first term becomes 0. The equation simplifies to $-\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$ (matches the $y^{(i)} = 0$ case; see the check below).
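A quick numeric sanity check (an illustrative sketch, not part of the lab) confirms that the combined form reproduces both piece-wise cases:
import numpy as np

def loss_combined(f, y):
    # Single-equation logistic loss: -y*log(f) - (1-y)*log(1-f)
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

f = 0.7                                        # some model output
print(loss_combined(f, 1), -np.log(f))         # y = 1: both equal -log(f)
print(loss_combined(f, 0), -np.log(1 - f))     # y = 0: both equal -log(1-f)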

The Global Cost Function
The total cost is the average over all examples:

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right]$$
Key Properties:
- Maximum Likelihood Estimation: The cost is derived from a statistical principle; minimizing it finds the parameters under which the observed labels are most probable.
- Convexity: It ensures a single global minimum.
Quiz: Simplified Loss
Question: If the target $y^{(i)} = 1$, what does the simplified expression $-y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$ simplify to?
- A) $-\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
- B) $-\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
Correct Answer: A
Derivation: Substitute $y^{(i)} = 1$. The second term becomes zero because $(1 - y^{(i)}) = 0$. The first term remains as $-\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$.
Lab: Implementing the Cost Function
Let's implement this in Python.
Dataset:
$X$ has shape $(m, n) = (6, 2)$: 6 training examples, each with 2 features.
$y$ has shape $(6,)$: one binary label per example.
import numpy as np
# %matplotlib widget
import matplotlib.pyplot as plt
from lab_utils_common import plot_data, sigmoid, dlc
plt.style.use('./deeplearning.mplstyle')
X_train = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
fig,ax = plt.subplots(1,1,figsize=(4,4))
plot_data(X_train, y_train, ax)
ax.axis([0, 4, 0, 3.5])
ax.set_ylabel('$x_1$', fontsize=12)
ax.set_xlabel('$x_0$', fontsize=12)
plt.show()

Python Implementation
The algorithm loops over all examples, calculating the loss for each and accumulating the total cost.
def compute_cost_logistic(X, y, w, b):
    """
    Computes cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        z_i = np.dot(X[i], w) + b                                    # linear combination for example i
        f_wb_i = sigmoid(z_i)                                        # model prediction (probability)
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)     # logistic loss for example i
    cost = cost / m                                                  # average loss over all examples
    return cost

Testing the function:
w_tmp = np.array([1,1])
b_tmp = -3
print(compute_cost_logistic(X_train, y_train, w_tmp, b_tmp))

Expected Output: 0.3668667864055175
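As an aside (not part of the original lab), the same cost can also be computed without an explicit loop using vectorized NumPy operations; a minimal sketch, assuming the same sigmoid helper imported above:
def compute_cost_logistic_vectorized(X, y, w, b):
    # Compute all m predictions at once: X @ w + b has shape (m,)
    f_wb = sigmoid(X @ w + b)
    # Average logistic loss over all m examples
    return np.mean(-y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb))

print(compute_cost_logistic_vectorized(X_train, y_train, w_tmp, b_tmp))  # should match the loop version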
Comparing Parameters
Let's verify that the cost function correctly identifies the worse model.
- Model 1: $w = [1, 1]$, $b = -3$ (visual fit: good)
- Model 2: $w = [1, 1]$, $b = -4$ (visual fit: bad)

w_array1 = np.array([1,1])
b_1 = -3
w_array2 = np.array([1,1])
b_2 = -4
print("Cost for b = -3 : ", compute_cost_logistic(X_train, y_train, w_array1, b_1))
print("Cost for b = -4 : ", compute_cost_logistic(X_train, y_train, w_array2, b_2))Results:
- Cost for $b = -3$: 0.367 (lower cost)
- Cost for $b = -4$: 0.504 (higher cost)
The cost function accurately reflects that Model 1 is better.
Practice Quiz
Question 1

In this lecture series, "cost" and "loss" have distinct meanings. Which one applies to a single training example?
- Loss
- Cost
- Both Loss and Cost
- Neither Loss nor Cost
Correct Answer: Loss
Note: Loss applies to a single example. Cost applies to the average of losses over the entire training set.
Question 2

For the simplified loss function, if the label $y^{(i)} = 0$, then what does the expression $-y^{(i)} \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$ simplify to?
- A) $-\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
- B) $\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
- C) $-\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
- D) $\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$
Correct Answer: C
Derivation: Substitute $y^{(i)} = 0$.
- The first term becomes $-0 \cdot \log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) = 0$.
- The second term becomes $-(1 - 0)\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$.
- Final Result: $-\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right)$.