
Neural Networks and Initialization

In the context of Univariate Logistic Regression with one input layer and one output layer, which activation function σ(z) is defined as 1/(1 + exp(-z))?
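As a refresher for this question, the function being described can be sketched in NumPy (an illustration, not part of the quiz itself):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5 — the curve is centered at z = 0
```

Note that sigmoid(0) = 0.5, which is exactly why 0.5 is the natural classification threshold in the later questions.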
According to the lecture slides, what distinguishes a "Shallow" Neural Network from a standard "Machine Learning" (Logistic Regression) model?
If a neural network has an input layer a^[0], a hidden layer a^[1], and an output layer a^[2], which layer represents the final prediction?
In a Multivariate Logistic Regression model, if the output activation a^[1] ≥ 0.5, what is the predicted class ŷ?
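The thresholding step this question refers to can be sketched as follows (w, b, and x here are made-up example values, not from the slides):

```python
import numpy as np

def predict(w, b, x):
    """Multivariate logistic regression: sigmoid activation, then 0.5 threshold."""
    a = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # output activation a^[1]
    return 1 if a >= 0.5 else 0                    # predicted class ŷ

w = np.array([0.5, -0.25])  # hypothetical weights
b = 0.1                     # hypothetical bias
print(predict(w, b, np.array([2.0, 1.0])))  # activation ≈ 0.70 ≥ 0.5 → 1
```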
Consider a Shallow Neural Network with 3 input neurons (n_x = 3) and 2 hidden neurons (n_h = 2). According to the matrix form presented in the slides, what are the dimensions of the weight matrix W^[1]?
Following the previous question (3 inputs, 2 hidden neurons), if the output layer has 1 neuron (n_o = 1), what are the dimensions of the weight matrix W^[2]?
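A quick way to check your answer to these two dimension questions is to build the matrices and multiply them. This sketch assumes the common convention where W^[l] has one row per unit in layer l and one column per unit in layer l−1, so that z^[l] = W^[l] a^[l−1] + b^[l] is well-defined:

```python
import numpy as np

n_x, n_h, n_o = 3, 2, 1  # sizes from the question

W1 = np.zeros((n_h, n_x))  # W^[1]: maps 3 inputs to 2 hidden units
W2 = np.zeros((n_o, n_h))  # W^[2]: maps 2 hidden units to 1 output

x = np.zeros((n_x, 1))  # one input as a column vector
a1 = W1 @ x             # hidden activations: shape (2, 1)
a2 = W2 @ a1            # output activation: shape (1, 1)
print(W1.shape, W2.shape, a2.shape)  # (2, 3) (1, 2) (1, 1)
```

If the shapes were wrong, the matrix products above would raise a dimension-mismatch error, which makes this a handy sanity check.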
Why is it recommended to initialize weights randomly rather than setting them all to zero?
Which Python/Numpy command is suggested in the slides to initialize weights with small random numbers from a Gaussian distribution?
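For context on this question, a typical small-Gaussian initialization looks like the following (the 2×3 layer size is borrowed from the earlier questions; the 0.01 scale is the commonly suggested value):

```python
import numpy as np

# Small random Gaussian weights break the symmetry that all-zero
# initialization would create between hidden units.
W1 = np.random.randn(2, 3) * 0.01  # small values keep sigmoid/tanh away from saturation
b1 = np.zeros((2, 1))              # biases can safely start at zero
print(W1)
```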
What is the primary purpose of setting a random seed (e.g., torch.manual_seed(seed)) in neural network training?
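The same idea behind torch.manual_seed can be demonstrated with NumPy: fixing the seed makes the "random" weight initialization reproducible across runs, which is what makes experiments comparable.

```python
import numpy as np

np.random.seed(42)
w_first = np.random.randn(2, 3) * 0.01

np.random.seed(42)           # re-seeding replays the same sequence
w_second = np.random.randn(2, 3) * 0.01

print(np.array_equal(w_first, w_second))  # True — identical draws
```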
In the context of the slides, when is a hidden layer particularly useful?