Reading Presentation for Final Homework
Presentation file, questions and answers
2016/5/17 - 5/31
- 5/31: Neural Networks and Deep Learning, Michael Nielsen, 2015.
- Ch 4: A visual proof that neural nets can compute any function. Presenter: Kirito
- Question 1 (Jonathan): In your examples, most approximations seem very step-like. So my question is: based on your knowledge, if I wanted to get a smoother (better) approximation of a function, what would be the best approach? More layers? More neurons?
- Answer: I think both more layers and more neurons can give a smoother approximation of a function, though they may make the computation more complicated or slower. As far as I know, with more hidden layers we may be able to achieve the same results using fewer neurons overall. However, increasing the number of hidden neurons typically gives a better approximation directly, so I think adding more neurons is the better choice.
- Question 2 (Daniel): If a shallow network and a deep network can get the same result, which one is better?
- Answer: Shallow neural networks usually have only one hidden layer, while deep neural networks have several hidden layers. Deep neural networks with the right architectures achieve better results than shallow ones of the same computational power, mainly because the deep models can build better features than shallow models by using their intermediate hidden layers. But if both can get the same result, we can appeal to Occam's Razor: simpler theories are preferable to more complex ones because they are more testable. So I think the shallow network is better in this case.
- Question 3 (Tommy): Can neural networks compute discontinuous functions?
- Answer: If a function is discontinuous, we generally do not use a neural network to approximate it, because neural networks compute continuous functions of their input. However, even when we really want to compute a discontinuous function, a continuous approximation is often good enough. In that case a neural network can effectively compute the discontinuous function, so this is not usually an important limitation.
- Question 4 (Lolly): Besides increasing the number of hidden layers, are there other ways to approximate functions more closely?
- Answer: I mentioned that the step function is a good approximation, but it is only an approximation; in fact there is a narrow window of failure around each step. We can use one set of hidden neurons to compute an approximation to half of our original goal function, and another set of hidden neurons to do the same thing, but with the bases of the bumps shifted by half the width of a bump. Adding up the two approximations gives an overall approximation that still fails in small windows, but the problem is much smaller than before. We could do even better by adding up more overlapping approximations to the function, and the result would be an excellent overall approximation. (A runnable sketch of this construction follows.)
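A minimal numpy sketch of the construction from Nielsen's Ch 4 (not from the slides; the target function, the bump count, and the large weight that makes each sigmoid nearly a step are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow warnings with the very steep weights below
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def bump_approximation(f, x, edges):
    """One hidden layer of sigmoid pairs: each pair of steep sigmoids
    forms a rectangular 'bump' (a step up, then a step down) whose
    height samples f at the bump's centre."""
    w = 1000.0  # large weight -> each sigmoid is nearly a step function
    out = np.zeros_like(x)
    for left, right in zip(edges[:-1], edges[1:]):
        height = f((left + right) / 2)
        out += height * (sigmoid(w * (x - left)) - sigmoid(w * (x - right)))
    return out

f = lambda x: 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x)  # illustrative target
x = np.linspace(0, 1, 1000)

a1 = bump_approximation(f, x, np.linspace(0.0, 1.0, 21))
# Same construction, bump bases shifted by half a bump width:
a2 = bump_approximation(f, x, np.linspace(-0.025, 1.025, 22))
combined = 0.5 * (a1 + a2)  # i.e. the sum of two half-height approximations

print("mean error, one grid      :", np.abs(a1 - f(x)).mean())
print("mean error, two overlapped:", np.abs(combined - f(x)).mean())
```

Each approximation fails in narrow windows around its own step edges, but the two grids place their edges in different places, so averaging them shrinks the error in those windows.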
- Ch 5: Why are deep neural networks hard to train? Presenter: Daniel
- Question 1 (Steffi): Can the vanishing gradient problem be solved by initializing the network with unsupervised pretraining?
- Supplements (Steffi): X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
- Answer (Daniel): I think yes. We can address the vanishing gradient problem by initializing the network with unsupervised pre-training. If we pre-train before we train the deep network, the variance of the gradients is normalized, and the gradients in the different layers become similar in size. So I think it can solve the vanishing gradient problem.
- Question 2 (Kirito): Could you give me an example or method to make deep training a lot easier?
- Answer (Daniel): There are many methods that can address the vanishing gradient problem. We can use unsupervised pre-training to initialize our neural network, and we can also change the activation function.
- Question 3 (Tommy): How can we solve the vanishing gradient problem?
- Answer (Daniel): To solve the vanishing gradient problem, we must balance the gradients (the learning speeds) of the different hidden layers. The gradient in an early layer depends on a product of weights and activation-function derivatives from the later layers, so we can change the activation function, for example to ReLU, to solve it. (The sketch below shows why the sigmoid's derivative causes the problem.)
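A minimal numpy sketch of the argument from Nielsen's Ch 5 (the ten-layer figure is an illustrative choice): the sigmoid's derivative never exceeds 1/4, and backpropagation multiplies one such factor per layer, so gradients shrink geometrically; the ReLU's derivative is 1 for positive inputs, so those factors do not shrink anything.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-10, 10, 10001)
print("max sigmoid'(z) =", sigmoid_prime(z).max())  # 0.25, reached at z = 0

# The gradient reaching an early layer contains one w * sigmoid'(z)
# factor for every later layer.  With weights around 1, each factor is
# at most ~0.25, so across 10 layers the gradient shrinks by roughly:
print("0.25 ** 10 =", 0.25 ** 10)  # ~1e-6: the early layers barely learn

# ReLU'(z) = 1 for z > 0, so with ReLU these factors pass gradients
# through unchanged, which is one reason it eases deep training.
```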
- Question 4 (Lolly): Is it possible for different hidden layers to have different numbers of neurons?
- Answer (Daniel): Yes. The number of neurons in each hidden layer depends on the problem you want to solve, so it's possible.
- Ch 6: Deep learning. Presenters: Tommy, Lolly
- Question 1 (Steffi): It is possible to produce images that are totally unrecognizable to humans, but that the network classifies as belonging to a category of familiar objects. What about the generality of deep neural networks?
- Supplements (Steffi): A. Nguyen, J. Yosinski, and J. Clune, "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images," Computer Vision and Pattern Recognition (CVPR), IEEE, 2015. arXiv:1412.1897v4.
- Answer (Tommy, late reply, 6/8): Deep learning is modeled on humans, so I think it should be impossible for the network to classify an image that humans cannot recognize.
- Question 2 (Kirito): Why is the output at one time step connected to the input at the next time step in recurrent neural networks?
- Answer (Lolly, late reply, 6/8): Connecting the output at one time step to the input at the next gives the recurrent neural network temporal correlation, i.e., a form of memory.
- Question 3 (Daniel): Why do we need to use pooling layers?
- Answer (Tommy, late reply, 6/8): Because we need to subsample: the pooling layer condenses each feature map coming out of the convolutional layer, so later layers work with much less data. (A small example follows.)
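A minimal numpy sketch of 2x2 max pooling, the common choice discussed in Nielsen's Ch 6 (the feature map here is illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the strongest activation in
    each 2x2 block, shrinking the feature map by a factor of 4."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]  # drop odd edges
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)  # a toy 4x4 feature map
print(max_pool_2x2(fm))
# [[ 5.  7.]
#  [13. 15.]]
```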
- Question 4 (Spencer): Can we do better than these results using a deeper network architecture?
- Answer (Lolly, late reply, 6/8): Yes. A deeper architecture with dropout gets a better result: 99.6% accuracy, versus 97.8% for the shallow architecture with a single hidden layer.
- The presenters are requested to answer corresponding questions before 6/1.
- 5/24: Neural Networks and Deep Learning, Michael Nielsen, 2015.
- Ch 1: Using neural nets to recognize handwritten digits. Presenter: Jacky
- Question 1 (Spencer): Can you provide an interpretation of what gradient descent is doing in the one-dimensional case?
- Answer (Jacky): Let's use a graph to explain this: a parabola opening upward, with the x-axis as the weight and the y-axis as the cost. Gradient descent starts from a randomly chosen point; at each step it computes the change ΔC ≈ (dC/dw)Δw and picks Δw = −η dC/dw so that ΔC is negative, giving a new cost C′ = C + ΔC, and it repeats this until it finds the lowest point of the parabola. (A runnable sketch follows.)
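A minimal sketch of that picture, assuming an illustrative parabola C(w) = (w − 3)², starting point, and learning rate:

```python
# One-dimensional gradient descent on C(w) = (w - 3)**2, minimum at w = 3.
def C(w):
    return (w - 3) ** 2

def dC_dw(w):
    return 2 * (w - 3)

w = -4.0   # randomly chosen starting point
eta = 0.1  # learning rate
for step in range(50):
    w -= eta * dC_dw(w)  # delta_w = -eta * dC/dw, so the cost always drops

print(w, C(w))  # w is now very close to 3, the bottom of the parabola
```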
- Question 2 (Cindy): How does the learning rate influence learning?
- Answer (Jacky): The learning rate can't be too large, and it can't be too small. When the learning rate is too small, training takes a long time because each step moves only a tiny distance, and when there is a local minimum it will probably just stop there, which is not the best cost we want. When the learning rate is too large, training is faster than with a small learning rate, and it can jump over a local minimum rather than stopping there; but the problem is that when the cost is already near the global minimum, it can jump over to the other side, and it will never reach the lowest point because every jump covers too much distance. (The sketch below compares three learning rates.)
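Extending the parabola sketch above with three illustrative learning rates:

```python
# Same parabola C(w) = (w - 3)**2; each step multiplies the distance to
# the minimum by (1 - 2 * eta), so eta > 1 makes each step overshoot
# further and further instead of converging.
for eta in (0.001, 0.1, 1.1):
    w = -4.0
    for _ in range(50):
        w -= eta * 2 * (w - 3)
    print(f"eta = {eta}: w = {w:.3f}")
# eta = 0.001 -> w ~ -3.3  (too slow: still far from 3 after 50 steps)
# eta = 0.1   -> w ~  3.0  (converged)
# eta = 1.1   -> |w| huge  (diverged: every step jumps past the minimum)
```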
- Question 3 (Karissa): Why use the quadratic cost?
- Answer (Jacky): The number of training inputs classified correctly is not a smooth function of the weights and biases: a small change to the weights and biases usually causes no change at all in how many inputs are classified correctly, so it is difficult to figure out how to change the weights and biases to get better performance. We use the quadratic cost because it is a smooth cost function, which makes it much easier to figure out how small changes in the weights and biases improve performance. (The toy demo below makes this concrete.)
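A toy demonstration of that point, assuming a single sigmoid neuron on made-up one-dimensional data (all values illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.1, 0.3, 0.45, 0.55, 0.7, 0.9])  # toy inputs
y = (x > 0.5).astype(float)                      # true labels

for w in (4.0, 4.01):  # nudge the weight slightly
    a = sigmoid(w * (x - 0.5))                   # the neuron's outputs
    n_correct = int(((a > 0.5) == (y > 0.5)).sum())
    quad_cost = 0.5 * np.mean((a - y) ** 2)
    print(f"w = {w}: correct = {n_correct}/6, quadratic cost = {quad_cost:.6f}")
# The count of correct classifications doesn't move at all, but the
# smooth quadratic cost does, so it can guide gradient descent.
```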
- Question 4 (Dioxin): How does stochastic gradient descent speed up the process?
- Answer (Jacky): Stochastic gradient descent works by estimating ∇C: randomly pick a small sample of training inputs, compute the gradient for each chosen input, and average all the computed gradients to get the estimate of ∇C we want. The estimate is quick to compute, and it still moves us in the general direction of the minimum. (A toy illustration follows.)
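A toy numpy illustration of the estimate, assuming made-up per-example gradient values (in a real network each would come from backpropagation on one training input):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend gradients of the cost w.r.t. one weight, one per training input.
n = 10_000
per_example_grads = rng.normal(loc=0.5, scale=2.0, size=n)

full_gradient = per_example_grads.mean()  # exact, but touches all n inputs
mini_batch = rng.choice(per_example_grads, size=32, replace=False)
sgd_estimate = mini_batch.mean()          # cheap: only 32 inputs

print(f"true gradient: {full_gradient:.3f}, SGD estimate: {sgd_estimate:.3f}")
# Close enough to move in roughly the right direction, at a tiny
# fraction of the cost of the full computation.
```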
- Ch 2: How the backpropagation algorithm works. Presenter: Cindy
- Question 1 (Spencer): What situations will make a neuron learn slowly?
- Answer (Cindy): There are two main cases. First, if the input activation a_in in the fourth fundamental equation is small (close to zero), then the partial derivative ∂C/∂w will also tend to be small, and the weight changes little during gradient descent; we can say that a weight output from a low-activation neuron learns slowly. Second, when σ(z) approaches 0 or 1, σ′(z) approaches 0, so the weights in the final layer learn slowly; we say the output neuron has saturated. (The equations are written out below.)
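For reference, the two fundamental backpropagation equations the answer leans on, in Nielsen's notation:

```latex
% Output-layer error (BP1) and the weight gradient (BP4):
\delta^L = \nabla_a C \odot \sigma'(z^L), \qquad
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j
% BP4 shows both slow-learning cases: the gradient is small when the
% input activation a^{l-1}_k is small, and (via BP1) when saturation
% drives sigma'(z^L) toward zero.
```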
- Question 2 (Jacky): Why use the matrix-based form of the fundamental equations?
- Answer (Cindy): Rather than stopping at the initial component-wise form, we take one more step and rewrite it in matrix-based form, because that expression has a nice vector form which is easy and convenient to compute.
- Question 3 (Karissa): Why don't we use j for the input neuron and k for the output neuron in w_jk?
- Answer (Cindy): Using j for the input neuron and k for the output neuron in w_jk would really make more sense to us, since j appearing before k usually suggests input before output. But with that convention, backpropagation would need an extra step: we would have to replace the weight matrix in the equation by the transpose of the weight matrix. (See the feedforward rule below.)
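The feedforward rule that motivates the convention, as in Nielsen's Ch 2:

```latex
% With w^l_{jk} = weight from neuron k in layer l-1 to neuron j in
% layer l, activations propagate with a plain matrix product:
a^l = \sigma\left( w^l a^{l-1} + b^l \right)
% Swapping the roles of j and k would force a transpose, w^T, into
% this and every related equation.
```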
- Question 4 (Dioxin): Why is the algorithm called backpropagation?
- Answer (Cindy): The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b, and it is called "back"-propagation because it computes the error vectors backward, starting from the final layer.
- Ch 3: Improving the way neural networks learn. Presenters: Karissa, Dioxin
- Question 1A (Spencer): When should we use the cross-entropy instead of the quadratic cost?
- Answer (Karissa): If you use the quadratic cost function for learning and find the result is poor because learning slows down, then you can use the cross-entropy to solve the problem. (Its form and gradient are written out below.)
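For reference, the cross-entropy cost from Nielsen's Ch 3 and the property that removes the slowdown:

```latex
% Cross-entropy cost for a sigmoid output neuron (Eq. 57):
C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right]
% Its weight gradient contains no sigma'(z) factor (Eq. 61):
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \, (\sigma(z) - y)
% so a saturated neuron still learns quickly, unlike with the
% quadratic cost, whose gradient carries a sigma'(z) factor.
```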
- Question 1B (Spencer): Why can regularization suppress overfitting?
- Answer (Dioxin): It makes all the weights smaller on average, which decreases the influence of a few critical error points and reduces the effect of bad features the network has learned.
- Question 2 (Jacky): Why use λ/2n in the second term of the regularized cost function? Eq. (85)
- Answer (Dioxin): The n normalizes the sum of the weights by the size of the training set, and the 2 makes the derivative come out cleanly; λ is chosen by the designer of the network according to which result is preferred. With a small λ we mostly minimize the original cost function, while a large λ pushes toward small weights. (See the equations below.)
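Written out, Eq. (85) and its gradient show why both factors are convenient:

```latex
% L2-regularized cost (Nielsen, Ch 3, Eq. 85) and its weight gradient:
C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \qquad
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w
% The 2 cancels on differentiation, and the update rule becomes
%   w -> (1 - eta * lambda / n) w - eta * dC_0/dw,
% i.e. the weights decay toward zero unless the data gradient resists.
```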
- Question 3 (Cindy): Why is there a negative sign in formula (57)? And why did you highlight it?
- Answer (Karissa): We want to choose a cost function that satisfies the same desirable properties as the quadratic cost while avoiding the learning slowdown. Since ln a and ln(1−a) are negative for outputs a between 0 and 1, putting a minus sign in front makes the cross-entropy always non-negative, as a cost should be.
- Question 4 (Jonathan): In what way does regularization solve over-fitting? Please elaborate.
- Answer (Dioxin): Regularization helps an overfitting network trained with limited data: it keeps the weights small on average, so bad features learned from noise in the data won't have a significant effect.
- The presenters are requested to answer corresponding questions before 5/25.
- 5/17: A Tutorial on Deep Learning, Quoc V. Le (Research Scientist, Google), 2015/10. (video)
- Part 1: Nonlinear Classifiers and The Backpropagation Algorithm. Presenter: Jonathan
- Question 1 (Steffi): Why are rectified linear units used as the activation function in place of the sigmoid function?
- Answer (Jonathan): It has been observed in multiple studies that using the ReLU as the activation function gives better performance than using the sigmoid function. While the reason is still an open question, it is believed to lie in the derivative of the ReLU function: the ReLU's derivative does not shrink gradients the way the sigmoid's does, which lowers the chance of the vanishing gradient problem. (The derivatives are compared after the reference.)
- Reference: Chapter 5, "Why are deep neural networks hard to train?", in "Neural Networks and Deep Learning," by Michael Nielsen, 2015.
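The comparison of derivatives behind that belief:

```latex
% The sigmoid's derivative is bounded well below 1:
\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \le \tfrac{1}{4}
% while the ReLU's derivative is exactly 1 on half its domain:
\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}
% Backpropagation multiplies one activation-derivative factor per
% layer, so sigmoid factors (at most 1/4) shrink the gradient layer
% by layer, while ReLU factors of 1 pass it through unchanged.
```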
- Question 2 (Spencer): What would happen if we did not use stochastic gradient descent to minimize the function?
- Answer (Jonathan): The training of a neural network depends largely on the gradient descent algorithm. As of today, even though there are multiple ways of training a neural network, most of them are variants of gradient descent. Back to your question: if we didn't use stochastic gradient descent we could use other training methods, but if we stopped using any gradient descent algorithm at all, we'd basically end up with a randomly generated network.
- Reference: Chapter 3 Improving the way neural networks learn, in "Neural Networks and Deep Learning," by Michael Nielsen, 2015.
- Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks. Presenter: Steffi
- Question 1 (Jonathan): What is the difference between Google DistBelief and TensorFlow?
- Answer (Steffi): TensorFlow is a second-generation system. It acts as an interface for expressing machine learning algorithms and an implementation for executing such algorithms, whereas Google DistBelief is the first-generation scalable distributed training and inference system. Deep neural networks have been deployed with DistBelief in a variety of products that we use in our day-to-day lives, including Google Search, Google Photos, Google Maps and Street View, Google Translate, YouTube, and many others.
- Reference: TensorFlow: Large-scale machine learning on heterogeneous systems, arXiv, 2016.
- Question 2 (Spencer): Can we do something to reduce the network traffic problem?
- Answer (Steffi): Yes, autoencoders can be used to reduce network traffic. An autoencoder encodes the input, sends the compact code to the cloud, and decodes (reconstructs) the input there. This works for high-dimensional inputs as well, and the compression is what reduces the network traffic. (A minimal sketch follows.)
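A minimal numpy sketch of that idea, assuming a linear autoencoder trained on synthetic low-rank data; all dimensions and training settings are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_code = 32, 4                        # compress 32 numbers to 4
W_enc = rng.normal(0, 0.1, (n_code, n_in))  # encoder weights
W_dec = rng.normal(0, 0.1, (n_in, n_code))  # decoder weights

# Synthetic data lying in a 4-dimensional subspace, so it is compressible.
basis = rng.normal(size=(n_code, n_in)) / np.sqrt(n_in)
X = rng.normal(size=(200, n_code)) @ basis

eta = 0.1
for _ in range(3000):
    code = X @ W_enc.T      # encode: (200, 4) -- all we'd need to transmit
    recon = code @ W_dec.T  # decode: (200, 32) -- rebuilt at the receiver
    err = recon - X
    # gradient descent on the mean squared reconstruction error
    grad_dec = err.T @ code / len(X)
    grad_enc = (err @ W_dec).T @ X / len(X)
    W_dec -= eta * grad_dec
    W_enc -= eta * grad_enc

print("reconstruction MSE:", (err ** 2).mean())
print(f"traffic per example: {n_code} numbers instead of {n_in}")
```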
- The presenters are requested to answer corresponding questions before 5/18.