A model with two or more explanatory variables is a multiple linear regression. This term is distinct from multivariate linear regression, which predicts multiple correlated dependent variables rather than a single dependent variable.
\[\mathit f_{w,b}(x) = w_{1} x_{1} + w_{2} x_{2} + ... + w_{n} x_{n} + b\]
Vectors:
\[\vec{w} = [w_{1}\ w_{2}\ w_{3}\ ...\ w_{n}]\] \[\vec{x} = [x_{1}\ x_{2}\ x_{3}\ ...\ x_{n}]\] \[\mathit f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_{1} x_{1} + w_{2} x_{2} + ... + w_{n} x_{n} + b\]
$x_{1}^{(4)}$ refers to the first feature (first column in the table) of the fourth training example (fourth row in the table).
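For instance, if the training set is stored as a 2-D NumPy array (the array name and the numbers below are made up for illustration), this element is simply the zero-based entry at row 3, column 0:

import numpy as np

# Hypothetical 4 x 3 training set: one row per training example, one column per feature
X_train = np.array([[2104, 5, 45],
                    [1416, 3, 40],
                    [1534, 3, 30],
                    [ 852, 2, 36]])

# x_1^(4): first feature of the fourth training example.
# NumPy indexing is zero-based, so this is row 3, column 0.
print(X_train[3, 0])   # 852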
Mathematical representation without vectorisation
\[\mathit f_{\vec{w},b}(\vec{x}) = \left( \sum_{j=1}^n w_{j} x_{j} \right) + b\]
Python code without vectorisation:
f = 0
for j in range(n):          # loop over all n features
    f = f + w[j] * x[j]
f = f + b                   # add the intercept term
Mathematical representation with vectorisation:
\[\mathit f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b\]
Python code using the NumPy dot function:
import numpy as np
f = np.dot(w, x) + b        # vectorised: one call computes the whole sum
The NumPy dot function uses parallel computation to accelerate the calculation, both on CPU and GPU.
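A rough timing sketch of that difference (the vector length of one million and the use of time.perf_counter are assumptions made here for illustration, not figures from these notes):

import time
import numpy as np

n = 1_000_000                      # assumed vector length, just for the comparison
w = np.random.rand(n)
x = np.random.rand(n)
b = 4.0

start = time.perf_counter()        # explicit loop
f_loop = 0.0
for j in range(n):
    f_loop = f_loop + w[j] * x[j]
f_loop = f_loop + b
loop_time = time.perf_counter() - start

start = time.perf_counter()        # vectorised dot product
f_dot = np.dot(w, x) + b
dot_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f} s   np.dot: {dot_time:.4f} s")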
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
When different features take on very different ranges of values, gradient descent can run slowly; rescaling the features so that they all take on comparable ranges of values can speed up gradient descent significantly.
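A minimal sketch of two common ways to rescale, assuming the features are stored column-wise in a NumPy array X (the numbers are made up; mean and z-score normalisation are standard options, not necessarily the only ones):

import numpy as np

# Assumed feature matrix: rows = training examples, columns = features
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [ 852.0, 2.0]])

# Mean normalisation: (x - mean) / (max - min), features end up roughly in [-1, 1]
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalisation: (x - mean) / standard deviation
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)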
How to check that gradient descent is converging:
Learning curve: assume a graph with J(w,b) (the cost/loss function, a cross-validation score, etc.) on the Y axis and the number of iterations (#iterations) on the X axis.
On each iteration, the total cost J(w,b) is expected to decrease. After many iterations, if the decrease becomes negligible and the curve flattens out, gradient descent has most likely converged; if the cost is still falling steadily, or is oscillating or increasing, you need to change the learning rate (or run more iterations).
Automatic convergence test:
Let ε (epsilon) be 0.001. If J(w,b) decreases by ≤ ε in one iteration, declare convergence (w and b are close to the global minimum of the cost).
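A minimal sketch of that test inside a gradient descent loop, assuming the mean-squared-error cost of linear regression (the ε default of 0.001 follows the notes above; the function names and everything else are illustrative). The recorded J_history is what gets plotted as the learning curve:

import numpy as np

def compute_cost(X, y, w, b):
    # J(w, b) = (1 / 2m) * sum((f_wb(x) - y)^2)
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m)

def compute_gradient(X, y, w, b):
    # Partial derivatives of J(w, b) with respect to w and b
    m = X.shape[0]
    err = X @ w + b - y
    return (X.T @ err) / m, err.sum() / m

def gradient_descent(X, y, w, b, alpha, num_iters, epsilon=0.001):
    J_history = [compute_cost(X, y, w, b)]   # learning-curve data
    for i in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        J_history.append(compute_cost(X, y, w, b))
        # Automatic convergence test: stop once J decreases by <= epsilon.
        if J_history[-2] - J_history[-1] <= epsilon:
            break
    return w, b, J_history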
With a small enough α, J(w,b) should decrease on every iteration. First select a really small learning rate and check whether the cost decreases; if it does not, there might be a coding bug. But if α is too small, the number of iterations increases, training takes longer and consumes more resources. Try…
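One way to run that first check, reusing the gradient_descent sketch above on made-up data (the candidate α values below are an assumption for illustration; the notes do not list them):

import numpy as np

rng = np.random.default_rng(0)              # synthetic data, purely for illustration
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 4.0

for alpha in [0.001, 0.01, 0.1, 1.0]:       # assumed candidate learning rates
    w0, b0 = np.zeros(X.shape[1]), 0.0
    _, _, J = gradient_descent(X, y, w0, b0, alpha, num_iters=100)
    status = "decreasing" if J[-1] < J[0] else "NOT decreasing - check alpha or the code"
    print(f"alpha={alpha}: J {J[0]:.4f} -> {J[-1]:.4f} ({status})")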
Feature engineering: using intuition to design new features, by transforming or combining original features (e.g., x1 = length, x2 = width, then x3 might be the area = x1 · x2).
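A tiny sketch of that area feature in NumPy (the column layout and the numbers are assumptions):

import numpy as np

# Assumed layout: column 0 = length (x1), column 1 = width (x2)
X = np.array([[20.0, 10.0],
              [15.0,  8.0]])

area = X[:, 0] * X[:, 1]            # new feature x3 = x1 * x2
X_eng = np.column_stack([X, area])  # append x3 as a third column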
Polynomials: Can be generated solely by addition, multiplication, and raising to the power of a positive integer.
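A minimal sketch of hand-built polynomial features from a single input x (the degrees are chosen arbitrarily for illustration); note that x, x² and x³ take on very different ranges, so feature scaling becomes even more important here:

import numpy as np

x = np.arange(1.0, 6.0)                    # a single original feature
X_poly = np.column_stack([x, x**2, x**3])  # engineered features x, x^2, x^3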