Year Lag1 Lag2 Lag3
Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
Lag4 Lag5 Volume Today
Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
Direction
Down:602
Up :648
# Create a dataframe for data browsing
sm = Smarket
# Bivariate plot of inter-lag correlations
pairs(Smarket, col=Smarket$Direction, cex=.5, pch=20)
Direction
glm.pred Down Up
Down 145 141
Up 457 507
mean(glm.pred==Direction)
[1] 0.5216
# Make training and test set for prediction
train = Year < 2005
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              data = Smarket, family = binomial, subset = train)
glm.probs = predict(glm.fit, newdata = Smarket[!train, ], type = "response")
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
Direction.2005 = Smarket$Direction[!train]
table(glm.pred, Direction.2005)
Direction.2005
glm.pred Down Up
Down 77 97
Up 34 44
Direction.2005
glm.pred Down Up
Down 35 35
Up 76 106
mean(glm.pred==Direction.2005)
[1] 0.5595238
# Check accuracy rate
106/(76+106)
[1] 0.5824176
# Can you interpret the results?
Assignment Questions
a. What is/are the requirement(s) of LDA?
Linear Discriminant Analysis is a classification method. Though it is sometimes referred to as LDA regression, like logistic "regression" it is less a regression model and more a system of classification. LDA requires a qualitative (categorical) outcome variable; it also assumes the predictors are approximately normally distributed within each class, with a common covariance matrix across the classes. OLS expects a linear, continuous relationship, but a value "in between" categories is not a reality. The example in ISLR Chapter 4 illustrates this well: when the outcomes are seizure, stroke, and drug overdose, regression cannot handle categories that are neither ordered nor binary.
b. How is LDA different from Logistic Regression?
LDA models the distribution of the predictors separately within each response class, rather than modeling the conditional distribution of the response given the predictors, as logistic regression does. It then requires an additional step that relies on Bayes' theorem to flip these class-specific distributions into estimates of the posterior class probabilities. Logistic regression and LDA can produce very similar results when the predictors are approximately normally distributed within each class. That is of course not always the case, so LDA is useful when the separation between the classes is believed to be large, when the predictors are assumed to be normal but the sample is small, and when there are more than two response classes.
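As a sketch of how LDA could be fit to the same Smarket train/test split used above (following the standard ISLR Chapter 4 lab approach; assumes the ISLR and MASS packages are installed):

```r
library(ISLR)   # Smarket data
library(MASS)   # lda()

# Same split as before: train on pre-2005, test on 2005
train <- Smarket$Year < 2005

# Model the class-conditional distributions of Lag1 and Lag2
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket[!train, ])

# Confusion matrix and accuracy on the held-out 2005 data
table(lda.pred$class, Smarket$Direction[!train])
mean(lda.pred$class == Smarket$Direction[!train])
```

Despite the different modeling approach, the resulting predictions are typically very close to those of the logistic fit on this data.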
c. What is ROC?
ROC (receiver operating characteristic) is a curve that simultaneously displays the two types of error, plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold varies. We interpret a ROC curve by examining the area under it (AUC): the closer the AUC is to 1, the better we consider the model. We can use AUC to compare models (e.g., logistic regression vs. LDA) to determine which is better.
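A minimal sketch of tracing a ROC curve by hand in base R, assuming `glm.probs` and `Direction.2005` exist from the logistic fit above:

```r
# Sweep the threshold and record TPR (sensitivity) and FPR (1 - specificity)
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t)
  mean(glm.probs[Direction.2005 == "Up"] > t))
fpr <- sapply(thresholds, function(t)
  mean(glm.probs[Direction.2005 == "Down"] > t))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve")
abline(0, 1, lty = 2)  # chance diagonal, corresponding to AUC = 0.5
```

A curve hugging the top-left corner (AUC near 1) indicates a good classifier; a curve along the dashed diagonal is no better than guessing.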
d. What is sensitivity and specificity? Which is more important in your opinion?
Sensitivity is the true positive rate: the proportion of actual positives that are correctly identified. Specificity is the true negative rate: the proportion of actual negatives that are correctly identified. In my opinion, sensitivity matters more than specificity, due to the importance of accurately diagnosing ill people rather than inaccurately diagnosing healthy people.
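Both quantities fall directly out of a 2×2 confusion matrix. A quick sketch (the counts below are made up for illustration, not from the assignment's chart):

```r
# Hypothetical 2x2 confusion matrix counts (illustrative only)
TP <- 90; FN <- 10   # actual positives: detected / missed
TN <- 80; FP <- 20   # actual negatives: correctly cleared / false alarms

sensitivity <- TP / (TP + FN)  # true positive rate  = 0.9
specificity <- TN / (TN + FP)  # true negative rate  = 0.8
c(sensitivity = sensitivity, specificity = specificity)
```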
e. From the following chart, for the purpose of prediction, which is more critical?
If the above considers diagnostic testing as it does in the ISLR text, then I would say that the true positive rate is the most critical for prediction. This is the outcome with the most impact on people, and false positives can be caught with other follow-up diagnostics. Therefore, sensitivity matters more than specificity.
f. Calculate the prediction error from the following:
Error calculation: total number of incorrect predictions over total number of predictions.
(23+252)/10,000 = 0.0275
This is pretty good, considering the error rate ranges from 0 to 1.
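The arithmetic above can be checked in R, using only the counts given (23 and 252 incorrect predictions out of 10,000 total):

```r
# Prediction error = incorrect predictions / total predictions
incorrect <- 23 + 252
total <- 10000
incorrect / total   # 0.0275
```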