..., data = tr, family = "binomial")
summary(logreg)
# (Dispersion parameter for binomial family taken to be 1)
# Null deviance: 703.68 on 536 degrees of freedom
# Residual deviance: 519.01 on 528 degrees of freedom

In the model formula, . means that all the covariates but Outcome are included as regressors (this avoids writing the formula in the standard way: Outcome ~ Pregnancies+Glucose+...).

The summary contains the parameter estimates and the corresponding p-values of the test checking \(H_0:\beta =0\) vs \(H_1: \beta\neq 0\). Consider for example the parameter of the Pregnancies variable, which is positive (i.e. higher risk of diabetes) and equal to 0.1193692: this means that for a one-unit increase in the number of pregnancies, the log-odds increases by 0.1193692. For a simpler interpretation, in the odds scale, we can take the exponential transformation of the parameter.

The data contained in the files titanic_tr.csv (for training) and titanic_te.csv (for testing) are about the Titanic disaster (the files are available in the e-learning). In particular the following variables are available:

pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
sibsp: number of siblings / spouses aboard the Titanic.
parch: number of parents / children aboard the Titanic.
embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

1. Note that the response variable is survived, which is stored as an int 0/1 variable in the dataframe. Transform it into a factor type object using the factor function with categories “No” and “Yes”.
2. Convert pclass (ticket class) to factor (1 = “1st”, 2 = “2nd”, 3 = “3rd”) for both the training and test data set. In which class do you observe the highest proportion of survivors?
3. Check how many missing values we have in the variables age and fare, separately for training and test data. Separately for the training and test data, substitute the missing values with the average of age and fare. Note that for computing the mean when you have missing values you have to run mean(.,na.rm=T). After this, check that you don’t have any other missing values.
4. Consider the training data and survived as response variable. Compute the percentage of survived people. Moreover, represent graphically the relationship between age and survived. Comment on the plot.
5. Consider the training data set and estimate a logistic model for survived considering age as the only predictor (mod1). Provide the model summary.
6. Compute manually (without using predict), by means of the formula of the logistic function, the probability of surviving for a person of 50 years old. Remember that the estimated coefficients can be retrieved by mod1$coefficients. Repeat the same for a person of 20 years old. Finally compute the odds and the odds-ratio.
7. Compute the probability of surviving for all the people in the test data set (use predict). Plot the estimated probabilities as a function of age (impose that the probability axis is between 0 and 1). Use for each point a color corresponding to the survived outcome.
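The odds-scale reading of the Pregnancies coefficient and the manual probability computation asked for in the Titanic exercise rest on the same identities. As a short worked note, assuming a model with a single predictor \(x\) and writing \(\beta_0\) and \(\beta_1\) for the intercept and slope (this notation is an assumption, it is not fixed by the text above):

\[
p(x)=\frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}},
\qquad
\text{odds}(x)=\frac{p(x)}{1-p(x)}=e^{\beta_0+\beta_1 x},
\qquad
\frac{\text{odds}(x+1)}{\text{odds}(x)}=e^{\beta_1}.
\]

For the Pregnancies estimate reported above, \(e^{0.1193692}\approx 1.127\): each additional pregnancy multiplies the odds of diabetes by about 1.127, i.e. an increase of roughly 12.7%, with the other covariates held fixed.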
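The manual computation described in the exercise (probability from the logistic function, then odds and odds-ratio) could be sketched as follows. This is only an illustration under stated assumptions: the data frame toy is simulated and is not the Titanic data, and the helper p_manual is a hypothetical name introduced here; only glm, mod1$coefficients and predict come from the text.

```r
# Simulated stand-in data (an assumption: NOT the titanic_tr.csv file)
set.seed(1)
n <- 200
toy <- data.frame(age = round(runif(n, 1, 80)))
toy$survived <- factor(rbinom(n, 1, plogis(0.5 - 0.02 * toy$age)),
                       levels = c(0, 1), labels = c("No", "Yes"))

# Logistic model with age as the only predictor
mod1 <- glm(survived ~ age, data = toy, family = "binomial")
b <- mod1$coefficients  # b[1] is the intercept, b[2] the slope for age

# Probability from the logistic function, without predict()
# (p_manual is a hypothetical helper, not part of the original notes)
p_manual <- function(age) {
  eta <- b[1] + b[2] * age           # linear predictor (log-odds)
  unname(exp(eta) / (1 + exp(eta)))  # logistic transformation
}
p50 <- p_manual(50)
p20 <- p_manual(20)

# Odds for each age and odds-ratio (50 vs 20 years old)
odds50 <- p50 / (1 - p50)
odds20 <- p20 / (1 - p20)
or_50_vs_20 <- odds50 / odds20

# Cross-check the manual probabilities against predict()
p_pred <- predict(mod1, newdata = data.frame(age = c(50, 20)),
                  type = "response")
```

With the real data, toy would be replaced by the training data frame read from titanic_tr.csv. The odds-ratio obtained this way should agree with exp(30 * b[2]), since the two ages differ by 30 years.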