It is commonly said and believed that men are taller than women, and from what we see in society this is usually the case. Before they develop, boys and girls can be of similar height, but as they grow up, males tend to grow more than females. Males tend to be taller than females because they undergo a later growth spurt and grow faster than females. Although adult males are generally taller than adult females, there are always a few rare exceptions, which makes people wonder how rare these exceptions can be.
I am interested in studying the probability of a person being male or female given their height. The data come from an article published in 2005 in the journal Clinical Anatomy, “Stature Estimation Based on Hand Length and Foot Length”, in which the sample consists of 155 adult Turks (80 male, 75 female, ages 17-23). I will use logistic regression to model the probability that a person is female (coded 1) rather than male (coded 0) based on height measured in millimeters, using the following simple logistic regression model: logit(π) = log(π / (1 − π)) = β0 + β1 × height, where π is the probability that the subject is female.
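The maximum likelihood estimates obtained in R are β0 = 78.425 and β1 = -0.047, meaning that the fitted model is logit(π) = 78.425 - 0.047 × height, where π is the predicted probability that the subject is female and height is in millimeters.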
Based on this formula, the predicted probability of the subject being female decreases as the subject’s height increases. The plot below shows the fitted model, with the predicted probability of being female falling as height increases:
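The fitted curve can be sketched numerically from the two reported estimates. The original analysis was carried out in R; the Python below is only an illustration, and the function name `prob_female` and the chosen heights are assumptions made for this example. Because the coefficients are rounded to three decimals, the probabilities are approximations of what the full-precision model would give.

```python
import math

# Rounded coefficients as reported in the text (female = 1, height in mm).
B0 = 78.425
B1 = -0.047


def prob_female(height_mm: float) -> float:
    """Predicted probability that a subject of the given height is female."""
    logit = B0 + B1 * height_mm
    # Inverse logit: converts log-odds to a probability.
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative heights only; these values are not taken from the data set.
for h in (1550, 1600, 1650, 1700, 1750, 1800):
    print(f"{h} mm: P(female) = {prob_female(h):.3f}")
```

The printed probabilities fall steadily as height rises, which is the pattern described for the fitted model.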
Goodness of fit is assessed with the Hosmer-Lemeshow chi-square test. The test statistic is 4.439 with 8 degrees of freedom, giving a p-value of 0.8155. We therefore fail to reject the null hypothesis that the model fits the data, so there is no evidence of lack of fit and the model appears to be adequate.
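The p-value can be checked directly from the test statistic and its degrees of freedom. The sketch below uses the closed-form upper tail of the chi-square distribution that holds when the degrees of freedom are even, so it needs only the standard library; the helper name `chi2_sf_even_df` is an assumption for this example, not a function from any package.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2.0
    terms = (half**k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


# Statistic and degrees of freedom as reported in the text.
p_value = chi2_sf_even_df(4.439, 8)
print(f"Hosmer-Lemeshow p-value: {p_value:.4f}")
```

A large p-value here means the observed and expected counts across the groups of fitted probabilities are close, which is why the null hypothesis of adequate fit is not rejected.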
For the test of the null hypothesis that β0 = 0, the Wald Z-statistic is 5.909 and the p-value is less than 0.001, so we reject the null hypothesis and conclude that β0 ≠ 0 at α = 0.05. The 95% confidence interval for the intercept is [52.41, 104.43], which excludes 0 and so supports this conclusion. For the test of the null hypothesis that β1 = 0, the Wald Z-statistic is -5.903 and the p-value is also less than 0.001, so we reject the null hypothesis and conclude that β1 ≠ 0 at α = 0.05. This means that the response and the explanatory variable are not independent of each other. The 95% confidence interval for the slope is [-0.063, -0.031], which also excludes 0 and supports this conclusion. Both the intercept and the effect of height are therefore significant.
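As a rough check on these tests, the two-sided p-value of a Wald Z-statistic can be computed from the standard normal distribution. The sketch below is an illustration in Python using only the reported estimates and Z-statistics. The standard errors are not given in the text, so the values printed as "implied SE" are backed out from Z = estimate / SE and should be read as approximations based on rounded inputs.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1).
    return math.erfc(abs(z) / math.sqrt(2.0))


# (estimate, Wald Z-statistic) pairs as reported in the text.
reported = {
    "intercept": (78.425, 5.909),
    "slope": (-0.047, -5.903),
}

for name, (estimate, z) in reported.items():
    implied_se = estimate / z  # assumption: derived here, not reported in the text
    print(f"{name}: p = {two_sided_p(z):.2e}, implied SE = {implied_se:.5f}")
```

Both p-values come out far below 0.001, in line with the conclusions of the two tests.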
The 95% confidence interval for the odds ratio is obtained by exponentiating the confidence interval for the slope, which gives [0.94, 0.97]. We are 95% confident that each additional millimeter of height multiplies the odds that the subject is female by a factor between 0.94 and 0.97.
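The exponentiation step can be sketched in a few lines of Python. The function name `odds_ratio_ci` and its `step_mm` parameter are assumptions introduced for this example. The per-10 mm interval is an added illustration of how the same slope interval rescales to a larger height difference; it is not a figure reported in the analysis.

```python
import math


def odds_ratio_ci(slope_lo: float, slope_hi: float, step_mm: float = 1.0):
    """Odds-ratio interval for a height increase of step_mm millimeters."""
    # A change of step_mm in height shifts the log-odds by slope * step_mm,
    # so the odds are multiplied by exp(slope * step_mm).
    return math.exp(slope_lo * step_mm), math.exp(slope_hi * step_mm)


# 95% confidence limits for the slope as reported in the text.
SLOPE_LO, SLOPE_HI = -0.063, -0.031

per_mm = odds_ratio_ci(SLOPE_LO, SLOPE_HI)
per_cm = odds_ratio_ci(SLOPE_LO, SLOPE_HI, step_mm=10.0)
print(f"per 1 mm:  [{per_mm[0]:.2f}, {per_mm[1]:.2f}]")
print(f"per 10 mm: [{per_cm[0]:.2f}, {per_cm[1]:.2f}]")
```

An odds ratio per millimeter that looks close to 1 still corresponds to a substantial change in the odds over a difference of a few centimeters.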
Below is a graph of the standardized Pearson residuals. The residuals do not appear to include any unusually large values; they all lie between -4 and 4, so there do not seem to be any problems.
The estimated median effective level occurs at a height of 1668.6 mm. At that height the predicted probability of being female is 0.5, so the subject is equally likely to be male or female.
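This height follows from setting the log-odds to zero, which gives height = -β0 / β1. A minimal Python sketch using the rounded estimates reported earlier is shown below; the variable names are assumptions for this example.

```python
import math

# Rounded coefficients as reported in the text (female = 1, height in mm).
B0 = 78.425
B1 = -0.047

# The predicted probability is 0.5 when B0 + B1 * height = 0.
median_height = -B0 / B1

# Sanity check: plug the height back into the fitted model.
logit_at_median = B0 + B1 * median_height
prob_at_median = 1.0 / (1.0 + math.exp(-logit_at_median))

print(f"median effective level: {median_height:.1f} mm")
print(f"P(female) at that height: {prob_at_median:.3f}")
```

Subjects shorter than this height are predicted to be more likely female, and taller subjects more likely male.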
The simple logistic regression model has a significant intercept and a significant slope, meaning that the effect of the subject’s height is significant. The predicted probability of the subject being female decreases as height increases, as seen in the graph of the fitted model. The model appears to be a good fit based on the results of the Hosmer-Lemeshow test. There could be reverse causality in the data, in that height depends on gender rather than the other way around, which may affect the results. Additionally, other characteristics of the human body, such as weight, might affect the outcome of the study. Adding other factors to the model could produce different results.