Introduction
- The Data
  - Prepare packages and workspace
  - Upload Data
Data Cleaning and Preprocessing :
Data Analysis :
- Algunas visualizaciones
Linear Regression Model
- Model Creation
- Model Fit

Introduction

In today’s world, understanding and predicting mortality rates in different countries is of paramount importance for public health officials, policymakers, and researchers. Mortality rates are not only indicators of a country’s health status but also reflect the effectiveness of its healthcare systems, socio-economic conditions, and various other factors.

The global COVID-19 pandemic has underscored the critical need for accurate mortality rate predictions to effectively allocate resources, implement preventive measures, and devise policies aimed at minimizing the impact of future health crises. However, mortality rates are influenced by a multitude of complex factors, including demographics, healthcare infrastructure, lifestyle choices, and environmental conditions.

In this context, the development of regression models to predict mortality rates in countries becomes a significant endeavor. By leveraging historical data and incorporating relevant predictors, such as population demographics, economic indicators, and healthcare expenditures, we can strive to create robust regression models capable of providing valuable insights into mortality trends and facilitating proactive interventions.

The ultimate goal of this research is to build a regression model that accurately predicts mortality rates in different countries. Such a model could empower decision-makers with actionable insights to enhance public health policies, allocate resources efficiently, and ultimately improve the well-being and longevity of populations worldwide.

The Data

Dataset Overview: Adult Mortality Rate (2019-2021) The dataset provides comprehensive information on adult mortality rates across different countries from 2019 to 2021. Below is a summary table detailing key variables and their descriptions:

Adult mortality rate (2019-2021).csv

Variable	Description
Countries	Country of study.
Continent	Continent location of the country.
Average_Population	Average population of the country under study for 2019-2021 in thousands.
Average_GDP	Average GDP of the country under study for 2019-2021 in millions of dollars.
Average_GDP_per_capita	Average GDP per capita of the country under study for 2019-2021 in dollars.
Average_HEXP	Health Expenditure Per Capita in the country under study in dollars.
Development_level	Level of development of the state under study (calculated by GDP per capita of the country). Please note that in this dataset we calculate this indicator only by calculating GDP per capita! Despite the fact that the United Nations (UN) does not have an unambiguous classification of countries into developed, developing and backward based on only one indicator, such as the amount of GDP per capita. It uses a wider range of economic, social and quality indicators to determine the level of development of countries.
AMR_female	Average mortality of adult women in the country under study (per 1000 adult women per year) for 2019-2023.
AMR_male	Average mortality of adult men in the country under study (per 1000 adult men per year) for 2019-2023.
Average_CDR	Average crude mortality rate for 2019–2021 in the country under study.

Prepare packages and workspace

Upload Data

`Adult_mortality` <- read.csv("C:/Users/gabri/Proyectos R/Adult Mortality Rate/Adult mortality rate (2019-2021).csv", stringsAsFactors=TRUE)

Data Cleaning and Preprocessing :

Quick View of Data and Structure :

#View of Structure
glimpse(Adult_mortality)

## Rows: 156
## Columns: 10
## $ Countries                          <fct> Afghanistan, Albania, Algeria, Ango…
## $ Continent                          <fct> Asia, Europe, Africa, Africa, North…
## $ Average_Pop.thousands.people.      <dbl> 38947.06, 2834.57, 43445.00, 33428.…
## $ Average_GDP.M..                    <dbl> 17995.64, 16263.16, 160325.54, 7214…
## $ Average_GDP_per_capita...          <dbl> 462.05, 5737.44, 3690.31, 2158.28, …
## $ Average_HEXP...                    <dbl> 80.53, 420.17, 215.53, 61.29, 868.6…
## $ Development_level                  <fct> Short, Short, Short, Short, Average…
## $ AMR_female.per_1000_female_adults. <dbl> 204.85, 53.54, 70.85, 218.03, 58.76…
## $ AMR_male.per_1000_male_adults.     <dbl> 322.17, 103.55, 102.21, 316.81, 103…
## $ Average_CDR                        <dbl> 7.08, 10.20, 4.78, 7.88, 6.06, 8.35…

Our dataset comprises a total of 156 observations and 10 columns, featuring variables of only two types :

Fct (Factor) : Comprising 3 variables
Dbl (Double) : Comprising 7 variables

Handling Missing Values :

#Check Missing Values
colSums(is.na(Adult_mortality))

##                          Countries                          Continent 
##                                  0                                  0 
##      Average_Pop.thousands.people.                    Average_GDP.M.. 
##                                  0                                  0 
##          Average_GDP_per_capita...                    Average_HEXP... 
##                                  0                                  0 
##                  Development_level AMR_female.per_1000_female_adults. 
##                                  0                                  0 
##     AMR_male.per_1000_male_adults.                        Average_CDR 
##                                  0                                  0

Great! We have 0 missing values in this Dataset

Handling Outliers :

#Quick view of Stadistical infomration about numerics variables
describe(
  Adult_mortality %>%
    select_if(is.numeric)
  
)

# Boxplots for better visualization of statistical information
ggplot(Adult_mortality, aes(x = "", y = Average_Pop.thousands.people.)) +
  geom_boxplot() +
  labs(title = "Boxplot of Average Population (thousands of people)",
       x = NULL,
       y = "Average Population (thousands)")

ggplot(Adult_mortality, aes(x = "", y = Average_GDP.M..)) +
  geom_boxplot() +
  labs(title = "Boxplot of Average GDP (in millions of dollars)",
       x = NULL,
       y = "Average GDP (M$)")

ggplot(Adult_mortality, aes(x = "", y = Average_GDP_per_capita...)) +
  geom_boxplot() +
  labs(title = "Boxplot of Average GDP per Capita",
       x = NULL,
       y = "Average GDP per Capita")

ggplot(Adult_mortality, aes(x = "", y = Average_HEXP...)) +
  geom_boxplot() +
  labs(title = "Boxplot of Average Health Expenditure Per Capita",
       x = NULL,
       y = "Average Health Expenditure Per Capita ($)")

ggplot(Adult_mortality, aes(x = "", y = AMR_female.per_1000_female_adults.)) +
  geom_boxplot() +
  labs(title = "Boxplot of AMR (per 1000 Female Adults)",
       x = NULL,
       y = "AMR (per 1000 Female Adults)")

ggplot(Adult_mortality, aes(x = "", y = AMR_male.per_1000_male_adults.)) +
  geom_boxplot() +
  labs(title = "Boxplot of AMR (per 1000 Male Adults)",
       x = NULL,
       y = "AMR (per 1000 Male Adults)")

ggplot(Adult_mortality, aes(x = "", y = Average_CDR)) +
  geom_boxplot() +
  labs(title = "Boxplot of Average Crude Mortality Rate",
       x = NULL,
       y = "Average Crude Mortality Rate")

In most cases, we observe two observations with very high values across all variables. These observations correspond to China and India, which, due to their immense population, tend to skew the data. However, it’s essential to retain these countries for our analysis and linear regression model, as their data is relevant and representative of the global population landscape.

Data Analysis :

Algunas visualizaciones

Observemos algunas visualizaciones :

# Bar plot for Average GDP per capita grouped by Continent
ggplot(Adult_mortality, aes(x = reorder(Continent, Average_GDP_per_capita...), 
                            y = Average_GDP_per_capita..., 
                            fill = Continent)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  labs(title = "Average GDP per Capita by Continent",
       x = "Continent",
       y = "Average GDP per Capita")

# Bar plot for Average Population grouped by Continent
ggplot(Adult_mortality, aes(x = reorder(Continent, Average_Pop.thousands.people.), 
                            y = Average_Pop.thousands.people., 
                            fill = Continent)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  labs(title = "Average Population by Continent",
       x = "Continent",
       y = "Average Population (thousands)")

# Density plot for Health Expenditure Per Capita grouped by Development Level
ggplot(Adult_mortality, aes(x = Average_HEXP..., fill = Development_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Health Expenditure Per Capita Density by Development Level",
       x = "Health Expenditure Per Capita",
       fill = "Development Level")

# Density plot for Adult Mortality Rate (Female) grouped by Development Level
ggplot(Adult_mortality, aes(x = AMR_female.per_1000_female_adults., fill = Development_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Adult Mortality Rate (Female) Density by Development Level",
       x = "AMR Female (per 1000 Female Adults)",
       fill = "Development Level")

# Density plot for Adult Mortality Rate (Male) grouped by Development Level
ggplot(Adult_mortality, aes(x = AMR_male.per_1000_male_adults., fill = Development_level)) +
  geom_density(alpha = 0.6) +
  labs(title = "Adult Mortality Rate (Male) Density by Development Level",
       x = "AMR Male (per 1000 Male Adults)",
       fill = "Development Level")

# Boxplot of Avg Crude Mortality by Continent
ggplot(Adult_mortality, aes(reorder(Continent, Average_CDR), Average_CDR, color = Continent)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Boxplot of Average Crude Mortality Rate by Continent",
       x = "Continent",
       y = "Average Crude Mortality Rate")

# Boxplot of Avg Crude Mortality by Continent
ggplot(Adult_mortality, aes(reorder(Development_level, Average_CDR), Average_CDR, color = Development_level)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Boxplot of Average Crude Mortality Rate by Development_level",
       x = "Development_level",
       y = "Average Crude Mortality Rate")

#Correlation between Variables

# Calculate correlation matrix
correlation <- cor(
  Adult_mortality %>%
    select_if(is.numeric)
)

# Melt correlation matrix for plotting
correlation_data <- melt(correlation)

# Plot correlation heatmap
ggplot(correlation_data, aes(Var1, Var2, fill = value))+
  geom_tile()+
  scale_fill_gradient2(low = "lightblue", high = "#F94C10", mid = "#F8DE22", 
                       midpoint = 0, limit = c(-1,1), 
                       name = "Correlation") +
  theme_minimal() +
  geom_text(aes(label = round(value, 2)))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "", y = "")

Linear Regression Model

Model Creation

# Create linear regression model
Regresion_model <- lm(Average_CDR ~  Average_GDP.M.. + AMR_male.per_1000_male_adults.+ Development_level+ Continent, data = Adult_mortality)

# Display model summary
summary(Regresion_model)

## 
## Call:
## lm(formula = Average_CDR ~ Average_GDP.M.. + AMR_male.per_1000_male_adults. + 
##     Development_level + Continent, data = Adult_mortality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0593 -1.0887 -0.2009  1.0670  6.6111 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.652e+00  8.025e-01   5.796 4.02e-08 ***
## Average_GDP.M..                 2.883e-07  1.188e-07   2.426   0.0165 *  
## AMR_male.per_1000_male_adults.  1.306e-02  2.431e-03   5.371 3.03e-07 ***
## Development_levelHigh          -1.464e+00  5.963e-01  -2.455   0.0153 *  
## Development_levelShort         -4.363e-01  4.830e-01  -0.903   0.3679    
## ContinentAsia                  -7.679e-01  5.569e-01  -1.379   0.1701    
## ContinentAustralia             -3.240e-01  6.573e-01  -0.493   0.6229    
## ContinentEurope                 5.459e+00  6.050e-01   9.022 9.75e-16 ***
## ContinentNorth America          5.286e-01  6.076e-01   0.870   0.3857    
## ContinentSouth America          7.510e-01  7.364e-01   1.020   0.3095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.961 on 146 degrees of freedom
## Multiple R-squared:  0.5847, Adjusted R-squared:  0.5591 
## F-statistic: 22.84 on 9 and 146 DF,  p-value: < 2.2e-16

Model Fit

Now let’s check the validity of our model:

# Add predictions and residuals to the data
Ajust <- Adult_mortality %>%
  add_predictions(Regresion_model) %>%
  add_residuals(Regresion_model)

# Plot diagnostic plots
par(mfrow = c(2,2))
plot(Regresion_model)

Checking the normality of the model:

# Jarque-Bera test for normality
jarque.bera.test(Ajust$resid)

## 
##  Jarque Bera Test
## 
## data:  Ajust$resid
## X-squared = 15.171, df = 2, p-value = 0.0005078

# Summary of residuals
summary(Ajust$resid)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6.0593 -1.0887 -0.2009  0.0000  1.0670  6.6111

Checking the Heteroscedasticity of the model :

# Breusch-Pagan test for heteroscedasticity
bptest(Regresion_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  Regresion_model
## BP = 15.796, df = 9, p-value = 0.07127

# Heteroscedasticity-robust coefficient test
coeftest(Regresion_model, vcov = vcovHC(Regresion_model, type = "HC2"))

## 
## t test of coefficients:
## 
##                                   Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)                     4.6515e+00  7.7820e-01  5.9773 1.663e-08 ***
## Average_GDP.M..                 2.8833e-07  3.4669e-07  0.8316   0.40697    
## AMR_male.per_1000_male_adults.  1.3058e-02  2.5455e-03  5.1297 9.086e-07 ***
## Development_levelHigh          -1.4640e+00  7.7829e-01 -1.8810   0.06196 .  
## Development_levelShort         -4.3627e-01  5.8916e-01 -0.7405   0.46019    
## ContinentAsia                  -7.6787e-01  4.4690e-01 -1.7182   0.08788 .  
## ContinentAustralia             -3.2395e-01  5.4732e-01 -0.5919   0.55484    
## ContinentEurope                 5.4589e+00  7.7925e-01  7.0053 8.383e-11 ***
## ContinentNorth America          5.2863e-01  5.5418e-01  0.9539   0.34171    
## ContinentSouth America          7.5097e-01  5.2250e-01  1.4373   0.15278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adult Mortality rate (2019-2021)

Gabriel Ordoñez

2024-04-01