Data sources

Our data were collected from a 2017 earthquake survey administered online in Oklahoma using Qualtrics. The data are part of the study by Ng'ombe and Boyer34, which established how much of the earthquake damage the oil and gas industry should be liable for, in terms of what earthquake victims suffer. The present study focuses on the factors associated with the decision by Oklahoma residents to buy home/property insurance against earthquakes in the state. The data were collected from the U.S. state of Oklahoma and its counties that experienced an increase in earthquakes linked to fracking. Before administration, the survey was approved by the Oklahoma State Institutional Review Board (IRB) and then conducted through Survey Sampling International (SSI), a reputable organization that employs numerous strategies to recruit respondents. The survey reached 1,153 individuals through SSI, of whom 813 successfully completed it. According to Ng'ombe and Boyer34, SSI maintains panels of potential respondents who answer online surveys for a given fee. The survey asked a series of questions on people's attitudes toward earthquakes in the state and their demographic and socio-economic characteristics. The variables used in this study are described in Table 1. Columns 1 and 2 contain the variable names and their definitions, respectively. The response variable is earthquake insurance: it indicates whether respondents answered Yes or No to a question asking whether they insured their home or property against earthquake-related damage.

Table 1 Variable definitions.

Independent variables are classified into two sets. The first set comprises socio-demographics such as age, gender, and race of the survey respondents. The second set comprises a range of variables relating to respondents' attitudes toward earthquakes in Oklahoma.
Table 2 shows descriptive statistics for these variables based on the full sample, with and without earthquake insurance. The t-test statistics for the mean differences of the variables between the two groups of respondents are presented in the last column of Table 2. The total sample size is 812, with 14.4% (114 respondents) reporting that their property/home was insured against earthquake damage. In terms of the mean differences of variables by insurance uptake status, we can see that the sample was not purely homogeneous in its responses, as most of the variable means differed statistically between the two groups. For example, the mean age difference between people with and without insurance was 1.01 years, which is statistically different from zero; those with insurance were older. A similar observation can be made for the numerous variables considered.

Table 2 Descriptive statistics of selected variables.

Independent variables

The variables included in all our models are shown in Table 1. They are based on a review of the literature on people's attitudes toward earthquakes [e.g., 11, 25, 34, 36, 42–52]. We included socio-demographic variables because they are expected to influence people's decision to have earthquake insurance as well as to help predict which individuals would buy it. In line with the existing literature, we expect female respondents to be more likely than males to own earthquake insurance because females are warier of environmental threats and calamities11,41. Regarding demographic factors such as race, ethnicity, education, and income, we expect a heterogeneous relationship with the likelihood of having earthquake insurance.
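The group comparisons reported in Table 2 can be sketched in code. The following minimal Python example (the study itself used R) applies Welch's unequal-variance t-test to a mean age difference between insured and uninsured respondents; all numbers are simulated for illustration, not the survey data.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for a two-sample mean comparison with unequal variances."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

rng = np.random.default_rng(0)
age_insured = rng.normal(46.0, 12.0, 114)    # hypothetical insured group
age_uninsured = rng.normal(45.0, 12.0, 698)  # hypothetical uninsured group

print(round(welch_t(age_insured, age_uninsured), 3))
```

Large absolute t values indicate mean differences that are unlikely to be zero, which is the basis of the homogeneity check described above.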
For example, Ansolabehere and Konisky42 found that minority groups are more skeptical of and concerned about coal and natural gas plants located near their homes, which we believe would encourage them to buy earthquake damage insurance. In a separate study, Boudet et al.43 observed that politically conservative respondents, those with at least an undergraduate degree, and older respondents were more likely to support hydraulic fracturing in the United States. Also, Ng'ombe and Boyer34 found varying levels of earthquake-related liability that respondents would assign to fracking companies in Oklahoma for any damage, which suggests that similarly varied outcomes could be expected regarding their decision to buy earthquake insurance. Moreover, we expect respondents who have lived in Oklahoma for a longer period to be more likely to own earthquake insurance because they may have more prior experience with earthquakes, especially since 2009, when the state began to experience numerous earthquakes10. As in Ng'ombe and Boyer34, we expect renters to be less likely to have earthquake damage insurance than property owners. By political affiliation, we expect Democrats (Republicans) to be more (less) likely to have earthquake insurance because they have shown greater (lesser) willingness to accept fracking-related benefits for regulation11. Boudet et al.43 and Davis and Fisk44 observed that Democrats are more committed than Republicans to controlling wastewater injection in order to prevent potential earthquakes. Following Ng'ombe and Boyer34, we included the extent of earthquake damage respondents' home or property incurred in the past when assessing respondents' attitudes toward earthquakes in Oklahoma. These include minor damage, moderate damage, major damage, and other damage as defined in Table 1.
There have been many lawsuits in Oklahoma by residents whose property incurred damage from earthquakes, but success stories related to compensation are few45,46. Nevertheless, we expect that people whose property or house incurred any earthquake-related damage will want to have their property insured against earthquakes, especially since most insurers have stated that they would be able to insure against both natural and man-made earthquakes in Oklahoma47. Other earthquake-related factors that we expect to influence the decision to have earthquake insurance include people's concerns about earthquake damage, the state's obligation to regulate wastewater injection, earthquake experience, beliefs, and knowledge of the importance of oil and gas companies to Oklahoma. For example, we hypothesize that those who want wastewater injection to be stopped abruptly, or who prefer to have earthquake insurance, are warier of the risks earthquakes pose to their property and therefore more likely to buy earthquake insurance.

Supervised machine learning algorithms

To achieve its objectives, this study uses the following supervised machine learning (ML) algorithms: the logit model, ridge regression, LASSO, decision trees, and random forests. We chose supervised ML over alternative methods because our research question is a classification problem. With supervised ML, the algorithm is trained to learn the mapping between the input data (predictor variables) and the output data (dependent variable), so that it can select the influential variables of the output data and, depending on the model used, also make predictions on new, previously unseen data. Supervised machine learning is the technique of choice for applications like ours, where a specific target variable must be both explained by influential input factors and predicted.
In our case, we employ supervised ML to uncover the variables that shape individuals' decisions regarding earthquake insurance acquisition while also predicting which respondents are likely to make a purchase. To clarify, supervised ML effectively categorizes variables into two main classes: output (e.g., the decision to buy insurance or not) and input (e.g., respondent characteristics like risk perception). This categorization assumes a significant correlation and, potentially, a causal relationship between the labeled input and output variables, offering a robust framework for our analysis40. In unsupervised ML, by contrast, the algorithm is given a dataset with no pre-existing labels or outputs (i.e., no correlation/causation assumptions between variables). The algorithm then attempts to find patterns or structure in the data on its own, without any guidance or supervision. Unsupervised learning is used in applications where the researcher is interested in discovering hidden patterns or groupings in the data, such as clustering similar customers together for targeted marketing39,48.

Logit, ridge regression, and LASSO

Let the dependent variable \({y}_{i}\) follow a Bernoulli distribution. In this context, \({y}_{i}\) represents whether an individual has their home or property insured against damage from earthquakes, and \(P({y}_{i}=1)\) denotes the probability of observing insurance coverage based on the available predictors in the data. We thus have a binary classification model that explains the probability of the classes \({y}_{i}=1\) or \({y}_{i}=0\) using the predictors described in Sect. 'Methods and materials'. Following James et al.40, the logit classifier is

$$P\left( {y_{i} = 1} \right) = \frac{e^{x_{i} \beta }}{1 + e^{x_{i} \beta }},$$
(1)
where observations in the database are indexed by i, with i = 1, …, N, \(\beta\) corresponds to the unknown parameters to be estimated, and \({x}_{i}\) is a vector of explanatory variables. Maximum likelihood estimation of the following log-likelihood function yields the parameter estimates

$${\mathcal{L}}\left( \beta \right) = \mathop \sum \limits_{i = 1}^{N} \left( {y_{i} x_{i} \beta - {\text{log}}\left( {1 + e^{x_{i} \beta } } \right)} \right)$$
(2)
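Eqs. (1) and (2) translate directly into code. The sketch below evaluates the logit probability and the Bernoulli log-likelihood with numpy; the data matrix and coefficient vector are hypothetical toy values, not estimates from the survey.

```python
import numpy as np

def logit_prob(X, beta):
    """Eq. (1): P(y=1 | x) = exp(x'b) / (1 + exp(x'b))."""
    z = X @ beta
    return np.exp(z) / (1.0 + np.exp(z))

def log_likelihood(beta, X, y):
    """Eq. (2): sum_i ( y_i * x_i'b - log(1 + exp(x_i'b)) )."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

# Hypothetical toy data: 5 respondents, intercept plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
beta = np.array([-2.0, 1.0])
print(round(log_likelihood(beta, X, y), 4))
```

Maximizing this function over \(\beta\) (e.g., by Newton's method) gives the maximum likelihood estimates described above.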
In high-dimensional settings, where the number of predictors is large, collinearity between the predictors can lead to unstable estimates and unreliable predictions. In such cases, penalized regression models are commonly employed to improve the accuracy and interpretability of the model. Penalized regression methods, such as ridge regression and LASSO, add a constraint to the equation to regularize the model39, resulting in reduced coefficients. This, in turn, shrinks the coefficients of less influential variables toward zero, improving the overall performance of the model40. In this study, we applied the ridge classifier and LASSO, which impose a penalty on the logit model to avoid overfitting and to identify influential variables in the decision to acquire earthquake insurance. The ridge classifier adds a fine-tuning parameter \(\lambda \ge 0\) to Eq. (2). Estimation of the coefficients using the ridge classifier is achieved by maximizing the following modified version of Eq. (2)

$${\mathcal{L}}_{ridge} \left( \beta \right) = \mathop \sum \limits_{i = 1}^{N} \left( {y_{i} x_{i} \beta - {\text{log}}\left( {1 + e^{x_{i} \beta } } \right)} \right) - \lambda \mathop \sum \limits_{j = 1}^{k} \beta_{j}^{2} ,$$
(3)
where k is the total number of penalized coefficients. The ridge classifier keeps all the coefficients in the final model, adding the squared magnitude of the coefficients \(\beta\) as a penalty term. If \(\lambda = 0\), the penalty vanishes, the model reduces to the standard logit, and overfitting remains a risk37,48. Increasing \(\lambda\) reduces the variance but raises the bias, making the model less accurate but more precise. We therefore used cross-validation when selecting the optimal value of \(\lambda\) to minimize the validation error. The LASSO offers an alternative regularization procedure in which a number of inputs is potentially eliminated from the model, thereby bypassing limitations of the ridge classifier. As presented by Hastie et al.48, the LASSO log-likelihood to be maximized is

$${\mathcal{L}}_{LASSO} \left( \beta \right) = \mathop \sum \limits_{i = 1}^{N} \left( {y_{i} x_{i} \beta - {\text{log}}\left( {1 + e^{x_{i} \beta } } \right)} \right) - \lambda \mathop \sum \limits_{j = 1}^{k} \left| {\beta_{j} } \right|.$$
(4)
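The study fits these penalized classifiers with R's glmnet package. As a rough, hypothetical analogue (an assumption, not the authors' code), scikit-learn's LogisticRegression exposes the same L2 (ridge, Eq. 3) and L1 (LASSO, Eq. 4) penalties, with its parameter C playing the role of 1/λ; the data below are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                          # hypothetical predictors
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Ridge-penalized logit (L2): shrinks all coefficients toward zero.
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# LASSO-penalized logit (L1): can set weak coefficients exactly to zero.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print((ridge.coef_ == 0).sum(), (lasso.coef_ == 0).sum())
```

Only two of the ten simulated predictors truly matter here, so the L1 fit typically zeroes out most of the rest, illustrating the variable-selection property discussed above.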
The LASSO algorithm adds a penalty term to the conventional logit model which sets the coefficients of less influential variables to zero. It therefore selects only those inputs that are most relevant among the potentially many variables in the database. Cross-validation was also used to select \(\lambda\).

Decision trees

Decision trees are a powerful and widely used tool for solving classification problems in machine learning. A decision tree is a tree-like model that uses a set of input features to make predictions about the target variable39,49. The use of a decision tree in this study is appealing because it provides ample visual information to predict which individuals are likely to insure against earthquake damage. The decision tree model consists of nodes and branches, where each node represents a feature and each branch represents a possible outcome for that feature. The tree is built by recursively splitting the data into subsets based on the values of the input features until the subsets become homogeneous with respect to the target variable48. Training the model involves searching for the best features on which to split the data and creating a hierarchy of nodes and branches that best separates the data into different classes. In our case, this produces a tree structure that can be used to predict earthquake damage insurance uptake for new, previously unseen data. One of the most important advantages of decision trees is their interpretability. Unlike other machine learning algorithms that can be difficult to interpret, such as random forests (next section), decision trees are easy to understand and can provide valuable insights into the decision-making process.
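This interpretability can be seen in a small sketch. The authors fit their trees with R's rpart; the following hypothetical scikit-learn stand-in fits a shallow tree to simulated data (the feature names are invented for illustration) and prints its if/else rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
# Hypothetical features: [age, years_in_state, prior_damage score]
X = rng.normal(size=(300, 3))
y = (X[:, 2] > 0).astype(int)   # toy rule: prior damage drives uptake

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as human-readable split rules.
print(export_text(tree, feature_names=["age", "years_in_state", "prior_damage"]))
```

Each printed node corresponds to a feature threshold, so a reader can trace exactly why a given respondent is predicted to buy insurance or not.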
Decision trees can also handle non-linear relationships between the input features and the target variable, making them a valuable tool in many real-world applications39. While decision trees may be easier to interpret, they can suffer from overfitting, where the model becomes too complex and captures noise in the data. To address this issue, several techniques have been developed, including pruning and ensemble methods like random forests40.

Random forests

Another ML technique we use to achieve our second objective is the random forest classifier. Random forest classifiers have a wide range of applications, including image classification, natural language processing, and fraud detection50. First introduced by Breiman51, random forest classifiers are an ensemble learning method that combines the outputs of multiple decision trees to make a final prediction. During the training phase, the algorithm constructs multiple decision trees on different subsets of the training data. At each split in a decision tree, a random subset of the input features is chosen as candidates for the split. This process helps to reduce overfitting and improve the generalization ability of the model. The final prediction is made by aggregating the predictions of all individual decision trees40. Random forests have become increasingly popular because of their ability to handle high-dimensional data and capture complex nonlinear relationships in the data, making them an effective tool for many real-world applications. Random forests have several advantages over other classification algorithms.
According to Kassambara39 and James et al.40, random forests are robust to overfitting, can handle large datasets with high dimensionality, can capture complex nonlinear relationships between input features and output labels, and are relatively easy to use, requiring minimal parameter tuning. In the present study, let x be a vector of the inputs described in Table 1. The inputs in Table 1 help us predict y, which reveals whether an individual has his/her property insured against earthquake damage. The training procedure for random forests applies bootstrap aggregation, or bagging, to tree learners (see James et al.40 and Breiman51 for more details). Thus, for a training set \(x={x}_{1}, \dots ,{x}_{n}\) with responses \(y={y}_{1}, \dots ,{y}_{n}\), bootstrap aggregation repeatedly selects a random sample (B times) with replacement from the training data and fits trees to these samples.

For b = 1, …, B:
Sample \(n\) training observations with replacement from \((x, y)\); call these \({x}_{b},{y}_{b}\).
Train a classification tree \({k}_{b}\) on \({x}_{b},{y}_{b}\).
Upon completion of training, make predictions for unseen samples \({x}{\prime}\) by taking the majority vote in the case of classification trees, or by averaging the predictions made by all the individual regression trees on \({x}{\prime}\) in the case of non-classification trees, as follows

$$\hat{k} = \frac{1}{B}\mathop \sum \limits_{b = 1}^{B} k_{b} (x{\prime} ),$$
(5)
Such a bootstrapping algorithm is considered to improve model performance because it decreases model variance without increasing the bias. An estimate of the uncertainty of the prediction is therefore the standard deviation of the predictions from all individual regression trees at \(x{\prime}\)40.

$$\sigma = \sqrt {\frac{{\mathop \sum \nolimits_{b = 1}^{B} (k_{b} (x{\prime} ) - \hat{k})^{2} }}{B - 1}} .$$
(6)
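The bagging loop above, with the aggregation of Eq. (5) and the per-tree spread of Eq. (6), can be sketched in a few lines. This is a hypothetical toy implementation on simulated data using scikit-learn trees (the study itself used R), not the authors' estimation code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                       # hypothetical inputs x
y = (X[:, 0] + X[:, 1] > 0).astype(int)             # hypothetical responses y

B, n = 25, len(X)
trees = []
for b in range(B):
    idx = rng.integers(0, n, n)                     # sample n rows with replacement
    t = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    trees.append(t.fit(X[idx], y[idx]))             # train k_b on (x_b, y_b)

x_new = rng.normal(size=(1, 5))                     # unseen sample x'
votes = np.array([t.predict(x_new)[0] for t in trees])
majority = int(votes.sum() > B / 2)                 # majority vote across trees, cf. Eq. (5)
sigma = votes.std(ddof=1)                           # spread of per-tree predictions, cf. Eq. (6)
print(majority, round(sigma, 3))
```

The `max_features="sqrt"` setting is the random-subset-of-features rule that distinguishes random forests from plain bagging, as discussed next.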
The optimal number of trees, B, can be determined in two ways: cross-validation or the out-of-bag error (OOBE), as described in James et al.40. The OOBE measures the average prediction error for each training sample \({x}_{i}\) using the trees that did not include \({x}_{i}\)37. In general, the testing and training errors tend to level off after a certain number of trees have been fitted. As noted by Silveira et al.37, this is the basic bagging approach for trees. Random forests, however, differ from this scheme in one key way: they use a modified learning algorithm that selects a random subset of features at each candidate split during the learning process. Silveira et al.37 contend that the selection of highly predictive features can lead to correlated results across the B trees, which can affect the accuracy of the model for earthquake damage insurance. Using a subset of features at each split, typically the square root of the total number of features, helps to prevent this correlation and improve accuracy40,49.

Performance measures

The performance of the logistic regression, ridge regression, and LASSO models was assessed by examining the magnitude and signs of the coefficients. Ridge regression and LASSO, being regularization methods, tend to push coefficients toward zero. Consequently, variables retaining non-zero coefficients in LASSO, or having relatively larger coefficients in ridge regression, are considered more influential39. This analysis provided insights into the extent to which these models effectively identify the most influential factors affecting earthquake insurance uptake. As previously mentioned, in the case of decision trees and random forests, individuals were classified as having a high likelihood of having earthquake damage insurance if their predicted probability exceeded 0.5.
Therefore, to evaluate the performance of the decision tree and random forest models for predictive modeling, various performance measures were employed, including accuracy, sensitivity, specificity, and precision, which are computed as:

$$Accuracy = \frac{TN + TP}{{TN + TP + FN + FP}}$$
(7)
$$Precision = \frac{TP}{{TP + FP}}$$
(8)
$$Sensitivity = \frac{TP}{{TP + FN}}$$
(9)
$$Specificity = \frac{TN}{{TN + FP}}$$
(10)
where TN are true negatives, FN are false negatives, TP are true positives, and FP are false positives. A true negative occurs when the model identifies a data point as part of the negative class and this identification is correct (e.g., the model identifies individual A as a non-buyer of insurance through learning, and this matches A's actual decision in the data). In the same context, FN, TP, and FP denote a false negative (e.g., A is identified as a non-buyer but this does not match the actual decision), a true positive (e.g., A is identified as a buyer and this matches the actual decision), and a false positive (e.g., A is identified as a buyer but this does not match the actual decision), respectively. Together, these metrics provided insights into the effectiveness of the decision tree and random forest models in correctly identifying those likely to buy earthquake insurance. Each performance measure falls within the range of 0 to 1, with higher values indicating better model performance on that metric. For instance, if FN and FP are unlikely to occur (i.e., the model accurately identifies individuals' positive/negative decisions through learning), the denominator of the Accuracy measure will be close to its numerator, so the measure will be close to 1. Accuracy is the fraction of cases correctly classified out of the total number of cases. Precision, also known as the positive predicted value, measures the proportion of correctly predicted positive cases out of the total number of predicted positive cases39. Sensitivity, also called recall or the true positive rate, is the proportion of positive cases that are correctly classified out of the total number of actual positive cases in the dataset40.
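The four measures of Eqs. (7)–(10) follow directly from the confusion-matrix counts; the counts in this short sketch are hypothetical, chosen only to show the arithmetic.

```python
def performance(tp, tn, fp, fn):
    """Compute accuracy, precision, sensitivity, specificity (Eqs. 7-10)."""
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical confusion-matrix counts for illustration.
m = performance(tp=40, tn=200, fp=10, fn=20)
print(m)
```

With these counts, precision is 40/50 = 0.8 and sensitivity is 40/60 ≈ 0.67, showing how the same model can trade off the two metrics.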
Finally, Specificity measures the proportion of negative cases that are correctly identified out of the total number of negative cases in the dataset. These performance measures are often derived from a confusion matrix, a 2 × 2 table that summarizes the results of a machine learning algorithm by classifying observations as TP, TN, FP, and FN.

Estimation methods

In this study, we used a range of R software packages, including glmnet, caret, rpart, and rattle, to estimate all our models52,53,54,55,56. To obtain reliable estimates, we divided our data into separate training and testing sets40. The training set comprised 70% of the total observations and was used to estimate model parameters, while the remaining 30% were used for out-of-sample estimation and prediction in the testing set. We used the glmnet package to implement the ridge and LASSO classifiers and to determine the optimal \(\lambda\) values that minimized the cross-validation prediction error. Specifically, we performed cross-validation to identify the optimal values of \(\lambda\) that gave the best models and found that \(\lambda\) = 0.0066 and \(\lambda\) = 0.0003 were optimal for ridge regression and LASSO, respectively40. We generated plots of the cross-validation error as a function of log(\(\lambda\)) for the ridge and LASSO classifiers (Figs. 1 and 2, respectively). Note that increasing the value of \(\lambda\) for the ridge regression (LASSO) tends to shrink the coefficients of factors that influence earthquake insurance toward zero (or exactly to zero).

Figure 1 Plot of cross-validation error for the ridge classifier.

Figure 2 Plot of cross-validation error for the LASSO classifier.

As for the predictive models, particularly the decision tree classifier, because of the large number of variables used to predict which individuals insure their property against earthquakes, the decision tree generated was too large, necessitating pruning.
Pruning simplifies the model by removing branches or nodes that do not contribute to its predictive power and prevents overfitting. We pruned the decision tree using recursive partitioning and selected the complexity parameter using tenfold cross-validation55. As mentioned before, to further improve on the performance of the decision tree model, we also employed random forests, which can reduce the model's variance and improve its accuracy40.
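The rpart-style pruning described above, where a complexity parameter is tuned by tenfold cross-validation, can be approximated in scikit-learn (a hypothetical analogue on simulated data, not the authors' code) via minimal cost-complexity pruning with ccp_alpha:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))                                 # hypothetical predictors
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)    # noisy hypothetical target

# Candidate complexity parameters from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]                                 # drop the root-only tree

# Pick the alpha with the best tenfold cross-validated accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(pruned.get_n_leaves())
```

Larger ccp_alpha values prune more aggressively, trading training-set fit for a smaller, more generalizable tree, which mirrors rpart's complexity-parameter tuning.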
https://www.nature.com/articles/s41598-023-48568-6