About Multiple Regression


Introduction

Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses that information to estimate the value of the dependent variable[6]. It is an extension of linear regression that uses two or more independent variables to predict the outcome of a dependent variable[3]. The objective of multiple regression analysis is to use the independent variables, whose values are known, to predict the value of the single dependent variable[1]. 

Here are some key principles of multiple regression statistics:

  • Linear Relationship: The first assumption of multiple linear regression is that there is a linear relationship between the dependent variable and each of the independent variables[2]. 
  • Homoscedasticity: Multiple linear regression assumes that the variance of the residuals is similar at every point along the linear model. This scenario is known as homoscedasticity[2].
  • Model Parameters: The parameters of a multiple regression model are usually estimated by ordinary least squares, which minimizes the sum of squared deviations between the observed and predicted values[4] (a sketch follows this list). 
  • Multiple Independent Variables: Multiple regression is a type of regression where the dependent variable shows a linear relationship with two or more independent variables[2]. 
  • Multicollinearity: Difficulties tend to arise when there are more than five independent variables in a multiple regression equation. One of the most frequent is that two or more of the independent variables are highly correlated with one another. This is called multicollinearity[5]. 
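
As a concrete illustration of the least-squares estimation mentioned above, here is a minimal sketch in Python. It assumes the NumPy and statsmodels packages are available; the variable names (price, advertising, sales) and the simulated data are purely hypothetical.

```python
# Minimal sketch: fitting a multiple linear regression by ordinary least squares.
# Assumes NumPy and statsmodels are installed; the variables (price, advertising,
# sales) and the simulated data are hypothetical, for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
price = rng.normal(10, 2, n)            # independent variable 1
advertising = rng.normal(50, 10, n)     # independent variable 2
sales = 5 + 1.5 * price + 0.3 * advertising + rng.normal(0, 2, n)  # dependent variable

X = sm.add_constant(np.column_stack([price, advertising]))  # add the intercept column
model = sm.OLS(sales, X).fit()          # least-squares estimates of the parameters
print(model.params)                     # intercept and the two slope coefficients
print(model.rsquared)                   # share of variance in the dependent variable explained
```

Each slope estimates the change in the dependent variable for a one-unit change in that predictor, holding the other predictors constant.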

Multiple regression analysis is used extensively in econometrics and financial inference[3]. It is a powerful tool for predicting outcomes based on multiple variables.

Read also: [[What statistical software can handle multiple regression]]

Citations:

[1] https://www.sciencedirect.com/topics/social-sciences/multiple-regression

[2] https://corporatefinanceinstitute.com/resources/data-science/multiple-linear-regression/

[3] https://www.investopedia.com/terms/m/mlr.asp

[4] https://influentialpoints.com/Training/multiple_linear_regression-principles-properties-assumptions.htm

[5] https://home.csulb.edu/~msaintg/ppa696/696regmx.htm

[6] https://byjus.com/maths/multiple-regression/


Multicollinearity

Multicollinearity is a statistical concept that occurs when two or more independent variables in a regression model are highly correlated with each other[1][3][5]. It is a problem because independent variables should be independent[6]. Multicollinearity can lead to several issues, including:

  • Poorly estimated or inflated coefficients[2].
  • Coefficients with signs that do not make sense[2].
  • Inflated standard errors for coefficients[2].
  • Unreliable regression estimates[4].
  • Difficulty in determining the effect of each predictor on the response[2].

Multicollinearity can exist when two independent variables are highly correlated, or if an independent variable is computed from other variables in the dataset[1]. It can also happen if two independent variables provide similar and repetitive results[1]. A first screen for multicollinearity is to calculate correlation coefficients for all pairs of predictor variables[5]. A correlation value of at least 0.4 is sometimes interpreted as indicating a multicollinearity problem, but that rule of thumb is incorrect[5]; pairwise correlations can also miss collinearity that involves several predictors at once, so multicollinearity can only be detected reliably by looking at all variables simultaneously[5]. 
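
A minimal sketch of that pairwise check, assuming pandas is available; the column names and values are hypothetical.

```python
# Minimal sketch: pairwise correlations among predictors as a first screen for
# multicollinearity. Assumes pandas; column names and values are hypothetical.
import pandas as pd

predictors = pd.DataFrame({
    "income":   [30, 45, 50, 62, 70, 85],
    "spending": [28, 40, 48, 60, 66, 80],   # nearly a copy of income
    "age":      [25, 32, 41, 38, 50, 47],
})

print(predictors.corr().round(2))   # correlation matrix for all predictor pairs
# A very high correlation between two predictors (income and spending here)
# suggests multicollinearity, but pairwise checks alone can miss collinearity
# that involves three or more variables.
```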

To address multicollinearity, the Variance Inflation Factor (VIF) can identify which variable or variables are redundant and can therefore be removed from the model[1]. Another solution is to combine the correlated variables into a single variable[6]. Centering the variables can also help to reduce multicollinearity[6]. 
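
Here is a minimal sketch of computing VIFs and centering, assuming pandas and statsmodels are installed (the document does not prescribe a particular tool); the predictor data are the same hypothetical values as above.

```python
# Minimal sketch: variance inflation factors (VIF) and centering. Assumes pandas
# and statsmodels are installed; the column names and values are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = pd.DataFrame({
    "income":   [30, 45, 50, 62, 70, 85],
    "spending": [28, 40, 48, 60, 66, 80],   # nearly a copy of income
    "age":      [25, 32, 41, 38, 50, 47],
})

X = sm.add_constant(predictors)            # include an intercept when computing VIFs
vifs = {name: variance_inflation_factor(X.values, i)
        for i, name in enumerate(X.columns) if name != "const"}
print(vifs)                                # values well above 5-10 flag redundant variables

centered = predictors - predictors.mean()  # centering: subtract each column's mean
```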

In summary, multicollinearity is a problem that occurs when two or more independent variables in a regression model are highly correlated with each other. It can lead to several issues, including poorly estimated coefficients, unreliable regression estimates, and difficulty in determining the effect of each predictor on the response. Multicollinearity can be detected by calculating correlation coefficients for all pairs of predictor variables, and it can be addressed by using the VIF, combining correlated variables, or centering the variables.

Read also: How can you detect multicollinearity.

Citations:

[1] https://www.investopedia.com/terms/m/multicollinearity.asp

[2] https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/multicollinearity.html

[3] https://online.stat.psu.edu/stat462/node/177/

[4] https://www.statisticshowto.com/multicollinearity/

[5] https://en.wikipedia.org/wiki/Multicollinearity

[6] https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/


How can you detect multicollinearity

There are several ways to detect multicollinearity in a regression model. Here are some methods:

  1. Bivariate Correlation: One popular detection method is based on the bivariate correlation between two predictor variables. If the correlation coefficient is above a certain threshold, such as 0.8, it may indicate multicollinearity[1].
  2. Variance Inflation Factor (VIF): The VIF is a measure of how much the variance of the estimated regression coefficient is increased due to multicollinearity in the model. A VIF value of 1 indicates no multicollinearity, while a value greater than 1 indicates some degree of multicollinearity. A VIF value greater than 5 or 10 is often considered a sign of severe multicollinearity[2].
  3. Eigenvalues: Eigenvalues can be used to detect multicollinearity in a regression model. If one or more eigenvalues are close to zero, it may indicate multicollinearity[4].
  4. Condition Number: The condition number is a measure of how sensitive the regression coefficients are to small changes in the data. A condition number greater than 30 indicates that there may be multicollinearity in the model[5] (see the sketch after this list).
  5. High Standard Errors: If the standard errors of the regression coefficients are much larger than the coefficients themselves, it may indicate multicollinearity[1].
  6. Wildly Different Coefficients: If the coefficients of the same variable are wildly different in different samples, it may indicate multicollinearity[1].
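
As an illustration of points 3 and 4, here is a minimal sketch of an eigenvalue and condition-number check using NumPy; the predictor matrix is hypothetical, and the predictors are standardized before the check.

```python
# Minimal sketch: eigenvalue and condition-number checks for multicollinearity.
# Assumes NumPy; the predictor matrix X (rows = observations) is hypothetical.
import numpy as np

X = np.array([[30, 28, 25],
              [45, 40, 32],
              [50, 48, 41],
              [62, 60, 38],
              [70, 66, 50],
              [85, 80, 47]], dtype=float)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each predictor
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)        # eigenvalues of the cross-product matrix
print(np.sort(eigvals))                        # eigenvalues near zero suggest multicollinearity
print(np.sqrt(eigvals.max() / eigvals.min()))  # condition number; > 30 is often read as a problem
```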

In summary, there are several ways to detect multicollinearity in a regression model, including bivariate correlation, VIF, eigenvalues, condition number, high standard errors, and wildly different coefficients. It is important to check for multicollinearity in a regression model because it can lead to unreliable estimates and difficulty in interpreting the results.

Read also: How can you address multicollinearity.

Citations:

[1] https://www.theanalysisfactor.com/eight-ways-to-detect-multicollinearity/

[2] https://towardsdatascience.com/everything-you-need-to-know-about-multicollinearity-2f21f082d6dc

[3] https://www.linkedin.com/advice/0/how-can-you-detect-avoid-multicollinearity-skills-data-analysis

[4] https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

[5] https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea

[6] https://www.section.io/engineering-education/multicollinearity/


How can you address multicollinearity

Multicollinearity can be a problem in regression models, but there are several ways to address it. Here are some methods:

  1. Remove Highly Correlated Predictors: One way to address multicollinearity is to remove one or more variables that are highly correlated with others and leave the more significant ones in the set[1].
  2. Combine Correlated Variables: Another solution is to combine correlated variables into a single variable[6]. This can be done by taking the average or sum of the correlated variables[1].
  3. Use Principal Component Analysis (PCA): PCA is a technique that can be used to reduce the dimensionality of the data by transforming the original variables into a smaller set of uncorrelated variables[4].
  4. Regularization: Regularization is a technique that can reduce the impact of multicollinearity by adding a penalty term to the regression equation[2] (see the sketch after this list).
  5. Centering the Variables: Centering the variables can also help to reduce multicollinearity. This involves subtracting the mean of each variable from each observation[6].
  6. Increase Sample Size: Increasing the sample size can also help to reduce multicollinearity by increasing the variability of the data[5].
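
As a sketch of points 3 and 4, here is a minimal example using scikit-learn (an assumed choice of library); the simulated, nearly collinear predictors and the alpha value are illustrative only.

```python
# Minimal sketch: ridge regularization and PCA as responses to multicollinearity.
# Assumes NumPy and scikit-learn; the simulated data and alpha value are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 + 2 * x1 + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)           # penalty term stabilizes the coefficients
print(ridge.intercept_, ridge.coef_)

pcs = PCA(n_components=1).fit_transform(X)   # replace correlated predictors with one uncorrelated component
```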

In summary, there are several ways to address multicollinearity in a regression model, including removing highly correlated predictors, combining correlated variables, using PCA, regularization, centering the variables, and increasing the sample size. It is important to address multicollinearity in a regression model because it can lead to unreliable estimates and difficulty in interpreting the results.

Citations:

[1] https://scottmduda.medium.com/identifying-and-addressing-multicollinearity-in-regression-analysis-ca86a21a347e

[2] https://www.section.io/engineering-education/multicollinearity/

[3] https://towardsdatascience.com/multi-collinearity-in-regression-fe7a2c1467ea

[4] https://blog.minitab.com/en/understanding-statistics/handling-multicollinearity-in-regression-analysis

[5] https://www.linkedin.com/advice/0/how-can-you-detect-avoid-multicollinearity-skills-data-analysis

[6] https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/


Homoscedasticity

Homoscedasticity, also known as homogeneity of variance, is a statistical concept that describes the consistency of variance in a dataset[6]. It is an assumption of equal or similar variances in different groups being compared[1]. In simpler terms, it means that the variability in the dependent variable does not change as the independent variable changes[6]. Homoscedasticity is an important assumption of linear regression models[4]. 
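
A common way to judge this assumption in practice is to plot residuals against fitted values. Here is a minimal sketch, assuming NumPy, statsmodels, and matplotlib are installed; the simulated data are illustrative only.

```python
# Minimal sketch: eyeballing homoscedasticity with a residual-versus-fitted plot.
# Assumes NumPy, statsmodels, and matplotlib; the simulated data are illustrative.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 1.0, 200)   # constant error variance (homoscedastic)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")   # a roughly even band around zero is consistent with homoscedasticity
plt.show()
```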

Here are some key points about homoscedasticity:

  • Homoscedasticity refers to the consistency of variance in a dataset[6].
  • It means that the spread or dispersion of data points remains constant across different levels of an independent variable[6].
  • Homoscedasticity is an assumption of equal or similar variances in different groups being compared[1].
  • It is an important assumption of linear regression models[4].
  • Homoscedasticity is generally considered desirable in statistical analysis because it helps to ensure the validity and reliability of results[6].
  • When data is homoscedastic, predictions will be more accurate[6].
  • Heteroscedasticity, on the other hand, occurs when the variance of the data points is not consistent across different levels of the independent variable[6].
  • Heteroscedasticity can lead to biased or inefficient estimates, affecting the accuracy and reliability of results[6].

In summary, homoscedasticity is a statistical concept that describes the consistency of variance in a dataset. It is an important assumption of linear regression models and generally considered desirable in statistical analysis because it helps to ensure the validity and reliability of results.

Read also:

  1. Homoscedasticity or heteroscedasticity
  2. How to get homoscedasticity

Citations:

[1] https://www.scribbr.com/frequently-asked-questions/what-is-homoscedasticity/

[2] https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

[3] https://www.investopedia.com/terms/h/homoskedastic.asp

[4] https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/homoscedasticity/

[5] https://www.wallstreetmojo.com/homoscedasticity/

[6] https://uedufy.com/what-is-homoscedasticity-assumption-in-statistics/


Homoscedasticity or heteroscedasticity

Homoscedasticity and heteroscedasticity are two statistical concepts that describe the consistency of variance in a dataset. Homoscedasticity refers to the situation where the variance of data points remains constant across different levels of an independent variable, while heteroscedasticity occurs when the variance of the data points is not consistent across different levels of the independent variable. Here are some points to consider when comparing homoscedasticity and heteroscedasticity:

Advantages of Homoscedasticity:

  • Homoscedasticity is generally considered desirable in statistical analysis because it helps to ensure the validity and reliability of results[6].
  • When data is homoscedastic, predictions will be more accurate[6].
  • Homoscedasticity is an important assumption of linear regression models[4].

Advantages of Heteroscedasticity:

  • Heteroscedasticity can occur naturally in some datasets, and it is not always possible to transform the data to achieve homoscedasticity[5].
  • Heteroscedasticity can provide valuable information about the relationship between the independent and dependent variables[1].
  • Heteroscedasticity can be addressed using weighted least squares regression, which can provide more accurate estimates than ordinary least squares regression[1] (sketched below).
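
Here is a minimal sketch of the weighted least squares idea, assuming statsmodels; the weighting scheme (error variance taken to grow with x) is illustrative rather than a general prescription.

```python
# Minimal sketch: weighted least squares when the error spread grows with x.
# Assumes NumPy and statsmodels; the data and the weights are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)     # heteroscedastic: noise grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                     # ordinary least squares for comparison
wls = sm.WLS(y, X, weights=1.0 / x**2).fit() # down-weight the noisier observations
print(ols.params, wls.params)                # coefficient estimates
print(ols.bse, wls.bse)                      # standard errors of the coefficients
```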

In summary, homoscedasticity is generally considered desirable in statistical analysis because it helps to ensure the validity and reliability of results, while heteroscedasticity can provide valuable information about the relationship between the independent and dependent variables. However, it is important to check the homoscedasticity assumption in a regression model, because heteroscedasticity can lead to unreliable estimates and difficulty in interpreting the results.

Read also: What are some common causes of heteroscedasticity in regression analysis.

Citations:

[1] https://www.wallstreetmojo.com/homoscedasticity/

[2] https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

[3] https://www.wallstreetmojo.com/heteroskedasticity/

[4] https://www.investopedia.com/terms/h/homoskedastic.asp

[5] https://www.vexpower.com/brief/homoskedasticity

[6] https://uedufy.com/what-is-homoscedasticity-assumption-in-statistics/

What are some common causes of heteroscedasticity in regression analysis

There are several common causes of heteroscedasticity in regression analysis, including:

  1. Wide Range of Data: Heteroscedasticity can occur when there is a wide range of data in the dataset. When the differences between the smallest and largest values are significant, larger residuals would be associated with higher values, causing an unequal variance of the residuals and resulting in heteroscedasticity[1].
  2. Change in Factor Proportionality: Heteroscedasticity can occur due to a change in factor proportionality. For example, in a regression analysis of food expenditures and income, people with lower incomes tend to have restricted food expenditures based on their budget. As incomes increase, people tend to spend more on food as they have more options and fewer budget restrictions. For wealthier people, they can access a variety of foods, leading to a change in factor proportionality and heteroscedasticity[3].
  3. Nature of the Variable: The nature of the variable can also be a major cause of heteroscedasticity. For example, in a regression analysis of family income and spending on luxury items, there is a strong, positive association between income and spending. However, the variance of the error term differs across values of the independent variable, leading to heteroscedasticity[6].
  4. Omission of Variables: Heteroscedasticity can also be caused by the omission of relevant variables from the model. For example, if the variable income is left out of an income-saving model, its effect is absorbed into the error term, the researcher cannot interpret much from the model, and heteroscedasticity results[5].
  5. Cross-Sectional Studies: Heteroscedasticity is more common in cross-sectional types of data than in time series types of data[5].

In summary, heteroscedasticity can occur due to a wide range of data, change in factor proportionality, nature of the variable, omission of variables, and cross-sectional studies. It is important to check for heteroscedasticity in a regression model because it can lead to unreliable estimates and difficulty in interpreting the results.

Citations:

[1] https://corporatefinanceinstitute.com/resources/data-science/heteroskedasticity/

[2] https://www.wallstreetmojo.com/heteroskedasticity/

[3] https://www.statology.org/heteroscedasticity-regression/

[4] https://itfeature.com/correlation-and-regression-analysis/introduction-reasons-and-consequences-of-heteroscedasticity

[5] https://www.statisticssolutions.com/heteroscedasticity/

[6] https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/homoscedasticity/

What statistical software can handle multiple regression

There are several statistical software options available for multiple regression analysis. Here are some major software options:

  1. JMP: JMP is a powerful software for multiple linear regression. It provides high-quality plots that can be used in publications.
  2. SAS Visual Analytics: SAS Visual Analytics is a software that allows you to visualize, analyze, and harness the power of your data.
  3. Minitab: Minitab is a statistical software that simplifies statistical charting and analysis.
  4. XLSTAT: XLSTAT is a statistical software for Excel that can be used to create simple or multiple linear regression models for explanation or prediction.
  5. SPSS: SPSS is a widely used statistical package that includes procedures for multiple linear regression.
  6. Other options include EngineRoom, NCSS[4], RegressIt[5], and Analyse-it[6].

Read also: Can Excel handle multiple regression analysis.

In summary, there are several major software options available for multiple regression analysis, including JMP, SAS Visual Analytics, Minitab, XLSTAT, SPSS, EngineRoom, NCSS, RegressIt, and Analyse-it. These software options offer a range of features and capabilities for regression analysis, and the choice of software will depend on the specific needs of the user.

Citations:

[1] https://www.researchgate.net/post/Which-is-the-best-software-for-the-regression-analysis

[2] https://www.researchgate.net/post/Is-there-any-software-available-for-multiple-regression-analysis

[3] https://www.getapp.com/business-intelligence-analytics-software/data-analytics/f/regression-analysis/

[4] https://www.ncss.com/software/ncss/regression-analysis-in-ncss/

[5] https://regressit.com

[6] https://analyse-it.com/landing/multiple-linear-regression-software


Can Excel handle multiple regression analysis

Yes, Microsoft Excel can handle multiple regression statistics. Excel has built-in data analysis tools that can be used to perform multiple regression analysis. The Regression data analysis tool in Excel, part of the Analysis ToolPak add-in, can be used to create simple or multiple linear regression models for explanation or prediction[1][2][3]. Excel also supports the LINEST and TREND functions for multiple regression analysis[1]. Additionally, there are several free and paid Excel add-ins available, such as RegressIt, that can perform multivariate descriptive data analysis and regression analysis with high-quality table and chart output[6]. In summary, Microsoft Excel is a useful tool for performing multiple regression analysis, and it offers a range of features and capabilities for regression analysis.

Citations:

[1] https://real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/

[2] https://www.wikihow.com/Run-a-Multiple-Regression-in-Excel

[3] https://www.statology.org/multiple-linear-regression-excel/

[4] https://answers.microsoft.com/en-us/msoffice/forum/all/running-a-regression-analysis-with-multiple-y/e69ffad4-311e-47a7-80c1-616f93915a60

[5] https://youtube.com/watch?v=Q5JlRmmHzsg

[6] https://regressit.com

Tutorial:

Excel tutorial for multiple regression analysis