
Hi Lionel,

Great sharing.

I want to combine linear regression with the EM algorithm in R without using any existing R package. Do you have any good notes or resources to share?

Great post, very easy to understand.

There is one thing that bothered me a little, though.

It's always good to check the assumptions and make sure everything checks, but it's not that crucial in some cases.

You only really need homoscedasticity and normality when you want a model for inference. Even then, the normality issue is sometimes not a big deal thanks to the 'pretty good' robustness of the Wald t statistic (homoscedasticity is a lot more important). You can even just bootstrap confidence intervals and standard errors.
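A minimal sketch of the bootstrap idea mentioned above, written in plain Python rather than R and run on simulated data. The data-generating values, the sample size, the number of resamples, and the helper names `ols_slope` and `bootstrap_slope_ci` are all illustrative assumptions. It resamples (x, y) pairs with replacement, refits the slope each time, and reads a percentile interval off the resampled slopes, so it does not lean on normally distributed residuals.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slope_ci(xs, ys, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated data (assumption): true slope 2, skewed and clearly non-normal errors
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 2.0 * x + (data_rng.expovariate(1.0) - 1.0) for x in xs]

slope = ols_slope(xs, ys)
ci_low, ci_high = bootstrap_slope_ci(xs, ys)
print(f"slope = {slope:.3f}, 95% bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Resampling pairs (rather than residuals) is the variant that also tolerates heteroscedasticity, which is why it was picked for this sketch.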

For predictive purposes you can use a linear model that doesn't satisfy any of the assumptions and still get pretty good results with it. Of course, a lot of the classical results won't hold.

Thanks for your comment.

In my area (experimental ecology) models are always used for inference purposes. I guess that for predictive purposes you would only care about the parameters, so the distributional assumptions are not so important. The linearity assumption will, however, be very important: fitting a straight line to a quadratic curve would make little sense, so I guess you would have to check the predictive ability of your linear regression. I would actually argue that in most cases we would not expect linear relations but rather more complex functions, leading to the use of more complex model families.

I think what cosmicmessenger is trying to point out is the fact that a linear model is very often a sufficient or quite good approximation of dependencies.

Imagine, e.g., a quadratic curve x^2, but with data only gathered in the interval x = [2;4]. A linear approximation to the parabola works quite well there; there will be systematic bias, but it is possibly very small compared to the "real" error. In fact, it turns out to be a question of how many "turns" your function makes in the area you want to model.
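The parabola example can be checked numerically. This is a small sketch in plain Python rather than R; the grid of 21 evenly spaced points and the helper name `fit_line` are arbitrary choices made for illustration. It fits a least-squares line to noise-free values of x^2 on [2, 4] and looks at what is left over.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Evenly spaced x values on [2, 4]; y is exactly x^2, with no noise added
xs = [2 + 0.1 * i for i in range(21)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
max_abs_resid = max(abs(r) for r in residuals)

print(f"fitted line: y = {intercept:.3f} + {slope:.3f} * x")
print(f"largest absolute residual: {max_abs_resid:.3f} (y itself runs from 4 to 16)")

# Residuals at the left end, the middle, and the right end of the interval:
# positive at both ends and negative in the middle, i.e. a systematic pattern.
print([round(r, 3) for r in (residuals[0], residuals[10], residuals[-1])])
```

The residuals are small relative to the range of y, but their sign pattern is not random, which is the systematic bias referred to above.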

Might it be that you forgot a link behind the word "here"?

"The second function Anova() allow you to define which type of sum-of-square you want to calculate (here is a nice explanation of their different assumptions)"

I totally did forget it. I have updated the post now, thanks!

No worries!

Exactly the link I was thinking about :)

Also: Good thing you removed the phrase "causal". There is a directional idea behind regression (which can easily be seen checking how the residuals are calculated), but direct causality is only one possible explanation for suspected dependencies.

"Regression is different from correlation because it try to put variables into equation and thus explain causal relationship between them, ..."

Sorry, but this is clearly wrong. First, a simple regression (Y = a + b*X) yields basically the same parameters as a correlation between the two variables as both procedures model a linear relationship. Second, what if you change your variables, i.e. X = a + b*Y. Do you magically switch the direction of causality?

Thanks for your critical comment.

Parameters from a linear regression only make sense if they can be linked to some hypotheses or ideas about how the variables are related. Switching the response and explanatory variables will not switch the direction of causality; the mechanisms that generated your data only work one way. It is then at the discretion of the analyst to find what makes sense. R will fit all possible models, but only a few will be relevant.

Correlation is different from linear regression as there is no directionality in correlation coefficients, while linear regression basically says: "X causes changes in Y". Even if the parameters yielded are very similar (a strong coefficient of correlation will certainly yield large regression coefficients), regression clearly puts a directionality in the relationship between two variables.

Thank you for your reply, Lionel.

You are right in saying that the analyst has to find what makes sense. Nevertheless, I think it is dangerous, especially for students who stumble upon your post, to read a claim like "This statistical method implies causal relationships between your variables."

The statistical method has nothing to do with causality as it is mere interpretation by the analyst or scientist.

Let me further elaborate my claim regarding the similarity of correlation and a simple linear regression (only one predictor): every correlation between two variables can be modeled as a linear regression. The decision on one of the two variables as the "predictor" and the other as the dependent variable is purely theoretical. And the fit of your model cannot say whether your assumption is true or false.
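The link between a simple regression slope and the correlation coefficient can be made concrete. The sketch below is plain Python rather than R, and the six data points are made up for illustration. It checks two standard identities: the slope of y on x equals r times sd(y)/sd(x), and the product of the two slopes (y on x, and x on y) equals r squared. Swapping the roles of the variables therefore changes the slope but says nothing new about direction.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Made-up illustrative data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 4.5, 3.0, 6.5, 5.0]

r = pearson_r(xs, ys)
b_yx = ols_slope(xs, ys)  # Y = a + b * X
b_xy = ols_slope(ys, xs)  # X = a' + b' * Y

print(f"r = {r:.4f}, slope of y on x = {b_yx:.4f}, slope of x on y = {b_xy:.4f}")
print("slope equals r * sd(y) / sd(x):", math.isclose(b_yx, r * sample_sd(ys) / sample_sd(xs)))
print("product of slopes equals r^2:  ", math.isclose(b_yx * b_xy, r ** 2))
```

Both fits describe the same linear association; only the scaling by the two standard deviations differs.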

You are right, I do get your concerns that people might be over-confident in their interpretation of simple regression models. I removed the word "causal" from the post.

In the end I do think that regression analysis enables us to get a glimpse of the processes (causes) generating our data. I am, however, aware that "causal" is a very strong word with some deep philosophical issues bound up with it (i.e. is our data ever going to be able to reflect the processes that we want to study?).

Thank you for being careful with this issue! (:

Very clear writing. Thank heavens. I've seen people bungle these simple ideas to the point of incomprehensibility. Now I've got something I can point others to.

FYI--as the other comments state, all the causality statements by the author are WRONG!

Even as a generalist, I know the phrase "correlation does not equal causation," so I didn't really have a problem with the use of the word. I get your point (more fully expressed by Charles Wheelan in his book _Naked Statistics_) "Regression analysis is meant to be used when the relation between variables can be expressed using a linear equation" and "can only demonstrate an association between two variables."

However, because we call one variable the "predictor", a looser use of the term "causality" does come into play. I'd never claim to be someone who could take on all the technical details of causality, etc., but I can say I've used regression for sourcing direct relationships such as that between processor loads and call failures. However, I'd be one of the first to scoff at a presentation linking the "lack of pirates" to the increase in global ocean temperatures.

So, I'll just have to bow out here, with a hyperlink to the same type of discussion on stack overflow

Here's the quick answer too: methodology allows you to infer causality. If you had random assignment, then you can infer it. It's the fundamental tenet of science, so enjoy and use liberally, rich

I'm not seeing a difference between the original and transformed validation plots. Can you point out the specific differences in the pictures? (à la "which one of these is not like the other" :)

Thanks for the article

In the top-left graphs the dispersion of the residuals on the right side is lower. In the top-right graphs the normality of the residuals looks better.

This aside I would also be a bit careful about this model, and would try adding more variables.

How important is it to use correlation with data? What areas can I apply Linear Regression?

Correlation is usually a first step in the analysis of data, used to single out potential links between the variables of interest. Linear regression can be applied in any area where you have good reason to think that one term is linearly linked to another, e.g. if plant productivity increases proportionally with temperature. If you have good reason to think that this linearity assumption does not hold for your data, and that classical data transformations do not work or make sense, then you have to use more complex models like Generalized Additive Models.

This is important because there are people that use the language of R to gather information or data interpretation in the city of Chicago. Now I'm stuck with integrating the R language city of Chicago data projects into correlation and linear regression. Mr. Lionel how can this be done?

Small typo in the definition of the Pearson correlation coefficient: the denominator is the product of the standard deviations of the two variables, not of their variances. Very good article.
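For reference, the corrected definition written out. The notation is assumed here rather than taken from the post: \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations.

```latex
r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing by the variances instead would leave a quantity that is not unit-free and is not bounded between -1 and 1.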

Thanks! Updated the post.

Chun-Jie Liu • 4 years ago

Great sharing