Nice post, but here are a few suggestions for improvement. The comment "the input variables need to be approximately normal in distribution" is not entirely correct. What is required is that only and solely the residuals are normally distributed. This becomes clear when we think of experimental data where the input variables are chosen numbers like 1, 2, 3, 4... In that case the input variables are certainly uniformly distributed, not normally distributed. What counts is that ONLY the residuals need to be normally distributed for the standard errors to be right. However, even without this assumption the regression is still valid. And if the residuals are not normal, the sampling distribution of the coefficient estimates still becomes approximately normal in large samples, thanks to the Central Limit Theorem (CLT). That is indeed the most beautiful gift of the universe to the statistician :) - it says "don't worry if the residuals are not normally distributed; if you have a large sample, the inference will automatically be approximately right."
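To make that concrete, here is a small simulation sketch (made-up coefficients and sample sizes): the errors are drawn from a heavily skewed exponential distribution, so the residuals are clearly not normal, yet the fitted slope still has an approximately normal sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# True model: y = 2 + 3*x + e, with skewed (exponential) mean-zero errors.
# The residuals are NOT normal, but the sampling distribution of the
# fitted slope is approximately normal once n is large (CLT).
n, reps = 500, 2000
slopes = np.empty(reps)
for i in range(reps):
    x = rng.uniform(0, 1, n)
    e = rng.exponential(1.0, n) - 1.0   # skewed errors, centered at zero
    y = 2 + 3 * x + e
    slopes[i] = np.polyfit(x, y, 1)[0]  # slope of the least-squares line

# For a normal sampling distribution, roughly 68% of the estimates
# should fall within one standard deviation of their mean.
within_1sd = np.mean(np.abs(slopes - slopes.mean()) < slopes.std())
print(slopes.mean(), within_1sd)
```

The slope estimates center on the true value of 3 and the one-standard-deviation coverage comes out close to the normal 68%, despite the skewed errors.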
Furthermore, we also do not require that "input variables and output variables are approximately linear" - linearity is what we are testing, not an assumption. The coefficients are the only components that need to enter linearly. Therefore, if you think that the effect of the input variable is quadratic, it is perfectly alright to regress y = a + (x^2)*b + e.
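A small sketch of this (with made-up coefficients): the model below is quadratic in x but still linear in the parameters a and b, so ordinary least squares recovers them by treating x^2 as the input column.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate y = 1.5 + 0.5 * x^2 + e: quadratic in x, linear in (a, b).
x = rng.uniform(-2, 2, 200)
y = 1.5 + 0.5 * x**2 + rng.normal(0, 0.1, 200)

# Regress y on x^2 by using x**2 as the design column.
X = np.column_stack([np.ones_like(x), x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to the true (a, b) = (1.5, 0.5)
```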
Thirdly, the statement that "the output variable needs a constant variance" is also not correct in this context. We only require the conditional distribution of the output variable given the input variables to have equal variance for homoskedasticity to hold, and thus to avoid having to use heteroskedasticity-robust standard errors (White). This is one reason we take the logarithm of the variables: there are indications that taking logs decreases the heteroskedasticity of the residuals and thus makes the estimation of the standard errors more efficient.
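A toy illustration of why the log can help (assuming multiplicative noise, with made-up numbers): on the raw scale the residual spread grows with x, while on the log scale it is roughly constant.

```python
import numpy as np

rng = np.random.default_rng(2)

# Multiplicative noise: y = exp(1 + 0.5*x) * u, so the spread of y grows
# with x (heteroskedastic), while log(y) = 1 + 0.5*x + log(u) has
# constant error variance (homoskedastic).
x = rng.uniform(0, 5, 4000)
u = rng.lognormal(mean=0.0, sigma=0.3, size=4000)
y = np.exp(1 + 0.5 * x) * u

def half_sds(resp):
    """Residual spread in the lower vs upper half of the x range."""
    fit = np.polyfit(x, resp, 1)
    resid = resp - np.polyval(fit, x)
    return resid[x < 2.5].std(), resid[x >= 2.5].std()

raw_lo, raw_hi = half_sds(y)          # upper half much wider
log_lo, log_hi = half_sds(np.log(y))  # roughly equal spread
print(raw_hi / raw_lo, log_hi / log_lo)
```

The raw-scale ratio is well above 1 while the log-scale ratio sits near 1, which is exactly the "logs reduce heteroskedasticity" effect.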
Under a generative model you *might* assume strong conditions, like the variables being normally distributed. See Gelman for some comments for and against this position: http://andrewgelman.com/2013/0... . For y = a + x^2 + e, y is linear in x^2 (x^2 being a derived column, an effect, or some other conversion, be it implicit or explicit). There are a lot of different ways to frame regression.
Hi, I would like to know whether, after the transformation, the data may pass normality tests.
Why would anyone calculate the spread on log values and not on the real measured data?
Hi Soren,
It can be relevant if you are interested (for example) in comparing the means of the two populations (after performing a log transform) by using something like a t-test.
Since the transformation is monotone, it preserves the location of statistics such as the median, so the t-test may even be interpreted on the original scale.
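A sketch of that idea with simulated lognormal data (the sample sizes and group means below are made up): a t-test on the log scale compares the means of the logs, and exponentiating the difference of those means gives the ratio of the two geometric means on the original scale.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two lognormal populations; work on the log scale, where they are normal.
a = np.log(rng.lognormal(mean=1.0, sigma=0.5, size=300))
b = np.log(rng.lognormal(mean=1.3, sigma=0.5, size=300))

# Welch two-sample t statistic, computed by hand with numpy.
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t = (a.mean() - b.mean()) / se

# Back-transform: ratio of geometric means on the original scale.
gm_ratio = np.exp(a.mean() - b.mean())
print(t, gm_ratio)
```

The large (negative) t statistic flags the difference on the log scale, and gm_ratio reads naturally on the original scale as "group a's geometric mean is about exp(1.0 - 1.3) of group b's."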
And of course, in various situations, the scaled data may be a relevant quantity of interest.
I'd be curious to read more thoughts on the matter.
Best,
Tal
I like the practical approach of this blog. Very useful. But I wonder: with this 'signed log' function you appear to get a bimodal distribution, which is probably much harder to model. How do you deal with this?
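For what it's worth, here is a small sketch of why a signed log can come out bimodal. I'm assuming a signed-log of the form sign(x) * log10(|x|) for |x| > 1 and 0 otherwise, which is one common definition and may not be exactly the post's function:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed signed-log: sign(x) * log10(|x|) for |x| > 1, else 0.
def signed_log(x):
    return np.where(np.abs(x) > 1, np.sign(x) * np.log10(np.abs(x)), 0.0)

x = rng.normal(0, 100, 10000)  # unimodal, centered at 0
z = signed_log(x)

# Mass piles up on both sides of 0 and thins out near 0: two modes.
counts, _ = np.histogram(z, bins=[-3, -1, 1, 3])
print(counts)  # middle bin is much smaller than the two flanks
```

Values with small |x| land near 0 on the signed-log scale, but the bulk of a wide, zero-centered sample gets pushed out to two clumps on either side, hence the bimodal shape.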
Hi Lydia,
It depends on your analysis. But it could well be that this is the correct distribution to work with.
Or - we can go and use non-parametric methods in order to mitigate the complex distribution.
If you or others have more to add - I'd be happy to read.
Tal