predict.lm() in a loop. warning: prediction from a rank-deficient fit may be misleading

You can inspect the predict function with body(predict.lm). There you will see this line:

if (p < ncol(X) && !(missing(newdata) || is.null(newdata))) warning("prediction from a rank-deficient fit may be misleading")

This warning checks if the rank of your data matrix is at least equal to the number of parameters you want to fit. One way … Read more
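For example, a perfectly collinear predictor produces a rank-deficient fit, and a subsequent predict() on new data then triggers exactly this warning (a minimal made-up sketch):

set.seed(1)
x1 <- rnorm(20)
x2 <- 2 * x1                          # perfectly collinear with x1, so lm() drops it
y  <- x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)                # coefficient for x2 is NA; rank p < ncol(model matrix)
predict(fit, newdata = data.frame(x1 = 0.5, x2 = 1))
# Warning: prediction from a rank-deficient fit may be misleading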

scipy, lognormal distribution – parameters

The distributions in scipy are coded in a generic way with respect to the two parameters location and scale, so that location is the parameter (loc) which shifts the distribution to the left or right, while scale is the parameter which compresses or stretches the distribution. For the two-parameter lognormal distribution, the “mean” and “std dev” correspond … Read more
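Concretely (writing mu and sigma here for the mean and standard deviation of log(X), notation added for illustration), the generic form is X = loc + scale * Y with Y the standard shape, so for the lognormal case:

s = sigma
scale = exp(mu)
loc = 0 (for the standard two-parameter lognormal)

which gives log(X) ~ N(mu, sigma^2).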

Stepwise regression using p-values to drop variables with nonsignificant p-values

Show your boss the following:

set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))
y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))

Which gives:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.1525     0.3066  -0.498  0.61995
x1             1.8693     0.6045   3.092  0.00261 **
x2b            2.5149     0.4334   5.802 8.77e-08 ***
x2c            0.3089     0.4475   0.690  0.49180
x1:x2b        -1.1239     0.8022  -1.401  0.16451
x1:x2c        -1.0497 … Read more

Standard Deviation in R Seems to be Returning the Wrong Answer – Am I Doing Something Wrong?

Try this

R> sd(c(2,4,4,4,5,5,7,9)) * sqrt(7/8)
[1] 2
R>

and see the rest of the Wikipedia article for the discussion about estimation of standard deviations. Using the formula employed ‘by hand’ (with divisor N) leads to a biased estimate, which is why R’s sd() uses the divisor N-1; multiplying by sqrt(7/8) = sqrt((N-1)/N) converts R’s result back to the divisor-N value. Here is a key quote: The term standard deviation of the sample is used … Read more
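To make the two conventions explicit (a small check on the same data, not part of the original answer):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sqrt(sum((x - mean(x))^2) / length(x))       # divisor N,   the 'by hand' value: 2
sd(x)                                        # divisor N-1, R's default: about 2.138
sd(x) * sqrt((length(x) - 1) / length(x))    # rescale back to the divisor-N value: 2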

Pythonic way of detecting outliers in one dimensional observation data

The problem with using percentiles is that the points identified as outliers are a function of your sample size. There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. “anything above/below this value is unrealistic because…”) … Read more
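The first point is easy to demonstrate: a fixed-percentile rule flags roughly the same fraction of points whether or not any real outliers are present, so the number flagged simply grows with the sample size (a quick sketch in R):

set.seed(42)
x <- rnorm(1000)                     # a "clean" sample with no true outliers
cuts <- quantile(x, c(0.01, 0.99))
sum(x < cuts[1] | x > cuts[2])       # ~20 points flagged anyway; with n = 10000 it would be ~200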

Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary

This is not really a mixed-model-specific question, but rather a general question about model parameterization in R. Let’s try a simple example.

set.seed(101)
d <- data.frame(x=sample(1:4,size=30,replace=TRUE))
d$y <- rnorm(30,1+2*d$x,sd=0.01)

x as numeric

This just does a linear regression: the x parameter denotes the change in y per unit of change in x; the intercept … Read more
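As a quick check of the "x as numeric" case described above (using the d simulated above; not part of the original excerpt):

coef(lm(y ~ x, data = d))    # intercept near 1, slope near 2, matching the simulated 1 + 2*x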