Chapter 14 Creating a simulated data set

From the tutorial on this page

14.1 Part 1: Independent samples from a normal distribution

Consider the following first before you start doing stuff: - How many subjects are in each condition? - What are the means and standard deviations of each group?

Set that shit below.

# number of subjects per group
A_sub_n <- 50
B_sub_n <- 50

# distribution parameters
A_mean  <- 10
A_sd    <- 2.5

B_mean  <- 11
B_sd    <- 2.5

Now generate scores for each group

A_scores <- rnorm(A_sub_n, A_mean, A_sd)
B_scores <- rnorm(B_sub_n, B_mean, B_sd)

Technically you could stop here and just analyze the data in this fashion…but its better to organize it into a table. One that looks like something you would import after real data collection.

So do that next; make it look nice.

dat <- tibble(
  sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
  score = c(A_scores, B_scores)
)

head(dat)
## # A tibble: 6 x 2
##   sub_condition score
##   <chr>         <dbl>
## 1 A              8.50
## 2 A              8.26
## 3 A             13.4 
## 4 A             11.6 
## 5 A              4.04
## 6 A              9.68

Always perform a quality and consistency check on your data to verify that shit’s ok.

dat %>%
  group_by(sub_condition) %>%
  summarise(n = n() ,
            mean = mean(score),
            sd = sd(score))
## # A tibble: 2 x 4
##   sub_condition     n  mean    sd
##   <chr>         <int> <dbl> <dbl>
## 1 A                50  10.1  2.53
## 2 B                50  10.4  2.04

14.2 Part 2: Creating data sets with quantitative and categorical variables

From the web page at this link

14.2.1 2.a. DATA WITH NO DIFFERENCE AMONG GROUPS

Critically important notes to know: When you use the rep() function, there are several different arguments you can specify inside it that control how stuff is repeated:

  • using rep(x, each= ) repeats things element-wise; each element gets replicated n times, in order
rep(c("A","B"), each=3)
## [1] "A" "A" "A" "B" "B" "B"
  • using rep(x, times= ) repeats the sequence; the vector as a whole, as it appears, will be repeated with one sequence following the next
rep(c("A","B", "C", "D", "E"), times=3)
##  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
  • using rep(x, length.out) repeats only the number of elements you specify, in their original order
rep(c("A","B", "C", "D", "E"), length.out=3)
## [1] "A" "B" "C"

In this particular data, we want every combination of group and letter to be present ONCE.

letters=c("A","B","C","D","E")

tibble(group = rep(letters[1:2], each = 3),
           factor = rep(LETTERS[3:5], times = 2),
           response = rnorm(n = 6, mean = 0, sd = 1) )
## # A tibble: 6 x 3
##   group factor response
##   <chr> <chr>     <dbl>
## 1 A     C        -0.355
## 2 A     D         0.178
## 3 A     E         0.610
## 4 B     C        -0.465
## 5 B     D         0.401
## 6 B     E        -0.727

14.2.2 2.b. Data WITH A DIFFERENCE among groups

What if we want data where the means are different between groups? Let’s make two groups of three observations where the mean of one group is 5 and the other is 10. The two groups have a shared variance (and so standard deviation) of 1.

14.2.2.1 Some notes first

Creating a difference between the two group’s average score means we have to tell R to sample itteratively from distributions with different means. We do this by specifying a vector of means within rnorm, like so:

response = rnorm(n = 6, mean = c(5, 10), sd = 1)
response
## [1] 2.754076 9.609848 5.285164 9.798403 4.205527 7.954296

You can see that: 1. draw 1 is from the distribution \((\mu=5,\sigma=1)\) 2. draw 2 is from the distribution \((\mu=5,\sigma=1)\)

And this process repeats a total of six times.

And if you happen to also specify a vector of standard deviations (purely to demonstrate what is happening, we won’t actually do this), the first mean is paired with the first SD; the second mean is paired with the second SD; and so on.

rnorm(n = 6, mean = c(5, 10), sd = c(2,0.1))
## [1]  4.987252  9.950492  4.149175  9.974044  2.077211 10.115511

14.2.2.2 Ok, back to creating the data

If you want there to be differences between the groups, we need to change the way the vector of factors is replicated, in addition to specifying the vector of means. We want to ensure that the sequence of “A”, “B” in the group column matches the sequence repeated in the response column.

Here we are going to use length.out so that the whole sequence of A,B is repeated exactly in line with the alternating drawing from \(\mu=5\), \(\mu=10\).

It’s often best to do this by building each thing separately, and then combining it into a tibble when you have it figured out.

group=rep(letters[1:2], length.out = 6)
group
## [1] "A" "B" "A" "B" "A" "B"
response=rnorm(n = 6, mean = c(5, 10), sd = 1)
response
## [1] 4.126291 8.969935 3.874072 9.828078 5.393620 9.125079
tibble(group,
       response)
## # A tibble: 6 x 2
##   group response
##   <chr>    <dbl>
## 1 A         4.13
## 2 B         8.97
## 3 A         3.87
## 4 B         9.83
## 5 A         5.39
## 6 B         9.13

14.2.3 2.c. Data with MULTIPLE QUANTITATIVE VARIABLES with groups

14.3 Part 3: Repeatedly simulate samples with replicate()

Instead of drawing values one at a time from a distribution, we want to do it many times. This is a job for replicate(). What replicate() does is run a function repeatedly.

The replicate() function will perform a given operation as many times as you tell it to. Here we tell it to generate numbers from the distribution \(N~(\mu=0, \sigma=1)\), three times (as specified in the n=3 argument in line one)

replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1), 
          simplify = FALSE )
## [[1]]
## [1] -0.51287813  0.63520496  0.91589225  0.07678404  0.14043466
## 
## [[2]]
## [1] -0.4779480 -0.9642717  0.2279148 -0.4471080  0.1064011
## 
## [[3]]
## [1] -0.7956522  0.8792829 -0.3071313  0.1094047  0.1135288

The argument simplify=FALSE tells it to return the output as a list. If you set this to TRUE it returns a matrix instead

replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1), 
          simplify = TRUE )
##            [,1]       [,2]       [,3]
## [1,]  0.6787630 -1.3669316 -0.9293416
## [2,] -0.8953333 -0.9607781 -1.9034088
## [3,] -0.6243654  1.0369531 -1.0106750
## [4,]  0.2827911  0.1516815  1.7664099
## [5,]  1.6380746 -1.3831001  1.8825689

Specifying as.data.frame() with the matrix output can turn it into a data frame.

replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1), 
          simplify = TRUE ) %>% 
  as.data.frame() %>% 
  rename(sample_a=V1, sample_b=V2, sample_c=V3)
##       sample_a  sample_b   sample_c
## 1  0.088180361 0.6301284 -1.3532818
## 2  1.542517588 1.5851132 -1.7461287
## 3 -0.294668964 0.7840522 -1.2377223
## 4  0.003613039 0.9598645 -0.2833658
## 5 -0.697784050 1.2285486  0.9498306

14.4 Part 4: repeatedly making whole data sets

This is combining parts 2 and 3 to repeatedly create and sample data sets, resulting in a list of many data sets.

simlist = replicate(n = 3, 
          expr = data.frame(group = rep(letters[1:2], each = 3),
                            response = rnorm(n = 6, mean = 0, sd = 1) ),
          simplify = FALSE)

simlist
## [[1]]
##   group  response
## 1     A 1.3681197
## 2     A 0.1706618
## 3     A 0.1858557
## 4     B 2.2231589
## 5     B 0.8016017
## 6     B 0.6890034
## 
## [[2]]
##   group    response
## 1     A  1.69969710
## 2     A  0.01731652
## 3     A  0.91465872
## 4     B -2.81221913
## 5     B -0.09187946
## 6     B -0.63539913
## 
## [[3]]
##   group   response
## 1     A  0.5521644
## 2     A -0.3389562
## 3     A -0.2929999
## 4     B  0.2710472
## 5     B -1.3079854
## 6     B -1.8641999

14.5 Part 5: Using purrr

See this blog post


  1. Kruschke, J. (2015). Goals, power, and sample size. In J. K. Kruschke (Ed.), Doing bayesian data analysis: A tutorial with r, jags, and stan (2nd ed., pp. 359-398). Academic Press.↩︎