Chapter 14 Creating a simulated data set
From the tutorial on this page
14.1 Part 1: Independent samples from a normal distribution
Consider the following first before you start doing stuff: - How many subjects are in each condition? - What are the means and standard deviations of each group?
Set that shit below.
# number of subjects per group
<- 50
A_sub_n <- 50
B_sub_n
# distribution parameters
<- 10
A_mean <- 2.5
A_sd
<- 11
B_mean <- 2.5 B_sd
Now generate scores for each group
<- rnorm(A_sub_n, A_mean, A_sd)
A_scores <- rnorm(B_sub_n, B_mean, B_sd) B_scores
Technically you could stop here and just analyze the data in this fashion…but its better to organize it into a table. One that looks like something you would import after real data collection.
So do that next; make it look nice.
<- tibble(
dat sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
score = c(A_scores, B_scores)
)
head(dat)
## # A tibble: 6 x 2
## sub_condition score
## <chr> <dbl>
## 1 A 8.50
## 2 A 8.26
## 3 A 13.4
## 4 A 11.6
## 5 A 4.04
## 6 A 9.68
Always perform a quality and consistency check on your data to verify that shit’s ok.
%>%
dat group_by(sub_condition) %>%
summarise(n = n() ,
mean = mean(score),
sd = sd(score))
## # A tibble: 2 x 4
## sub_condition n mean sd
## <chr> <int> <dbl> <dbl>
## 1 A 50 10.1 2.53
## 2 B 50 10.4 2.04
14.2 Part 2: Creating data sets with quantitative and categorical variables
From the web page at this link
14.2.1 2.a. DATA WITH NO DIFFERENCE AMONG GROUPS
Critically important notes to know: When you use the rep()
function, there are several different arguments you can specify inside it that control how stuff is repeated:
- using
rep(x, each= )
repeats things element-wise; each element gets replicated n times, in order
rep(c("A","B"), each=3)
## [1] "A" "A" "A" "B" "B" "B"
- using
rep(x, times= )
repeats the sequence; the vector as a whole, as it appears, will be repeated with one sequence following the next
rep(c("A","B", "C", "D", "E"), times=3)
## [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
- using
rep(x, length.out)
repeats only the number of elements you specify, in their original order
rep(c("A","B", "C", "D", "E"), length.out=3)
## [1] "A" "B" "C"
In this particular data, we want every combination of group
and letter
to be present ONCE.
=c("A","B","C","D","E")
letters
tibble(group = rep(letters[1:2], each = 3),
factor = rep(LETTERS[3:5], times = 2),
response = rnorm(n = 6, mean = 0, sd = 1) )
## # A tibble: 6 x 3
## group factor response
## <chr> <chr> <dbl>
## 1 A C -0.355
## 2 A D 0.178
## 3 A E 0.610
## 4 B C -0.465
## 5 B D 0.401
## 6 B E -0.727
14.2.2 2.b. Data WITH A DIFFERENCE among groups
What if we want data where the means are different between groups? Let’s make two groups of three observations where the mean of one group is 5 and the other is 10. The two groups have a shared variance (and so standard deviation) of 1.
14.2.2.1 Some notes first
Creating a difference between the two group’s average score means we have to tell R to sample itteratively from distributions with different means. We do this by specifying a vector of means within rnorm
, like so:
= rnorm(n = 6, mean = c(5, 10), sd = 1)
response response
## [1] 2.754076 9.609848 5.285164 9.798403 4.205527 7.954296
You can see that: 1. draw 1 is from the distribution \((\mu=5,\sigma=1)\) 2. draw 2 is from the distribution \((\mu=5,\sigma=1)\)
And this process repeats a total of six times.
And if you happen to also specify a vector of standard deviations (purely to demonstrate what is happening, we won’t actually do this), the first mean is paired with the first SD; the second mean is paired with the second SD; and so on.
rnorm(n = 6, mean = c(5, 10), sd = c(2,0.1))
## [1] 4.987252 9.950492 4.149175 9.974044 2.077211 10.115511
14.2.2.2 Ok, back to creating the data
If you want there to be differences between the groups, we need to change the way the vector of factors is replicated, in addition to specifying the vector of means. We want to ensure that the sequence of “A”, “B” in the group
column matches the sequence repeated in the response
column.
Here we are going to use length.out
so that the whole sequence of A,B is repeated exactly in line with the alternating drawing from \(\mu=5\), \(\mu=10\).
It’s often best to do this by building each thing separately, and then combining it into a tibble when you have it figured out.
=rep(letters[1:2], length.out = 6)
group group
## [1] "A" "B" "A" "B" "A" "B"
=rnorm(n = 6, mean = c(5, 10), sd = 1)
response response
## [1] 4.126291 8.969935 3.874072 9.828078 5.393620 9.125079
tibble(group,
response)
## # A tibble: 6 x 2
## group response
## <chr> <dbl>
## 1 A 4.13
## 2 B 8.97
## 3 A 3.87
## 4 B 9.83
## 5 A 5.39
## 6 B 9.13
14.3 Part 3: Repeatedly simulate samples with replicate()
Instead of drawing values one at a time from a distribution, we want to do it many times. This is a job for replicate()
. What replicate()
does is run a function repeatedly.
The replicate()
function will perform a given operation as many times as you tell it to. Here we tell it to generate numbers from the distribution \(N~(\mu=0, \sigma=1)\), three times (as specified in the n=3
argument in line one)
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1),
simplify = FALSE )
## [[1]]
## [1] -0.51287813 0.63520496 0.91589225 0.07678404 0.14043466
##
## [[2]]
## [1] -0.4779480 -0.9642717 0.2279148 -0.4471080 0.1064011
##
## [[3]]
## [1] -0.7956522 0.8792829 -0.3071313 0.1094047 0.1135288
The argument simplify=FALSE
tells it to return the output as a list. If you set this to TRUE
it returns a matrix instead
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1),
simplify = TRUE )
## [,1] [,2] [,3]
## [1,] 0.6787630 -1.3669316 -0.9293416
## [2,] -0.8953333 -0.9607781 -1.9034088
## [3,] -0.6243654 1.0369531 -1.0106750
## [4,] 0.2827911 0.1516815 1.7664099
## [5,] 1.6380746 -1.3831001 1.8825689
Specifying as.data.frame()
with the matrix output can turn it into a data frame.
replicate(n = 3,
expr = rnorm(n = 5, mean = 0, sd = 1),
simplify = TRUE ) %>%
as.data.frame() %>%
rename(sample_a=V1, sample_b=V2, sample_c=V3)
## sample_a sample_b sample_c
## 1 0.088180361 0.6301284 -1.3532818
## 2 1.542517588 1.5851132 -1.7461287
## 3 -0.294668964 0.7840522 -1.2377223
## 4 0.003613039 0.9598645 -0.2833658
## 5 -0.697784050 1.2285486 0.9498306
14.4 Part 4: repeatedly making whole data sets
This is combining parts 2 and 3 to repeatedly create and sample data sets, resulting in a list of many data sets.
= replicate(n = 3,
simlist expr = data.frame(group = rep(letters[1:2], each = 3),
response = rnorm(n = 6, mean = 0, sd = 1) ),
simplify = FALSE)
simlist
## [[1]]
## group response
## 1 A 1.3681197
## 2 A 0.1706618
## 3 A 0.1858557
## 4 B 2.2231589
## 5 B 0.8016017
## 6 B 0.6890034
##
## [[2]]
## group response
## 1 A 1.69969710
## 2 A 0.01731652
## 3 A 0.91465872
## 4 B -2.81221913
## 5 B -0.09187946
## 6 B -0.63539913
##
## [[3]]
## group response
## 1 A 0.5521644
## 2 A -0.3389562
## 3 A -0.2929999
## 4 B 0.2710472
## 5 B -1.3079854
## 6 B -1.8641999
14.5 Part 5: Using purrr
See this blog post
Kruschke, J. (2015). Goals, power, and sample size. In J. K. Kruschke (Ed.), Doing bayesian data analysis: A tutorial with r, jags, and stan (2nd ed., pp. 359-398). Academic Press.↩︎