1 Overview

The randomized controlled trial (RCT) is a key tool for any analyst. What separates the RCT from other statistical tools is its ability to establish causality. Whereas most statistical techniques can only tell us about correlations, the RCT can tell us about causation, i.e. whether A actually causes B rather than merely being associated with it. As such, the RCT is central to demonstrating what does and does not work for our farmers.

This lesson will outline the design principles of RCTs and how to ensure that a trial is effective and cost-efficient. You should consult both this lesson and AMP lesson 1 together before every trial to aid in decision making.

Note that not all trials will be full-fledged RCTs; there will also be simpler trials run in earlier phases to obtain estimates of logistical feasibility and/or impact. These are described here.

2 Glossary and key terms:

  • Intervention: An intervention is the product or program innovation you are testing in a trial, e.g. repayment incentives, or new seed varieties.

  • Co-variate: Other variables in a study that may correlate with your outcome. E.g. if you are measuring the effect of solar lighting on farmer repayment, then maize yield might be an important co-variate that influences the outcome.

  • Cluster: A group of individuals sharing some co-variates. An example at One Acre Fund would be a group or site.

  • Randomisation unit: The level at which randomization (and analysis) occurs; this could be the individual, the group, or the site (or another level).

  • Outliers: Data points which do not seem to follow the main distribution; note that identifying them is often very subjective.

  • Treatment group: Those individuals or clusters that have been assigned to an intervention

  • Control group: Those individuals or clusters that have been assigned to not receive an intervention

  • Minimum detectable effect: A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant (i.e. with a power of 0.8).

3 Objectives:

After this lesson you should:

  • Know the steps to take when designing an RCT and the order in which to follow them.

  • Understand what trials are possible given certain resource limitations and decide from there whether a randomized trial is appropriate for the question of interest.

We will break an RCT into its parts (hypothesis, randomization, and measurement) and examine what is required from each part to make an effective RCT. This lesson relies heavily on familiarity with the concepts explored in AMP lesson 1.

4 Principles of randomized testing

We might start thinking about a randomized controlled trial (RCT) based on a question or idea from a colleague. For example, we might have a hunch that an SMS system improves repayment. With a little bit of work we can take this question and turn it into a hypothesis and an RCT that evaluates the exact gain (or lack of gain) that results from the new SMS system.

To turn a vague idea into an RCT we first need to phrase our question as a hypothesis; we then need to work out our randomization strategy, our sample size, and finally our method of measurement.

4.1 Hypotheses

A hypothesis is a formal statement describing the relationship you want to test. A hypothesis must be a simple, clear and testable statement (more on testability below).

We re-phrase our question of “does an SMS system improve repayment” as two statements, a null hypothesis and an alternative hypothesis:

  • Null hypothesis (H0): The null hypothesis usually states that there is no difference between treatment and control groups. (To put this another way, we’re saying our treatment outcome will be statistically similar to our control outcome.)

  • Alternative hypothesis (H1): The alternative hypothesis states that there is a difference between treatment and control groups. (In other words, the treatment outcome will be statistically different from the control outcome.)

Notably, a hypothesis should include reference to the population under study (farmers, groups, sites), the intervention (SunKing Solar Lamps, Monsanto Fertilizer 31B, SMS repayment nudges), the comparison group (what is the alternative you are comparing to), the outcome (what will you measure) and the time (at what point will you measure it).

Population, Intervention, Comparison, Outcome, Time = PICOT. Remember PICOT when defining your hypotheses.

To give an example of a well formed hypothesis:

  • Null hypothesis: Farmers in Kakamega district that receive personalized SMS repayment nudges will not have higher end-of-season percentage repayment compared to farmers that received no SMS.

  • Alternative hypothesis: Farmers in Kakamega district that receive personalized SMS repayment nudges will have higher end-of-season percentage repayment compared to farmers that received no SMS.

Note that the above clearly states our PICOT:

  • Population: individual farmers in the Kakamega district

  • Intervention: personalized SMS repayment nudges

  • Comparison: farmers receiving no SMS

  • Outcome: percentage repayment

  • Time: end of season

We can contrast this with a poorly formed hypothesis that will not be much use in designing or analyzing an RCT:

  • H0: Ladybugs are not a good natural pesticide for treating aphid-infested plants

  • H1: Ladybugs are a good natural pesticide for treating aphid-infested plants

Why is this so bad? Take a moment to think before expanding the answer below:

4.1.1 Summary

  • An RCT requires a clear and well defined hypothesis that is testable. Often a hypothesis will start with a question (“What do SMS repayment nudges do to farmer repayment?”); it is the role of the analyst to convert this question into a robust hypothesis with an agreed-upon method of testing.

  • Once we have a hypothesis we need an estimate of the effect size we expect to see. This estimate is directly relevant to sample size calculations (remember, a sample size that yields a low power makes your results unreliable!). Our estimate can come from historical One Acre Fund data or from well-regarded literature (e.g. from peer-reviewed publications).

4.2 Randomisation in RCTs

Once we have a well formed hypothesis we can think about randomization strategies. To extend our example from above, we could randomize our farmers in two ways:

  • Randomly select farmers to receive a repayment related SMS or not

  • Allow farmers to choose whether to receive repayment reminders, and only send SMS to those farmers.

What would be the difference between these two setups? Before we answer this question, let us examine the reasons why we randomize in an RCT.

Randomization in an RCT serves two related purposes:

  • Distributing co-variates evenly

  • Eliminating bias

Co-variates are factors that might influence your outcome variable, for example, farmer location, soil type and farmer risk appetite. Note that some of these are observable and measurable (such as farmer location or soil type) and some of these are unobservable (such as risk appetite, which is difficult to measure).

Statistical bias arises when your sample is systematically different from your target population. In an RCT we assume our sample is representative of our population; deviations from this assumption can lead us to an incorrect understanding of our population and generate conclusions that look robust but are actually invalid!

Common forms of bias at this stage of RCT design (which also affect our co-variate distributions) are:

  • Randomization bias - bias due to poor randomization resulting in unbalanced Treatment/Control groups (e.g. district X is over-represented in our treatment farmers vs. our control farmers). This allows some co-variates to exert more influence in one group than another.

  • Selection bias - bias would also result if we were to allow farmers to assign themselves to Treatment/Control groups. This is because there may be unobservable co-variates that are associated with the Treatment/Control choice. For example, farmers with more risk appetite might select themselves into treatment groups more often, leading to our treatment farmers showing both a treatment effect and a risk appetite effect.

Both of these lead to what is called confounding bias: it becomes difficult to untangle effects that are due to poor randomization from effects that are due to the actual intervention.

However, if each participant has an equal chance of being randomly assigned to a Treatment/Control group then randomization will be free of bias. This will result in both observable (e.g. farmer location) and unobservable co-variates (such as risk appetite) being spread evenly across Treatment and Control groups. It is this spreading of co-variates that allows us to understand causality.

To revisit our example at the top of this section, our choice between two strategies:

  • Randomly select farmers to receive a repayment related SMS or not

  • Allow farmers to choose whether to receive repayment reminders, and only send SMS to those farmers.

Which of these now seems like a better strategy from a statistical point of view?

It is therefore important to assign treatment and control status completely at random; the best way to achieve this is to let R do the work for you, and I have included a function below that does exactly that. Please note the use of the set.seed function - we will be using random numbers to assign our groups to Treatment/Control status, and set.seed makes this randomness reproducible both by you and by another analyst.

If we draw three numbers randomly without setting the seed:

rnorm(1,50,10)
[1] 51.1744
rnorm(1,50,10)
[1] 52.15579
rnorm(1,50,10)
[1] 55.62233

Now if we randomly draw three numbers but set the seed to be identical each time:

set.seed(111)
rnorm(1,50,10)
[1] 52.35221
set.seed(111)
rnorm(1,50,10)
[1] 52.35221
set.seed(111)
rnorm(1,50,10)
[1] 52.35221

Let’s now look at the function to assign Treatment/Control status, which I have here called RCT_random:

#let's first make up a fake list of farmer IDs from 1 to 1000 and 1000 random values drawn from a normal dist. (called yield)
df = data.frame("OAFID"=seq(1,1000), "yield"=rnorm(1000))

#let's now make a function to do the work - you can copy-paste this function into your own scripts
# it needs to be given a dataframe and a vector of naming options
# options might be c("Treatment", "Control"), or if there are more than 2 arms, something like c("Control", "Treatment1", "Treatment2") or c("Control", "SunKing Home", "SunKing Pro")
#note: you will need to copy the function directly below for it to work in your code
RCT_random = function(dataframey, values_to_add){
  
  set.seed(111)
  # shuffle the row order and recycle the status labels across the shuffled rows
  dataframey$values_to_add[sample(1:nrow(dataframey), nrow(dataframey), FALSE)] <- rep(values_to_add)
  # rename the new column to "Status"
  colnames(dataframey)[which(colnames(dataframey)=="values_to_add")] = "Status"
  return(dataframey) }


# so this will take the dataframe called "df" and randomly assign each ROW to "Treatment" or "control"
df_new = RCT_random(df, c("Treatment","Control"))

# so this will take the dataframe called "df" and randomly assign each ROW to either "treatment1", "treatment2" or "control"
df_new2 = RCT_random(df, c("Treatment1","Treatment2", "Control"))

We have now taken our original dataframe:

head(df)
  OAFID      yield
1     1 -0.3307359
2     2 -0.3116238
3     3 -2.3023457
4     4 -0.1708760
5     5  0.1402782
6     6 -1.4974267

And randomly assigned T and C status to them with my function “RCT_random”:

head(df_new)
  OAFID      yield    Status
1     1 -0.3307359 Treatment
2     2 -0.3116238 Treatment
3     3 -2.3023457   Control
4     4 -0.1708760 Treatment
5     5  0.1402782   Control
6     6 -1.4974267   Control

We should now double-check our randomization to ensure it has proceeded as expected; to do this we can look at the distributions of the most important key variables. It is also possible to run an appropriate hypothesis test to assess whether the distributions are different. For these hypothesis tests we will set our P-value threshold to 0.01 (and not the 0.05 that is commonly used); we will look at why it is important to reduce the threshold in a later lesson.

Let’s now look at the summary statistics for Treatment and Control groups and make sure the “yield” variable is similar:

library(ggplot2)
ggplot(df_new, aes(x=yield, fill=Status)) + geom_density(alpha=.3) + xlab("Yield")

They look pretty similar! So far so good, randomization has been successful.
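
As a quick sketch of the hypothesis-test version of this check (using the simulated df_new from above), we could run a two-sample t-test on the key variable; remember that for balance checks we use the stricter 0.01 threshold:

#balance check: is the yield distribution similar in Treatment and Control?
t.test(yield ~ Status, data = df_new)

A p-value above 0.01 gives us no evidence of a difference between the arms, i.e. the randomization looks balanced for this variable (this does not prove the groups are identical, it just fails to find a difference).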

4.2.1 Quick quiz

Let’s look at some examples of randomization strategies below and try to decide whether they are proper or improper randomization:

  1. Farmers are allowed to decide whether to be in a treatment or control group

  2. Any farmer with a national ID number ending in an odd number is assigned to treatment, any farmer with an ID ending in an even number is assigned to control.

  3. Farmers east of the OAF office are assigned to control, farmers west of the OAF office are assigned to treatment

  4. We flip a coin to decide whether a farmer is control or treatment

Which of the above are truly randomized? Take a moment to think about this before expanding the answer below.

4.2.2 Summary

  • We must properly randomize farmers into treatment and control groups to eliminate confounding bias and ensure that the sample we have (which we actually observe) is representative of the population we care about. Furthermore, we must distribute co-variates equally to be able to draw conclusions about causality.

  • We achieve proper randomization by giving each and every farmer (or group, or site) an equal probability of being assigned to a treatment group. The best way to do this is to let R do the work for you, e.g. by using my RCT_random function above.

4.2.3 Additional materials

4.3 Cluster randomised trials

Sometimes, we won’t be able to randomize at the individual farmer level but we will still want to measure at that level. For example, if we wanted to assess the impact of FO incentives on farmer repayment then it would not be possible to randomize at the farmer or group level (as farmers and groups share FOs); we would instead need to randomize at the FO level whilst still measuring our farmer-level outcome (repayment).

The process of randomizing at one level but measuring at another causes complications in our RCT design. Specifically, the inherent similarity of farmers sharing an FO (in the above example) will make our distributions appear narrower than they really are and lead us to under-estimate our errors. This, in turn, will have knock-on effects on our study power and our study results (we’ll have a higher rate of false positives). The tendency of individuals treated in groups to behave similarly is known as clustering, and our solution is to use a cluster randomized trial.

A cluster randomized trial is similar to an RCT but our unit of randomization becomes the cluster rather than the individual (so in the above example, the FO/site is the cluster). We can measure clustering with the intra-cluster correlation or ICC which will tell us how correlated the responses of individuals within a cluster are. ICC runs from 0 (no correlation between members of the same cluster) to 1.0 (complete correlation).

A non-zero ICC can cause our distributions to seem narrower than they really are; this, in turn, will have knock-on effects on our statistical power, coefficients, confidence intervals and p-values. It essentially leads to false confidence in our results. You should be able to rationalize why under-estimating our errors leads to these effects using the concepts in AMP lesson 1.
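
To make this concrete, here is a rough simulation (an illustrative sketch with made-up numbers, not One Acre Fund data) of what happens if we randomize at the cluster level but naively analyse at the farmer level when there is no true treatment effect:

set.seed(42)
n_sims <- 1000
false_pos <- 0
for(i in 1:n_sims){
  # 20 clusters of 10 farmers each; a shared cluster effect creates a non-zero ICC
  cluster_effect <- rnorm(20, 0, 5)
  yield <- rnorm(200, 50, 5) + rep(cluster_effect, each = 10)
  # randomize at the cluster level: 10 clusters to Treatment, 10 to Control (no true effect)
  status <- rep(sample(rep(c("Treatment", "Control"), 10)), each = 10)
  # naive farmer-level t-test that ignores the clustering
  if(t.test(yield ~ status)$p.value < 0.05) false_pos <- false_pos + 1
}
false_pos/n_sims  # far higher than the nominal 5% false positive rate

Even though there is no real difference between the arms, the naive farmer-level test rejects H0 far more often than the 5% we asked for.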

**Before designing a trial it is important to consider whether an RCT or a cluster trial is more appropriate for your hypothesis. If you are unable to randomize at the individual level (but still want to measure at that level) then a cluster trial is likely the correct choice.**

It is possible to calculate the ICC before a trial using historical data on your outcome variable. This estimated ICC will enable you to adjust your methods as necessary to produce a robust trial.

Let’s first look at how we can calculate ICCs:

# make some fake data

#this data will have a farmer ID, a maize yield (our variable of interest), the group name and the district name
df = data.frame(FarmerID=seq(1,100), Maize_yield=rnorm(100,500,50),  group_name=c("farmers_first","farmers_second","farmers_third","farmers_last"), District=c("A","B"))

library(knitr)
kable(df[1:5,], format="markdown", align="c")
 FarmerID   Maize_yield     group_name     District
     1        569.7911    farmers_first       A
     2        455.4007    farmers_second      B
     3        553.2515    farmers_third       A
     4        540.0405    farmers_last        B
     5        481.5963    farmers_first       A

We can calculate the ICC using the snippet of code below:

# this is my function. It will calculate a mean ICC and its confidence interval
# it needs two character strings: the cluster-level column name and the outcome variable column name (both in quotes)
# it also needs the dataframe which contains those columns
ICC_CI <- function(cluster_level, outcomevar, dataf){
  
  #load library
  require(ICC)
  set.seed(123)
  # take 50 random subsamples of 66% of the data and calculate the ICC for each
  si = round(dim(dataf)[1]*0.66)
  values_f <- c()
  for(i in seq(1:50)){
    samp_f = dataf[sample(nrow(dataf), si), ]
    x_f = ICCbare(cluster_level, outcomevar, samp_f)
    values_f <- c(values_f, x_f)
  }
  # note that 1.96 standard deviations = 95% confidence interval bounds in a normal dist.
  ret = data.frame("Mean ICC" = round(mean(values_f, na.rm=TRUE),3), "CI" = round(1.96*sd(values_f, na.rm=TRUE),3))
  ret$Significant = ifelse(ret$Mean.ICC > ret$CI, "Y", "N")
  return(ret)
}
#store the result of the ICC function and print it
stored_ICC <- ICC_CI("group_name", "Maize_yield", df)
Loading required package: ICC
stored_ICC
  Mean.ICC    CI Significant
1    0.022 0.081           N

We can see from this calculation that our ICC between farmers in the same lending group is 0.022 ± 0.081. As the confidence interval (CI) crosses zero, we can see that this is not significant (hence “Significant” = “N”).

We can do a similar calculation for district level ICC:

ICC_CI("District", "Maize_yield", df)
  Mean.ICC    CI Significant
1    0.034 0.092           N

Once we know our ICC (and whether it is significant) we have two options, which will affect our RCT planning and analysis.

In the RCT planning stage, we want to use ICC to calculate a sample size. The larger the ICC, the less value we gain by interviewing individual farmers in clusters (as they are highly similar). This means that as the ICC increases, we have to add clusters, not farmers, in order to increase our effective sample size. As a trial planner, the implication of this is that we’d have to involve more sites and more FOs in order to run a good trial, rather than interviewing more farmers per FO.

As with any sample size and effect size, we have to ask ourselves if the resource expenditure is justified. Again, as ICC increases in a clustered RCT context, we’ll need to add clusters, not farmers, in order to achieve a usable sample. If you are running a clustered RCT and your outcome data is coming through Roster, you can also consider abandoning the cluster RCT approach (where we randomize at cluster level and measure at farmer level) and instead aggregating the individual data by cluster to obtain cluster-level data. Now our unit of randomization (the cluster) is the same as our unit of measurement (the cluster) and we don’t have to worry about ICC or clustering our standard errors. The implication of this choice, though, is that our sample size becomes the number of clusters exclusively. If before we had 20 clusters with 5 farmers per cluster and we aggregate our data, we’d now only have a sample size of 20.

I have compared the options of summarizing data by cluster and correcting sample size for ICC below:

4.3.0.1 Option 1: Summarize data

Option 1: Calculate a summary metric for each cluster (e.g. the cluster mean). Each cluster then provides only one data point, which allows us to keep the assumption that our data are independent; we can then proceed with standard statistical tools (e.g. the t-test) as normal.

So if we have 500 farmers in 45 groups, we end up with 45 data points. This means that our power, sample size and analysis calculations also need to be carried out at the cluster level. It also means that we can simply analyse our data at the cluster level and ignore ICC from here on out.

If we wanted to use this option then we would need to summarize our data by cluster. Let’s work through this with some simulated data.

Our simulated data has 100 rows in total:

head(df)
##   FarmerID Maize_yield     group_name District
## 1        1    569.7911  farmers_first        A
## 2        2    455.4007 farmers_second        B
## 3        3    553.2515  farmers_third        A
## 4        4    540.0405   farmers_last        B
## 5        5    481.5963  farmers_first        A
## 6        6    513.6648 farmers_second        B

Now let us summarize it by the cluster, which here is group_name. I use the dplyr library to summarize the data (cheatsheet here). A couple notes on the code:

  • We need to load dplyr to access the %>% operator.
  • We call %>% “piping”. We can pipe an object to a new function to simplify work and make the code easier to read.
  • We indent after each %>% to create a stack of actions that we are applying to the data. To learn more, read the dplyr vignette.
  • So below we are taking df and piping it to a function that will group our data by group_name. Once we have grouped our data we will pipe it again to summarise_each(), which will calculate the mean of every column in our dataframe. This final object is then saved as sum_df.
library(dplyr) 


sum_df <- df %>% group_by(group_name) %>% summarise_each(funs(mean))

#kable makes a pretty table in r markdown HTMLs
kable(sum_df, format="markdown", align="c")
   group_name     FarmerID   Maize_yield   District
 farmers_first       49        491.3433       NA
 farmers_last        52        500.4730       NA
 farmers_second      50        517.1783       NA
 farmers_third       51        498.8382       NA

We now have the average maize yield for each of our four groups (“farmers_first”, “farmers_second”, “farmers_third”, “farmers_last”). Note that we also have the “average” ID number, which is obviously nonsense; also note that our non-numeric columns (like District) have been replaced with NAs (as we cannot average characters). Finally, be aware that if one of our numeric columns were stored as class “character” or class “factor” rather than class “numeric”, it would also fail to be averaged.
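
If that happens, a quick illustrative fix is to coerce the column back to numeric before summarizing (assuming the values really are numbers stored as text):

#illustrative only: convert a character/factor column back to numeric before averaging
df$Maize_yield <- as.numeric(as.character(df$Maize_yield))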

Note that we can also manually manipulate our data if we want to specifically create new metrics and preserve character/factors:

#dplyr is already loaded above

#this will take df and group it by "group_name". It will then create a new metric called "average_yield" which is the mean of Maize_yield. Likewise it will create a column called "stdev_yield" holding our standard deviations, and "Districts" holding the unique district for each group

sum_df2 <- df %>%
  group_by(group_name) %>%
  summarize(
    average_yield= mean(Maize_yield), 
    stdev_yield= sd(Maize_yield), Districts = unique(District)
  ) %>% 
  as.data.frame()

kable(head(sum_df2))
   group_name     average_yield   stdev_yield   Districts
 farmers_first       491.3433       48.54526        A
 farmers_last        500.4730       46.27466        B
 farmers_second      517.1783       44.13352        B
 farmers_third       498.8382       44.63683        A

We would now use this new data frame to calculate our sample size and power, as examined in AMP lesson 1. If we were doing this post-analysis then we would use this dataframe to conduct our hypothesis tests and regressions.
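
For example, a minimal sketch of a cluster-level sample size calculation with the pwr package (the effect size of 0.4 here is purely an assumption for illustration):

library(pwr)

#illustrative only: with cluster-level data the units are clusters, so this is clusters per arm
pwr.t.test(d = 0.4, sig.level = 0.05, power = 0.8)$n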

4.3.0.2 Option 2: Use the ICC

Use the ICC with the number of individuals to work out our new sample size.

For option 2, we will continue to calculate power, sample size and analysis metrics at the individual level but with some corrections to account for the falsely narrow distributions.

For our sample size, we will inflate it with the ICC_correction function below:

ICC_correction <- function(samplesize, num_clusters, ICC_estimate){
  
  # design effect = 1 + (average cluster size - 1) * ICC
  average_cluster_size = samplesize/num_clusters
  factor_inflate = 1 + (average_cluster_size - 1) * ICC_estimate
  
  return(data.frame("New sample size"=round(samplesize*factor_inflate), "Old sample size"=samplesize))
}

ICC_correction(200, 50, 0.2)
##   New.sample.size Old.sample.size
## 1             320             200
new_sample_size <- ICC_correction(200, 50, 0.2)$New.sample.size

So an initial sample size of 200 farmers, spread over 50 clusters with an ICC of 0.2, would lead to a new sample size of 320.

Note that adding additional clusters (rather than new farmers in existing clusters) is a more efficient way to increase statistical power without massive adjustments to sample size. For example:

scenario1 <- ICC_correction(200,20,0.2)$New.sample.size #average 10 farmers per cluster, and  20 clusters

scenario2 <- ICC_correction(200,40,0.2)$New.sample.size # average 5 farmers per cluster, and 40 clusters

We can see that halving the average cluster size (by doubling the number of clusters from 20 to 40) reduces the corrected sample size from scenario1 = 560 to scenario2 = 360.

I have made a function below that will plot the relationship between adding farmers from new clusters vs. adding them to existing clusters. The function takes the original (non-corrected) sample size, the number of clusters, the ICC, the number of farmers added per step, and the number of steps as arguments, and then plots the scenario where farmers are added to existing clusters vs. new clusters. You can use this function to examine the expected gains from adding farmers at various levels, and to gauge how to cost-effectively increase sample size.

# starting_size is the uncorrected sample size
# nclusters is the total number of clusters
# ICCval is the calculated ICC
# add_size is the number of farmers added at each step (each new cluster is this size)
# maxim is the number of farmer-addition steps to plot

ICC_groups <- function(starting_size, nclusters, ICCval, add_size, maxim){
  
  add_c <- c()
  expand_c <-c()
  
  xo = seq(1:maxim)*add_size
  for(i in seq(1:maxim)) {
    i=as.numeric(i)
    starting_size  <- as.numeric(starting_size)
    nclusters <- as.numeric(nclusters)
    sizes = starting_size+(i*add_size)
    clustersplus = nclusters+i
    add_c[[i]] <- ICC_correction(sizes,nclusters+i,ICCval)$New.sample.size
    expand_c[[i]] <- ICC_correction(sizes,nclusters,ICCval)$New.sample.size
  }
  
  library(ggplot2)
print(ggplot() + geom_line(aes(x=xo,y=add_c,colour="Clusters")) + geom_line(aes(x=xo,y=expand_c,colour="Individuals")) + xlab("Added farmers")+ scale_colour_manual("", breaks = c("Clusters", "Individuals"),values = c("red", "blue")) + ylab("Corrected sample size") + ggtitle("Sample size inflation") )

  return(list(add_c,expand_c))
}




y = ICC_groups(200,40,0.1,5, 50)

In general, we can see that adding new clusters, rather than new farmers to existing clusters, is much more efficient.

In later lessons, we will look at how to robustly analyse data with ICC.

4.3.1 Summary

  • Cluster-level effects can compromise an RCT. It is important to randomize and analyze at the correct level!

  • You can estimate the ICC using historical data and use my functions to correct sample sizes.

  • Whenever you are not randomizing at an individual level, you need to be aware of ICC!

4.3.1.1 Quick quiz

Use the data set here to answer the following questions. Only expand my answer and code once you’ve had a go:

  • What is the ICC for farmer repayment at the group level?

  • Using this ICC, if my initial power calculation indicated a sample size of 200, what would the new (ICC corrected) sample size be? Note that you will need the average cluster size for this calculation!

Only expand the below once you’ve had a go; please email your code to me (Mike.Barber@).

bin <- read.csv("dataset.csv")

#get ICC
ICC_CI("unique_ID","Repaid_percent",bin)

# get group sizes
x <- bin %>%
  group_by(unique_ID) %>%
  summarize(
    len= length(OAFID)) 



# number of clusters = total sample size / average cluster size
ICC_correction(200, 200/mean(x$len), 0.17)

4.4 Setting a p-value threshold for analysis

Once we have a testable hypothesis and a method for measurement (see below), we will use a hypothesis test to detect differences in the underlying populations represented by our control and treatment samples (see AMP lesson 1). A hypothesis test (of which there are many) will yield a P-value: the probability that data as extreme as ours could have been generated purely by chance if H0 were true. Rejecting H0 on the basis of a small P-value when H0 is in fact true is a false positive result.

When using a hypothesis test we must set an acceptable rate of false positives, also known as the p-value threshold or alpha level. The most common p-value threshold is 0.05. This means that we are willing to accept a 5% risk of generating a false positive and wrongly concluding that there is a difference between our treatments when in fact there is not.

We can see this below: if we repeatedly test two samples drawn from identical populations, then 5% of the time we will mistakenly identify the populations as different:

plist = c() # empty list 
#simulate many many draws of samples from a population
for(i in seq(1:100000)){
  r1 = rnorm(200,50,10)
  r2 = rnorm(200,50,10)
  p = t.test(r1,r2,var.equal = TRUE)$p.value #calculate a pval
  plist = c(p,plist) # append the pval to the list each time
  
}

#lets plot a very specific histogram
#i want to plot my list of pvalues, but I want each bin to be 0.05 long. So we should end up with 20 bins evenly spaced from 0 to 1
h <-  hist(plist,  breaks=seq(0, 1, by=0.05))
#now I want the y axis to be %, so we will divide the count of each bar by the total count
h$density = h$counts/sum(h$counts)*100
# now lets plot it 
plot(h,freq=FALSE, main="P-values derived from 2 identical populations", xlab="P-value", ylab="Frequency (%)", col="black", border="white", labels=c("<0.05", rep(NA,19)), ylim=c(0,6) )

In some cases we might want to set a threshold of 0.01 (1%) or 0.1 (10%). Generally speaking, the more unwilling we are to be incorrect, the lower the threshold. So for an intervention that might have adverse effects on farmers, we would want to be very sure of any positive effects (a 0.01 threshold) while being quick to flag possible negative effects (a 0.1 threshold). Note that the lower our threshold, the larger the sample size needed to detect any given effect.
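
As a quick illustration with the pwr package (the effect size of 0.3 is just an assumption for the example), tightening the threshold from 0.05 to 0.01 substantially increases the required sample size:

library(pwr)

#illustrative only: sample size per group for an assumed effect size of 0.3
pwr.t.test(d = 0.3, sig.level = 0.05, power = 0.8)$n  # roughly 175 per group
pwr.t.test(d = 0.3, sig.level = 0.01, power = 0.8)$n  # roughly 260 per group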

It is possible to skip a hypothesis test and proceed directly to regression results (lesson 3). However, in an RCT it is a good sanity check to make sure that our regression results match our hypothesis test. If these results do not match then it suggests that our RCT was designed poorly. We will cover hypothesis testing in analysis next lesson; for now it is enough to know that you must specify your p-value threshold before a trial is executed.

4.4.1 Summary

  • We must also set a threshold for our P-value. This threshold represents the level of risk we are willing to accept in being wrong. Note that being more stringent (i.e. lower thresholds) will require a larger sample size!

  • Note that this p-value threshold will affect our power and sample size calculations! The pwr library we saw in lesson 1 will take the p-value threshold (aka alpha level) as an argument. A lower p-value threshold requires a larger sample size!

4.5 Sample size calculations in practice

Combining the materials from this lesson and AMP lesson 1 gives us a good theoretical foundation for sample size calculations. However a common question I want to address here is: “where do I get the data from for a sample size calculation?”. We can use the SMS trial example from above to understand how to answer this question.

The two main pieces of information needed before a sample size calculation can be carried out are:

  • Historical data
  • Estimates of intervention effect

Historical data might be provided by Roster, or from survey baseline data. If those are unavailable then you might simulate data based on metrics from peer-reviewed literature. The estimate of intervention effect is a little more difficult, and there are two ways we can approach it:

4.5.0.1 Estimates from phase 0 or phase 1 trials

The first of these involves literature review and/or a small pilot study to estimate the differences between treatment and control groups. Note that if you are using phase 1 data, then the estimate of effect size is likely to be very rough, so I recommend halving the effect size to be conservative and using that for calculations. If you are using literature values then again, treat them cautiously and consider any literature value to be an upper estimate of effect size (the reasoning here will be explained in lesson 6).
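
For instance, a minimal sketch of this conservative adjustment (the pilot estimate of d = 0.4 is purely illustrative):

library(pwr)

#illustrative only: a phase 1 pilot suggests an effect size (Cohen's d) of 0.4;
#halve it to be conservative before calculating the sample size per group
pwr.t.test(d = 0.4/2, sig.level = 0.05, power = 0.8)$n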

4.5.0.2 Minimum detectable effect (MDE)

The MDE approach asks “what is the minimum effect that I would need to see for the intervention to be worthwhile?”; we then set the effect size to that value. The reasoning here is that if the return on investment (ROI) of a proposed intervention is negative (or very small) then we don’t need to be able to precisely measure it to make a decision. For example, if we trial the effect of solar lamps on farmer expenditure, then it might be that only a reduction of $5 or greater per month is worthwhile, and anything below $5 would be a negative investment. In this case, we can use $5 as our MDE and $5 as our effect size for calculations. This rephrases our effect size and RCT to: “we will have an X% chance of detecting an effect of $5 or greater” (where X is the power with an MDE of $5).

The MDE approach can be very powerful for product innovation trials where we only care whether a product alters the treatment group by a certain magnitude. Use it wisely during both pre-analysis (to estimate sample size) and post-analysis (AMP lesson 3) to say what size of effect we would have been able to detect with a power of 0.8. You can see a worked example below for pre-analysis MDE; I have also included a function plot_MDE which you can use on your own data.

#make some fake data, let's pretend its baseline data

dat <- data.frame("av monthly expenses" = rnorm(1000,100,10))



#this function will plot MDE for various differences
# differs is a list of intervention effects that you want to consider
plot_MDE <- function(historical_data, differs){
  
  #initialise empty vec
p <- c()


#remember our effect size function from lesson 1?
cohen_d <- function(d1,d2) {  
  m1 <- mean(d1, na.rm=TRUE)
  m2 <- mean(d2, na.rm=TRUE)
  s1 <- sd(d1, na.rm=TRUE)
  s2 <- sd(d2, na.rm=TRUE)
  spo <- sqrt((s1**2 + s2**2)/2)
  d <- (m1 - m2)/spo
  rpb <- d / sqrt((d**2)+4)
  ret <- list("rpb" = rpb, "effectsi" = d)
  return(ret)  } 

#load libs
require(pwr)


for(i in seq(1:length(differs) ) ) {
  samp1 <- historical_data
  xnu = differs[[i]]
  #simulate the intervention group: shift the baseline data by the assumed effect, plus some noise
    samp2 <- samp1 + rnorm(length(samp1), xnu, xnu/10)
  inp <- cohen_d(samp1, samp2)

  
  p[i] <- pwr.2p.test(h=inp$effectsi , sig.level=0.05, power=0.8, n=NULL)$n
  
}

require(ggplot2)
print(ggplot() + geom_point(aes(x=p, y= differs), size=3, color="blue", shape=1) + geom_line(aes(x=p, y=differs), size=1.2, color="blue") + xlab("Sample size")+ ylab("MDE") + ggtitle("Minimum detectable effect vs. sample size"))

library(knitr)
mde_tab = data.frame("MDE"=differs, "Sample size"=p)
kable(mde_tab, digits=2) 


}

#set some differences for the loop, here, 1,2,5 (etc) are dollar increases
diffs <- c(1,2,5,7.5,10,15,20,25)

#get histo data
histo <- dat$av.monthly.expenses

#plot
plot_MDE(histo, diffs)
Loading required package: pwr

  MDE   Sample.size
  1.0       1619.61
  2.0        401.55
  5.0         64.28
  7.5         28.58
 10.0         16.25
 15.0          7.22
 20.0          4.07
 25.0          2.66
#table

Note that this function will give you the Y-axis in whatever units you gave to the function (in this case, US dollars). You can use the MDE to calculate an initial sample size and then use the ICC correction to obtain the ICC-corrected sample size for that MDE.

Note also that you can use the MDE approach for adoption rates; here the differs argument would consist of the increases in adoption rate to be plotted.
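
For a binary outcome like adoption, a minimal sketch of the equivalent calculation (the 60% baseline adoption rate and the 5 percentage point MDE are assumptions for illustration) uses pwr's effect size for proportions:

library(pwr)

#illustrative only: baseline adoption of 60%, MDE of a 5 percentage point increase
h <- ES.h(0.65, 0.60)
pwr.2p.test(h = h, sig.level = 0.05, power = 0.8)$n  # sample size per group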

4.5.1 Summary

  • The MDE is a powerful tool for prioritizing trials and calculating ROI- and impact-savvy sample sizes

  • See this article for more details on MDEs

4.6 Measurement

The final part of the RCT is the measurement itself. This is often a neglected part of RCT design and therefore usually a major source of later analysis issues. Any hypothesis needs to have a good measurement strategy to be useful.

I want to take a tangent to describe some analysis I did on an RCT looking at airtime usage in Malawi for a telecommunications company. The RCT examined the effect of an intervention on airtime usage. During the RCT, we asked people to recall how much money they had spent on airtime in the last month; we also had data from the telecommunications company on the actual amount spent by the same customers. When we compared the two metrics, we found there was very little correlation! Furthermore, when we looked at who was most likely to over-estimate their airtime spend, we found it was young, urban males.

Now consider - had we only had the self-reported data, we would have thought that young urban men were big spenders on airtime, drawn many conclusions from this, and found many “statistically significant” (in terms of P-values of < 0.05) relationships! We would have made many recommendations to the telecommunications company and all our data would have held up to statistical interrogation. In short - there would have been very little way to know we were wrong, but we would have been wrong nonetheless!

The moral of this story is that the best statistics in the world will not save a trial from poor measurement. The other moral is that self-reported data is often terrible, as it is influenced by how people want to be perceived (perhaps by the interviewer, by the community, or by their own embarrassment about “low” airtime spends).

Now, if we were trying to understand the effect of increased yield on farmer food spending, how might we go about it?

Our first ideas might be:

  • Weekly enumerator visits to trial participants to ask them to recall their last week’s spend on food.

  • A daily spend diary kept by the farmers outlining where they spent their money each day; we then collect the diary after the trial.

But both of these ideas might suffer from the same bias as my airtime study anecdote. This kind of bias is known as measurement bias and might arise because some farmers are more forgetful than others, or because some farmers feel social pressure to inflate/deflate their numbers. Getting meaningful data in this situation requires some careful thought.

How would you improve this trial so the data was more robust (feel free to email ideas to me at Mike.Barber@)?

My thoughts so far:

  • We could try to get mobile money data from the farmer or from a third-party (in countries where mobile money is common)

  • We could have the enumerator visit farmers twice and then see if the answers match

  • We could compare answers given to enumerators with third-party data, such as government census data.

There would obviously be trade-offs here between statistical robustness and the expense of a study; hiring enumerators to visit every farmer twice would double their workload. The role of the analyst is to be cognizant of these difficulties, but also to advocate for statistical robustness.

There are no easy answers to this question, and the solution is likely to be highly dependent on the question being asked and the country it is asked in. You can seek advice from:

  • your local M&E or PI colleagues to learn about previous questions and measurement methodologies
  • the PRB (peer.review@) to access centralized expertise on sound measurement practices
  • reviews (and meta-analyses especially) of RCT literature

It is vital to spend some time thinking about the caveats and weaknesses of measurement and then to try to mitigate those weaknesses as much as possible.

4.6.1 Pre-test, Pre-test and Pre-test again

Once we have a robust question, and a strategy to measure answers, we will want to pre-test the survey.

I cannot emphasize the importance of pre-testing questions enough. Pre-testing is the process through which we draft questions, ask real farmers those questions and assess the extent to which the question elicited the intended response. When pre-testing, we want to consider:

  • Are the questions on the survey understood easily by farmers?

  • Are the questions interpreted the same way every time? e.g. “How much fertilizer do you use?” can be answered in terms of Kgs or money. We want to make sure we only get one type of answer!

  • Do we want to provide limits on acceptable answers? e.g. “How many acres do you own?” with an answer of “100” is likely to be an error. We can set limits on answers to prevent these sorts of errors.

Pre-testing allows us to identify questions that seem clear to us as analysts but are understood completely differently by farmers. Pre-testing will also allow us to derive a list of responses that is comprehensive. So if we ask:

  • “What livestock do you keep?”

We can make sure that the options (cows, pigs, goats, chickens, other) are available to minimize the amount of “other” that we get.

The final key to great measurement is knowing how the data will be collected and stored. Will you send out paper surveys? If so what happens to the surveys after data entry, and how do we double-check data entry? Once we have the electronic data, where will it be stored so that it is accessible to future One Acre Fund staff? If you are collecting data on a tablet, will you use CommCare?

At a minimum, I would expect all data to be well organized and available on the One Acre Fund Google drive folder. Likewise I would expect all analysis scripts to be deposited in the same folder and clearly linked to the raw data (we will cover this more in future lessons).

Once we have a strong hypothesis and a well defined measurement strategy, we will want to make sure that the data we receive is reliable and robust; this will require periodic data checking during the early stages of a trial. You can see Matt Lowe’s notes on data scrutiny here.

4.6.2 Summary

  • Think about how to get robust and reliable data. Think about how to make sure we can trust the data.

  • If possible, test your hypothesis with multiple measurements of the same object, e.g. self-reported spend and data from mobile money platforms.

  • Pre-test surveys to find bugs and errors! Even if you think it is perfect, pre-test!

  • Have a clear plan for where to store raw data and R/Stata scripts. We need this to be accessible to other One Acre Fund analysts to preserve your work for future staff!

  • Remember you can use the Peer Review Board ( peer.review@ ) to talk to someone about good measurement practices.

  • Use CommCare validation conditions to ensure that the data collected is reliable!

4.7 Data checking

During the course of the RCT it is important to periodically check data as it comes in. This will enable you to identify any issues early on and (hopefully) fix them before they become chronic. During data checking we are looking to see whether our survey is capturing the data expected. Some real One Acre Fund examples of where this has been useful:

  • Land size measurements by enumerators showing a total acreage of 0.001 acres. This is likely an enumeration or data validation error.

  • Questions about fertilizer use being answered both in kilograms and Kenyan shillings.

In these cases we are really inquiring about outliers. An outlier can either be a true outlier (i.e. maybe a farmer really does have 0.001 acres) or a data entry error by an enumerator. We are concerned about both here. Note also that it is difficult to tell the difference between a true outlier and a data entry error without following up with the farmer/enumerator.

An outlier is identified by lying outside the main distribution. We can visualize data distributions and outliers with a beeswarm plot, note the potential outliers at the top of the plot:

set.seed(112)
#make some data
df = data.frame(ID=seq(1,100), vals=rnorm(100,100,100))
#add some outliers
df$vals[98:100] = rnorm(3,1000,1000)


library(ggbeeswarm)
ggplot(df) +  geom_beeswarm(aes(x=1,y=vals)) + ggtitle("Beeswarm of simulated data") 

We can also try to define outliers mathematically: the boxplot uses a simple formula to identify potential outliers, and I have included a function below, called find_outliers, that uses the same logic to find outliers in numerical values from many distribution types. The function uses the inter-quartile range to define a cutoff. It then presents the suspected outliers to the user so that a decision can be made on what to do:

#this function expects a vector of numeric values
find_outliers <- function(data_to_test){
  
      # inter-quartile range = Q3 - Q1
      IQR <- quantile(data_to_test, 0.75, na.rm=TRUE)[[1]] - quantile(data_to_test, 0.25, na.rm=TRUE)[[1]]
      # values more than 1.5*IQR below Q1 or above Q3 are flagged as potential outliers
      cutoff = subset(data_to_test, data_to_test <= quantile(data_to_test, 0.25, na.rm=TRUE)-(IQR*1.5) )
      cutoff2 = subset(data_to_test, data_to_test >= quantile(data_to_test, 0.75, na.rm=TRUE)+(IQR*1.5) )
      ret = c(cutoff,cutoff2)
      return(ret)
}

find_outliers(df$vals)
## [1] 2111.791 1077.596 1805.064
#use this to find the index position of these outliers
pos <- which(df$vals %in% find_outliers(df$vals) )
df[pos,]
##      ID     vals
## 98   98 2111.791
## 99   99 1077.596
## 100 100 1805.064

Note that we can apply this function to many columns in a dataframe with the below code. In practice we would want to limit these checks to our key variables to prevent field follow-ups (which are expensive) for minor variables:

lapply(df[c("vals")], find_outliers)
#you can add column names to the list above (e.g. c("vals","vals2")) to check more than one column

These tools can help us identify potential outliers but cannot definitively tell us whether a data point is real or not. When we assign something as an outlier we are making a big assumption about the underlying distribution of data and what we think it should look like. There is always a danger that our assumption is wrong or that we have biases leading us to incorrectly label things as outliers.

If we are confident that a value is mistaken (and not just a true extreme) then we can replace that value with NA, to indicate that it is missing. This will mean subsequent calculations will be done without that value. Note that this should be done in a separate script to ensure that future analysts can easily see the raw data, your cleaning steps, the cleaned data and then the final results.
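
For example, a minimal sketch (using the pos index found above) of how a cleaning script might flag confirmed errors as missing:

#illustrative only: keep the raw data untouched and do the cleaning on a copy
df_clean <- df
df_clean$vals[pos] <- NA

#subsequent calculations then skip the missing values
mean(df_clean$vals, na.rm = TRUE)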

Outlier detection will often require contextual knowledge to identify true outliers, for example, it is highly likely that a farmer reporting 10 000 Kg of fertilizer per acre is actually a case of enumerator error.

You can read a much more thorough treatment of outliers here. I strongly recommend you read this after the lesson materials.

4.7.1 Summary

  • Data checking ensures that come analysis time, our data is free from outliers and issues

  • Outliers are often difficult to define - remember the 68-95-99.7 rule from lesson 1. In a normal distribution of values we expect 68% of the values to be within 1 stdev of the mean. 95% to be within 2 stdev and 99.7% to be within 3 stdev of the mean. This means we can have fairly extreme values that are still well within what we expect for a normal distribution.

  • Set-up data checking scripts in advance of the RCT. Running an RCT is a busy and hectic time so having these pre-prepared will make life easier and reduce the errors introduced by doing this under stress!

4.8 Lesson Summary

We have now seen the key parts of an RCT:

  • Hypothesis: A good hypothesis is testable and measurable. It must have clearly defined evaluation criteria (e.g. are we measuring the average or the median? At the individual or group level?).

  • Randomization: We must randomize using R and ensure our Treatment and Control groups are balanced. We can use my RCT_random function to achieve this. We must also remember cluster effects and use the correct unit of randomization. It is possible to calculate the ICC and adjust the sample size accordingly; it is also fair to assume an ICC of 1.0 and calculate the sample size from the number of clusters (i.e. count groups rather than individuals).

  • Power: Know your statistical power, sample size and MDE before and after a trial - see AMP lesson 1.

  • Measurement: Measurements must be robust and reliable. Think about the ways in which we might be inaccurate and try to minimize reliance on self-reported metrics in any study you design!

  • Pre-test: Pre-test all questions with real farmers to understand how they interpret your survey and to look for difficulties. Remember - a poor survey is due to the analyst’s bad communication, not the farmers’ bad interpretation!

Combining lesson 1 (distributions, powers, p-values) and lesson 2 (RCT principles) should now mean you are able to start designing RCTs! Our next few lessons will look at how to analyse data effectively using hypothesis testing and regressions.

5 Homework task

One Acre Fund wants to trial a new, expensive, maize fertilizer. This fertilizer has not been tested in any OAF country yet and you are the first analyst attempting to test the fertilizer’s impact on farmer yields. The fertilizer costs $42 per Kg, compared to our current fertilizer which costs $14 per Kg. Due to logistical constraints, this trial needs to be randomized at the FO level.

Using the data here, write a short (1-2 page) summary outlining a proposal for this trial to submit to the ART Peer Review Board (peer.review@, a body that helps with trial design). You can submit either a Word document, a PDF, or an R Markdown-produced file if you are comfortable with that.

The outline should include:

  • Statement of hypothesis
  • Clear and succinct illustration of the data using the techniques covered in lesson 1 (e.g. beeswarms, density plots, histograms, etc.)
  • Selection of a significance level and rationale
  • Selection of an MDE level and rationale
  • MDE and power curves
  • ICC calculations and proposed randomization strategy

Once you are happy with your outline, please email it (along with the code) to Mike.Barber@.

The following cheatsheets might help you:

6 Next lesson:

Regressions, mixed effect modelling and hypothesis testing, here