1 AMP lesson: RCT analysis I

1.1 Lesson objectives:

Strong analysis of an RCT is vital to our mission to improve the livelihoods of small-holder farmers. It is through robust RCT analysis that we are able to determine the effectiveness of proposed interventions and make decisions whether to scale them or not. This means that RCT analysis is one of the most important tasks at One Acre Fund and should be treated with the care it deserves.

In this lesson we will work through an example analysis of real OAF data, conducting hypothesis tests and regression analysis on data recorded at the farmer level but clustered at the Site level. We will specifically focus on continuous variables in this lesson before moving on to categorical variables in lesson 4.

The OAF trial I have chosen is the non-compulsory maize trial conducted in Kenya in 2017.

We will here focus on only 4 of the hypotheses outlined: maize adoption, planting acreage, loan sizes and cookstove adoption. This lesson has two main aims:

  • To critique the RCT design as an illustration of common design pitfalls, and to show how a weak design impacts the usability of results

  • To analyse the data as best as we can, covering the key analysis tools required for continuous variables

I will explain the steps we are taking and the logic behind them. After this lesson you should be able to conduct your own analyses of RCT data (continuous variables) as well as interpret and critique analyses produced by other analysts within One Acre Fund.

NB. Whilst this lesson explicitly relates to RCT analysis, a lot of the tools introduced will also be applicable to non-RCT data (e.g. regression sections).

1.2 Glossary

  • Continuous, categorical, ordinal variables - types of variables common in data analysis (see link)

  • Residuals - the residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest

  • Hypothesis test - in hypothesis testing, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis that proposes no relationship between two data sets. The comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to a threshold probability (the significance level). Hypothesis tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance.

  • Randomized controlled trial (RCT) - A randomized controlled trial is a type of scientific experiment which aims to reduce bias when testing a new treatment. The people participating in the trial are randomly allocated to either the group receiving the treatment under investigation or to a group receiving standard treatment as the control. Randomization minimizes selection bias and the different comparison groups allow the researchers to determine any effects of the treatment when compared with the no treatment (control) group, while other variables are kept constant.

  • Statistical bias - statistical bias arises when the expected value of the estimator is not equal to the population parameter.

  • Dependent & independent variables - The dependent variables represent the output or outcome whose variation is being studied. The independent variables represent inputs (or predictors).

2 Analysis of RCTs - part I

The Non-Compulsory Maize Trial was run during enrollment for the 2017 core program in Kenya. The trial aimed to determine the effects of One Acre Fund’s shift in policy to make maize an optional rather than compulsory product for clients, a change that was first implemented in the 2016 season. In this trial, the treatment group had to purchase at least a 1/4 acre of maize in order to be eligible to enroll in the program.

Some of the specific hypotheses laid out in the pre-analysis plan included:

  • H1 Making the maize product compulsory will result in higher maize adoption (measured as average maize acres per farmer)

  • H2 Making the maize product compulsory will result in higher maize adoption (measured as the percent of farmers who chose to purchase any maize)

  • H3 Making the maize product compulsory will result in higher transaction size (measured as the average client loan size)

  • H4 Making maize compulsory will result in higher cookstove adoption

We can see that these hypotheses are a little weak; by using the PICOT specification outlined in AMP lesson 2 we can rework them to be more precise. Note that these hypotheses also imply that we only care about “higher than” relationships, when in all likelihood we care about both “higher-than” and “lower-than” relationships (this will be important later). With a little free license we can re-interpret these and add the necessary details:

  • H1 Making the maize product compulsory will result in higher or lower maize adoption (measured as average maize acres per site)

  • H2 Making the maize product compulsory will result in higher or lower maize adoption (measured as the percent of farmers who chose to purchase any maize in a site)

  • H3 Making the maize product compulsory will result in higher or lower transaction size (measured as the average client loan size per site)

  • H4 Making maize compulsory will result in higher or lower cookstove adoption (measured as the percentage of farmers with 1+ cookstoves per site)

2.1 Practical: loading and inspecting data

Let’s now load the data, which you can download here, and examine it with the ‘str()’ command.

dat <- read.csv("lesson-3-data.csv")


paste("Number of rows = ", dim(dat)[1])
[1] "Number of rows =  22396"
paste("Number of cols=", dim(dat)[2])
[1] "Number of cols= 13"
print(str(dat))
'data.frame':   22396 obs. of  13 variables:
 $ OAFID               : int  43141 42509 44278 44284 44280 44283 44282 44271 44268 44272 ...
 $ solar               : int  0 1 1 1 1 1 1 1 0 1 ...
 $ GroupName           : Factor w/ 2290 levels "Abalibayo","Abeingo",..: 1484 2032 1583 1583 1583 1583 1583 1484 1484 1484 ...
 $ X..Repaid           : num  27.1 3.5 49.3 27.9 16.8 26.1 20.4 19.4 26.1 10.8 ...
 $ TotalEnrolledSeasons: int  2 2 1 1 1 1 1 1 1 1 ...
 $ NewMember           : Factor w/ 2 levels "False","True": 1 1 2 2 2 2 2 2 2 2 ...
 $ Treatment           : Factor w/ 2 levels "C","T": 2 2 2 2 2 2 2 2 2 2 ...
 $ DistrictName        : Factor w/ 4 levels "Kabondo","Kakamega (South)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SiteName            : Factor w/ 133 levels "Achuth","Ainapngetuny",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Maize.acres         : num  0.25 0.5 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
 $ Cookstove.qty       : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Maize.adoption      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalCredit         : int  7520 14065 8775 10215 13795 7870 11260 14895 5920 14650 ...
NULL

The str() command provides us with a wealth of information on our data. First, it prints all the column names so we know how to easily access our data; second, it prints the data type, which will generally be one of the following:

  • num - Numeric, a number with a decimal point

  • int - Integer, a number without a decimal point

  • char - Character, a string of letters

  • Factor - a single option (aka a level) drawn from a fixed set of possible values (the levels)

  • bool, boolean - a simple TRUE or FALSE response (note that R will treat TRUE and FALSE as 1 and 0, allowing easy summing, averages, etc. )

Finally it prints a few example values so we can see the data we are dealing with. For example in the above printout it has given us samples of OAFID. Note that data types are a best-guess by R, and might need to be set manually by the analyst.
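
For example, if R has guessed a type incorrectly, you can coerce columns yourself. Here is a minimal sketch using the column names from the str() output above (done on a copy so the original data frame is untouched); adapt it to your own data:

#a minimal sketch of setting column types manually, on a copy of the data
dat2 <- dat
dat2$Treatment <- as.factor(dat2$Treatment)   #ensure the treatment indicator is a factor
dat2$NewMember <- dat2$NewMember == "True"    #convert a "True"/"False" factor to logical (TRUE/FALSE)
str(dat2[c("Treatment", "NewMember")])        #check the new types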

The str() result indicates that treatment status is a factor variable associated with the column name Treatment and that the levels are T and C.

The pre-analysis plan states that this data was collected at the farmer level, but randomized at the site level. This statement should alert you to the fact that ICC (intra-cluster correlation, lesson 2) will be important here.

If you think back to lesson 2, you will remember we suggested two ways to deal with data exhibiting intra-cluster correlation:

  • Summarize the data by cluster (and then treat it like any other data)
  • Account for ICC through cluster-robust methods

We will examine each of these methods in turn, starting with the simplest, which is to merely summarize the data by site and then proceed as usual. The summarized data will also allow us to gently introduce hypothesis testing and regression tools before moving on to ICC corrections (which are a little more complex, and require a basic understanding of regression first).

2.2 Practical: summarizing data by cluster

When we summarize the data we must make our unit of inference match our unit of analysis. So let us summarize the data into sites whilst retaining any character columns:

#helper function: returns the single unique value of x, or NA if x contains more than one unique value
uniqueorna <- function(x) {
  if (length(unique(x)) == 1) {
    unique(x)[1]
  } else {
    NA_character_
  }
}

library(dplyr, quietly = TRUE)

#using %>% is known as piping: it sends the result of one function on to the next stage in the pipe

dat$solar <- as.numeric(dat$solar)

#summarise the numeric columns to get their means
ndat <- dat %>%  
    group_by(SiteName) %>% 
      summarise_if(.predicate=is.numeric, .funs=mean)

#summarise factor columns to get unique or NA returned
cdat <- dat[c("SiteName","DistrictName","Treatment")] %>%  
  group_by(SiteName) %>%  
    summarise_if(.predicate=is.factor, .funs=uniqueorna)

#bind together to new summarised dataframe
sdat <- merge(ndat, cdat, by="SiteName")

str(sdat)
'data.frame':   133 obs. of  11 variables:
 $ SiteName            : Factor w/ 133 levels "Achuth","Ainapngetuny",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ OAFID               : num  16652 2908 43569 43280 16308 ...
 $ solar               : num  0.714 0.25 0.561 0.605 0.7 ...
 $ X..Repaid           : num  29.1 29.1 32.8 29.4 29.7 ...
 $ TotalEnrolledSeasons: num  1.88 1.29 1.32 1.33 1.91 ...
 $ Maize.acres         : num  0.431 0.873 0.397 0.296 0.424 ...
 $ Cookstove.qty       : num  0.00529 0.0098 0.14035 0.02721 0.01111 ...
 $ Maize.adoption      : num  0.942 0.936 0.886 0.816 1 ...
 $ TotalCredit         : num  10068 12045 10509 8207 10238 ...
 $ DistrictName        : Factor w/ 4 levels "Kabondo","Kakamega (South)",..: 3 4 1 1 3 3 1 3 4 4 ...
 $ Treatment           : Factor w/ 2 levels "C","T": 1 1 2 2 2 1 2 1 2 1 ...

Great. Now our data is summarized by site! Note that this means previously binary options for a farmer (e.g. Farmer took maize - Yes or No) now become fractions of the site (i.e. the fraction of the site that took maize, between 0 and 1).
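
As a quick sanity check of the aggregation (a minimal sketch using the columns above), compare the farmer-level and site-level versions of maize adoption:

table(dat$Maize.adoption)       #farmer level: a 0/1 indicator of taking any maize
summary(sdat$Maize.adoption)    #site level: the fraction of farmers in each site who took maize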

2.3 Practical: examining data

Let’s now examine some key data features and the differences between treatment and control; this will also allow us to examine the distributions of the variables for hypothesis testing:

library(ggplot2, quietly = TRUE)  #ggplot() comes from ggplot2

h1 <- ggplot(sdat, aes(x=Maize.acres,color=Treatment)) + geom_density(bw=0.05) +ggtitle("acres of maize")
h2 <- ggplot(sdat, aes(x=Maize.adoption,color=Treatment)) + geom_density(bw=0.1) +ggtitle("maize adoption") 
h3 <- ggplot(sdat, aes(x=Cookstove.qty,color=Treatment)) + geom_density(bw=0.05) +ggtitle("cookstove adoption")
h4 <- ggplot(sdat, aes(x=TotalCredit,color=Treatment)) + geom_density(bw=300) +ggtitle("Total credit")

#qq=qqnorm(supdat$TotalCredit, plot.it=FALSE)
#multiplot() is a helper for arranging ggplots in a grid (assumed to be defined in the lesson's setup)
multiplot(h1,h2,h3,h4,  cols=2)

From these graphs we can quickly see two things:

  • most of the data looks non-normal (see AMP lesson 1 for a refresher)

  • Treatment vs. Control differences look to be quite small (except perhaps maize adoption).

At this point we have re-worked our hypotheses into an appropriate PICOT format, summarized the data and examined the underlying distributions to aid our understanding of appropriate next steps. Let’s now take a quick detour to refresh our hypothesis testing understanding and introduce a couple of new concepts.

2.4 Theory refresher: Hypothesis testing

We have covered a few aspects of hypothesis testing already in AMP lessons 1 and 2, the key ideas to remember are:

  • A P-value of 0.05 means there is a 5% probability that our data (or more extreme data) could have been generated purely by chance under the null hypothesis

  • This means that if we compared two samples from a single population 100 times, we would expect to find 5 occasions where the differences looked significant (even though they were not).

  • This also means that if we test 20 hypotheses on two samples from a single population, then purely by chance we would expect roughly one of those hypotheses to appear significant. Thus, it is important to limit the number of hypotheses tested in any experiment (we will be looking at other ways to correct for this phenomenon in later lessons); a quick simulation of this appears after this list.

  • Hypothesis tests are exquisitely sensitive to the underlying data distribution

  • Different hypothesis tests are suited to different uses
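
Here is the quick simulation promised above: 20 tests on samples drawn from the same population, so any “significant” results are false positives (a hedged sketch with arbitrary simulation parameters):

set.seed(42)
#20 hypothesis tests comparing two samples from the SAME population:
#on average about 1 in 20 will come out "significant" at alpha = 0.05 purely by chance
pvals <- replicate(20, t.test(rnorm(100, 50, 10), rnorm(100, 50, 10))$p.value)
sum(pvals < 0.05)   #count of spurious "significant" results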

Hypothesis tests will generally be either “one-sample” or “two-sample” tests. A one-sample hypothesis test is used (unsurprisingly) for testing a single sample, such as an observed mean against a “true” (or theoretical) mean. A two-sample test is used for comparing two independent samples against each other.

Tests can also be “one-tailed” or “two-tailed”. A one-tailed test requires an assumption about directionality: we are essentially saying “test that X is larger than Y” or “test that X is less than Y” (depending on whether the tail is specified as greater-than or lesser-than, respectively). This contrasts with a two-tailed test, where we make no assumption about directionality. This means each hypothesis can take one of four forms:

  • H0: \(\mu_1 = \mu_2\)
  • H1: \(\mu_1 \neq \mu_2\)
  • H2: \(\mu_1 > \mu_2\)
  • H3: \(\mu_1 < \mu_2\)

H0 is our null hypothesis formulation, whilst H1-H3 specify the test tails.

We can demonstrate the advantages and disadvantages of the one-tailed test over the two-tailed test below with some simulated data:

#fake some data
set.seed(999)
da <- rnorm(1000,50,25)
db <- rnorm(1000,56,25)

#plot data
library(ggplot2, quietly = TRUE)

ggplot() + geom_density(aes(da), alpha=0.1, fill="blue") + geom_density(aes(db), alpha=0.1, fill="red")

These two distributions look somewhat different (note that I’ve set \(\mu_1 = 50\) and \(\mu_2 = 56\)).

As the data is normally distributed with equal variance we can use either the t-test or the Welch test. Running a two-sided Welch test will confirm the differences in these distributions with a low p-value:

t.test(da,db, alternative="two.sided")

    Welch Two Sample t-test

data:  da and db
t = -6.9159, df = 1994.5, p-value = 6.239e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.936525 -5.546082
sample estimates:
mean of x mean of y 
  49.1871   56.9284 

If we run a one-sided test, specified as less-than (i.e. \(\mu_1 < \mu_2\)), we will also confirm this (note the lower P-value produced):

t.test(da,db, alternative="less")

    Welch Two Sample t-test

data:  da and db
t = -6.9159, df = 1994.5, p-value = 3.119e-12
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -5.899278
sample estimates:
mean of x mean of y 
  49.1871   56.9284 

However, if we run a greater-than one-sided test then we will see that we detect no significant difference:

t.test(da,db, alternative="greater")

    Welch Two Sample t-test

data:  da and db
t = -6.9159, df = 1994.5, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -9.583329       Inf
sample estimates:
mean of x mean of y 
  49.1871   56.9284 

Note how in the above examples, the p-value reported for the one-sided test was lower than that of the two-sided test. This is because we are changing how we allocate our probability. If we want to test two samples with a p-value threshold of 0.05, then by knowing a priori the direction of change we can allocate the entire 0.05 to one side of the distribution (rather than splitting it into 2 * 0.025). This gives us a higher probability of detecting a difference between populations. In the diagram below you can see that a difference of 1.651 standard deviations would be significant in the one-tailed test, but not in the two-tailed test.

knitr::include_graphics("./one_two_tailed.png")
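
The cut-offs behind this diagram can be reproduced directly (a small sketch using the normal distribution; the 1.651 in the figure presumably comes from the analogous t-distribution quantile):

qnorm(0.95)    #one-tailed 5% cut-off: ~1.645 standard deviations
qnorm(0.975)   #two-tailed 5% cut-off: ~1.960 standard deviations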

Note that you cannot merely apply a one-sided test twice (once in each direction) as then you are essentially setting your P-value threshold to 2*alpha (e.g. 0.1, rather than 0.05, in the above example).
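
A quick simulation makes this concrete: applying both one-sided tests to samples from a single population rejects roughly 10% of the time, not 5% (a minimal sketch with arbitrary simulation parameters):

set.seed(123)
false_pos <- replicate(2000, {
  x <- rnorm(50, 50, 10)
  y <- rnorm(50, 50, 10)
  #reject if EITHER one-sided test is "significant" at 0.05
  min(t.test(x, y, alternative="less")$p.value,
      t.test(x, y, alternative="greater")$p.value) < 0.05
})
mean(false_pos)   #close to 0.10 (i.e. 2 * alpha), not 0.05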

It is somewhat unlikely that we will run one-sided tests for our farmer interventions at One Acre Fund; take a moment to think why that might be before expanding the box below.

2.5 Theory refresher: which hypothesis test?

Let’s briefly re-visit some of the tests we have covered so far (we will cover tests involving binary or categorical variables in lesson 4):

  • Two-sample T-test - this will test for equivalence of means. It assumes that data is normal-like and that both samples have equal variance (see lesson 1)

  • Two sample Welch test - this will test for equivalence of means. It requires normal-like distributions but variances can be different (see lesson 1).

  • Two-sample Wilcoxon test (aka Mann-Whitney U test) - this will test for similarity of medians. It requires independent samples but does not require normality (see lesson 1). Note that the Wilcoxon test is also suitable for coded ordinal data.

  • One-Sample Shapiro test - a test for normality on one sample. Specifically we are testing the null hypothesis that the data is from a normal distribution.

Either the Wilcoxon test or the Welch test will be appropriate for OAF data in 95% of cases; notable exceptions are:

  • testing time-series data, as these are not independent, e.g. testing whether monthly values over a year from 100 farmers are similar to those of another 100 farmers. It is fine to test end-points, e.g. end-of-season repayment for treatment and control groups

  • Discrete variables (e.g. answers that are Y/N or which have no clear ranking) cannot be tested with any of these tests (see lesson 4 instead). Ordinal data can be used with the Wilcoxon test (the Likert scale is a common example of ordinal data)

  • IID violations. IID stands for independent and identically distributed. Many hypothesis tests assume IID, which is to say that each event recorded is independent of other events. For example, rolling a six on a die is independent of whatever I rolled previously; however, independence is commonly violated by time-series data (as above) and in clustered data (meaning many hypothesis tests are unsuitable for clustered data analysis).

We will explore time-series data analysis in later lessons (if this is something of interest to you then please email Mike.Barber@ so I can get an idea of how important a task this is).

You can review AMP lesson 1 for more details on these hypothesis tests.

2.6 Practical: hypothesis testing non-compulsory maize

Let’s now return to hypothesis testing our non-compulsory maize data set.

Remember the earlier plots of cookstove adoption, maize acres, maize adoption and total credit (loan size):

h1 <- ggplot(sdat, aes(x=Maize.acres,color=Treatment)) + geom_density(bw=0.05) +ggtitle("acres of maize")
h2 <- ggplot(sdat, aes(x=Maize.adoption,color=Treatment)) + geom_density(bw=0.1) +ggtitle("maize adoption") 
h3 <- ggplot(sdat, aes(x=Cookstove.qty,color=Treatment)) + geom_density(bw=0.05) +ggtitle("cookstove adoption")
h4 <- ggplot(sdat, aes(x=TotalCredit,color=Treatment)) + geom_density(bw=300) +ggtitle("Total credit")

#qq=qqnorm(supdat$TotalCredit, plot.it=FALSE)
multiplot(h1,h2,h3,h4,  cols=2)

We previously commented that these look non-normal; let’s test this with the Shapiro test (using our usual 0.1 p-value threshold for this test):

shapiro.test( subset(sdat, Treatment=="T")$TotalCredit)

    Shapiro-Wilk normality test

data:  subset(sdat, Treatment == "T")$TotalCredit
W = 0.96708, p-value = 0.07286
shapiro.test( subset(sdat, Treatment=="T")$Maize.acres)

    Shapiro-Wilk normality test

data:  subset(sdat, Treatment == "T")$Maize.acres
W = 0.89203, p-value = 2.772e-05
shapiro.test( subset(sdat, Treatment=="T")$Cookstove.qty)

    Shapiro-Wilk normality test

data:  subset(sdat, Treatment == "T")$Cookstove.qty
W = 0.58759, p-value = 1.982e-12
shapiro.test( subset(sdat, Treatment=="T")$Maize.adoption)

    Shapiro-Wilk normality test

data:  subset(sdat, Treatment == "T")$Maize.adoption
W = 0.19004, p-value < 2.2e-16

These all fall below our 0.1 threshold, which means we reject the null hypothesis that they are normally distributed. We can now apply an appropriate (i.e. non-parametric) hypothesis test to this data (alternatively, we could also transform the data to normality and then test it with a parametric method).
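
As a sketch of that alternative route (assuming a log transform is a sensible choice for the variable, which would still need to be checked), we could transform TotalCredit and re-test:

logCredit_T <- log(subset(sdat, Treatment == "T")$TotalCredit)
logCredit_C <- log(subset(sdat, Treatment == "C")$TotalCredit)
shapiro.test(logCredit_T)        #check whether the transform actually achieved normality
t.test(logCredit_T, logCredit_C) #if so, a Welch test on the transformed data is reasonable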

The aim of good data presentation should be to convey information accurately, yet succinctly. We can achieve this with some annotated bar graphs displaying the treatment and control means, errors, powers and p-values:

# remember this? Monte-Carlo estimate of power: resample both groups, run a Wilcoxon test each time,
# and report the fraction of runs that come out significant at 0.05
MCpower = function(sample1, sample2, size) {
  reps = 1000
  results <- sapply(1:reps, function(r) {
    resample1 <- sample(sample1, size=size, replace=TRUE)
    resample2 <- sample(sample2, size=size, replace=TRUE)
    test <- wilcox.test(resample1, resample2, alternative="two.sided", paired=FALSE, correct=TRUE)
    test$p.value
  })
  sum(results < 0.05)/reps
}




# plot

library(Hmisc, quietly = TRUE)    #needed for the bootstrapped error bars (mean_cl_boot)
library(cowplot, quietly = TRUE)  #provides add_sub() and plot_grid(), used below
#loan size
t <- "TotalCredit" 
tp <- MCpower( subset(sdat,Treatment=="T")[[t]] , subset(sdat,Treatment=="C")[[t]] , size= length(subset(sdat, Treatment=="C")[[t]] ))

g1 <- ggplot(sdat, aes(Treatment,TotalCredit))  + 
    stat_summary(fun.y = mean, geom = "bar") +
     stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.25) +
      ggtitle("Loan sizes")

g1 <- add_sub(g1, paste("stat power ~ ",round(tp,3),"\n","Wilcoxon P-value:",format(wilcox.test(subset(sdat, Treatment=="C")[[t]], subset(sdat, Treatment=="T")[[t]]  ,paired=FALSE, correct=TRUE)$p.value,scientific = TRUE,digits=3) ))


#maize acres
t <- "Maize.acres" 
tp <- MCpower( subset(sdat,Treatment=="T")[[t]] , subset(sdat,Treatment=="C")[[t]] , size= length(subset(sdat, Treatment=="C")[[t]] ))

g2 <- ggplot(sdat, aes(Treatment,Maize.acres)) + stat_summary(fun.y = mean, geom = "bar") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.25) + ggtitle("Maize.acres")  
g2 <- add_sub(g2, paste("stat power ~ ",round(tp,3),"\n","Wilcoxon P-value:",format(wilcox.test(subset(sdat, Treatment=="C")[[t]], subset(sdat, Treatment=="T")[[t]]  ,paired=FALSE, correct=TRUE)$p.value,scientific = TRUE,digits=3) ))

#maize adoption
t <- "Maize.adoption" 
tp <- MCpower( subset(sdat,Treatment=="T")[[t]] , subset(sdat,Treatment=="C")[[t]] , size= length(subset(sdat, Treatment=="C")[[t]] ))

g3 <- ggplot(sdat, aes(Treatment,Maize.adoption)) + stat_summary(fun.y = mean, geom = "bar") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.25) + ggtitle("Maize.adoption")  
g3 <- add_sub(g3, paste("stat power ~ ",round(tp,3),"\n","Wilcoxon P-value:",format(wilcox.test(subset(sdat, Treatment=="C")[[t]], subset(sdat, Treatment=="T")[[t]]  ,paired=FALSE, correct=TRUE)$p.value,scientific = TRUE,digits=3) ))
#cookstove qty
t <- "Cookstove.qty" 
tp <- MCpower( subset(sdat,Treatment=="T")[[t]] , subset(sdat,Treatment=="C")[[t]] , size= length(subset(sdat, Treatment=="C")[[t]] ))

g4 <- ggplot(sdat, aes(Treatment,Cookstove.qty)) + stat_summary(fun.y = mean, geom = "bar") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.25) + ggtitle("Cookstove.qty")  
g4 <- add_sub(g4, paste("stat power ~ ",round(tp,3),"\n","Wilcoxon P-value:",format(wilcox.test(subset(sdat, Treatment=="C")[[t]], subset(sdat, Treatment=="T")[[t]]  ,paired=FALSE, correct=TRUE)$p.value,scientific = TRUE,digits=3) ))

#assemble
library(cowplot, quietly = TRUE)
plot_grid(g1,g2,ncol =2, nrow = 1)

plot_grid(g3,g4 , ncol = 2, nrow = 1)

Note that I’ve made a simple function (nicebars) that will produce graphs similar to those above; I recommend you use this function when presenting your own data:

nicebars <- function(varname, dataframe1, treatname){
  #copy the variable of interest and the treatment indicator into temporary columns
  dataframe1$temp = unlist(dataframe1[varname])
  dataframe1$treated = unlist(dataframe1[treatname])
  t1 = unique(dataframe1$treated)[1]
  t2 = unique(dataframe1$treated)[2]

  #rename the treatment column so ggplot can refer to it directly
  colnames(dataframe1)[which(colnames(dataframe1)==treatname)] = "treatname"

  #Monte-Carlo power estimate for the two groups
  tp = MCpower( subset(dataframe1,treated==t1)$temp , subset(dataframe1,treated==t2)$temp , size= length(subset(dataframe1, treated==t2)$temp ))

  #bar of group means with bootstrapped error bars, annotated with power and Wilcoxon p-value
  gx = ggplot(dataframe1, aes(treatname,temp)) + stat_summary(fun.y = mean, geom = "bar") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.25) + ggtitle(varname)  
  gx = add_sub(gx, paste("stat power ~ ",round(tp,3),"\n","Wilcoxon P-value:",format(wilcox.test(subset(dataframe1, treated==t1)$temp, subset(dataframe1, treated==t2)$temp  ,paired=FALSE, correct=TRUE, exact=FALSE)$p.value,scientific = TRUE,digits=3) )) 

  library(cowplot)
  plot_grid(gx,ncol =1, nrow = 1)
}

# nicebars("Cookstove.qty", sdat, "Treatment")

nicebars requires three arguments:

  • varname - the name of the variable you want to plot

  • dataframe1 - the dataframe to plot from

  • treatname - the variable indicating treatment/control

And is called like this nicebars("Cookstove.qty", sdat, "Treatment").

We can see from the above that three of our four hypotheses are significantly under-powered (refresh your power understanding). We can use the MDE concept introduced in AMP lesson 2 to understand the effect size we would need to achieve a power of 0.8 with our current sample size:

#post-hoc MDE: the effect size needed to reach 0.8 power given our observed sample size and variance
posthoc_mde = function(d1,d2){
  require(pwr)
  #print(paste("mean diff (d2-d1)", mean(d2, na.rm=TRUE)-mean(d1,na.rm=TRUE)))
  #pooled standard deviation of the two groups
  s1 = sd(d1, na.rm=TRUE)
  s2 = sd(d2, na.rm=TRUE)
  spo = sqrt((s1**2 + s2**2)/2)
  #Cohen's d required for power = 0.8 at this sample size, converted back into the variable's units
  mded = pwr.t.test(n = length(d1), d=NULL, sig.level = 0.05, power=0.8, type="two.sample", alternative = "two.sided")$d
  mde_out = mded * spo
  return(mde_out)
}

mde1 = round(posthoc_mde(subset(sdat, Treatment=="T")$Maize.adoption, subset(sdat, Treatment=="C")$Maize.adoption),3)
mde2 = round(posthoc_mde(subset(sdat, Treatment=="T")$TotalCredit, subset(sdat, Treatment=="C")$TotalCredit),3)
mde3 = round(posthoc_mde(subset(sdat, Treatment=="T")$Maize.acres, subset(sdat, Treatment=="C")$Maize.acres),3)
mde4 = round(posthoc_mde(subset(sdat, Treatment=="T")$Cookstove.qty, subset(sdat, Treatment=="C")$Cookstove.qty),3)

ac1 = round(mean(subset(sdat, Treatment=="T")$Maize.adoption, na.rm=TRUE) - mean(subset(sdat, Treatment=="C")$Maize.adoption, na.rm=TRUE),3)
ac2 = round(mean(subset(sdat, Treatment=="T")$TotalCredit, na.rm=TRUE) - mean(subset(sdat, Treatment=="C")$TotalCredit,na.rm=TRUE),3)
ac3 = round(mean(subset(sdat, Treatment=="T")$Maize.acres,na.rm=TRUE) - mean(subset(sdat, Treatment=="C")$Maize.acres, na.rm=TRUE),3)
ac4 = round(mean(subset(sdat, Treatment=="T")$Cookstove.qty,na.rm=TRUE) - mean(subset(sdat, Treatment=="C")$Cookstove.qty, na.rm=TRUE),3)

From the above, we can report the MDE (at 0.8 power), and the actual measured effect, for each hypothesis:

  • H1 maize adoption (measured as average maize acres per site):
    • MDE(0.8) = 0.06
    • actual = 0.018 additional acres
  • H2 maize adoption (measured as the percent of farmers who chose to purchase any maize in a site),
    • MDE(0.8) = 3.5 % pts
    • actual = 6 % pts
  • H3 transaction size (measured as the average client loan size per site),
    • MDE(0.8) = 534.799 Kenyan Shillings
    • actual = 6.297 Kenyan shillings (~100 KES = 1USD)
  • H4 cookstove adoption (measured as the percentage of farmers with 1+ cookstoves per site),
    • MDE(0.8) = 1.3 % pts
    • actual = 0.2 % pts

This means that with our current setup, we would need effects equal to, or larger than, those listed above to achieve a power of 0.8. We can also say that we are fairly confident that the true effects do not exceed the values above, or we would have had an 80% probability of detecting them.

2.6.1 Interpreting hypothesis tests

I want you now to take a moment and think about the results above, specifically:

  • What are the error bars showing?

  • What does the height of the bar represent?

  • What does it mean to have a stat power of 0.1, or 0.7?

  • What does it mean to have a p-value of 0.1 (aka 1e-01)?

  • What do the MDE values mean?

Think about these concepts in relation to each of the 4 hypotheses outlined above, and decide what conclusions you can and cannot draw. This exercise will form part of the homework assignment for this lesson, please note your conclusions and book an office hours slot with me to discuss them (alternatively, as a second-preference option, you can email me the conclusions at Mike.Barber@).

2.7 Summary

  • We can use hypothesis tests and power calculations in an RCT to draw conclusions about our hypotheses.

  • There is no one-size-fits-all hypothesis test; choose the appropriate test wisely!

  • MDEs allow us to understand what effect size we would need to have sufficient power

  • Remember that in an RCT, as long as our treatment and control groups are balanced, these simple hypothesis tests will tell us all we need to know about a study. Our regression results should not differ from these tests; if they do, then it suggests something is amiss!

  • We use different hypothesis tests in different situations; remember which test to use when!

3 Theory: Regressing independent data

Up to this point we’ve scrutinized the data and executed initial hypothesis tests to understand effect sizes and statistical power. We’ll now turn to the second major tool in the analyst’s tool set: linear regressions.

To run a linear regression is simple. Just lm(y ~ x) and hey-presto, results! Easy, right?

Well, yes and no. Linear regressions have a set of assumptions and caveats associated with them which need to be checked before we interpret the regression results. If we violate these assumptions then the code will still run, and you will still get coefficients and p-values (etc.), but they will likely be meaningless! This means that a bad linear regression can result in recommending interventions which actually have no effect, or missing interventions which do have an effect.

A little statistics can be a dangerous thing!

3.1 Criteria for a good model

We will here define a good model according to the following criteria:

  • model is capable of explaining/predicting the full range of Y values

  • model can accurately provide confidence intervals to inform our understanding of uncertainty (and significance)

  • model is not distorted by outliers

We can satisfy these requirements by fulfilling the following criteria:

  • the relationship between X and Y is linear

  • Errors are independent of X (a.k.a. homoskedasticity)

  • variables are mostly un-correlated with each other

  • Residuals are normally distributed

Therefore, to build a good model, we will need to understand and evaluate the underlying assumptions outlined above. We will do this through a series of plots that examine a number of things, but perhaps most importantly, the residuals of a model:

  • A residual is the difference between a model prediction and an actually observed value.

We will explore these plots first with a well-behaved regression, to understand how plots look in a best-case scenario, and then see how these plots react to heteroskedasticity and non-linearity.

3.2 Theory: A well-behaved regression

First, a well behaved regression model. We will (hopefully) remember the definition of a straight line:

\(Y = \beta \cdot X + c\) - where \(c\) is our intercept and \(\beta\) our coefficient

We will use this equation to understand regression, with the addition of an \(e\) term to represent the independent error around our mean. When running a regression, our aim is to determine:

\(Y = \beta \cdot X + c + e\)

so that we can accurately determine the size of coefficient \(\beta\) (and so our treatment effect) and \(e\) (so we can determine its significance).

Let’s use a worked example to demonstrate how regression works. First, let us define our relationship (with the aim of recovering these values from the regression results):

  • Intercept = \(10\)
  • \(\beta\) = \(2\)
  • Error term \(\sigma = 3\) (i.e. \(e \sim N(0, 3)\))

We’ll simulate 200 independent values based on this equation now:

set.seed(111)
ex1 <- data.frame("Xvals"=runif(1:200, min=1,max=10) )
ex1$Yvals = 10 + 2*ex1$Xvals + rnorm(200,0,3) # y = \beta . X +c +e

head(ex1[c("Xvals","Yvals")])
##      Xvals    Yvals
## 1 6.336832 24.47252
## 2 7.538330 21.59567
## 3 4.333798 19.98488
## 4 5.634314 21.88319
## 5 4.398969 16.70039
## 6 4.765036 16.75019

Let’s now regress Yvals against Xvals, asking: “what correlation is there between Xvals and Yvals?”.

exreg1 <- lm(Yvals~Xvals, data=ex1)

As our regression is relatively simple, with only one independent variable, we can easily visualize the results with a simple plot:

plot(ex1$Xvals, ex1$Yvals,  pch = 16, cex = 1.3, col = "blue", main = "Well-behaved regression", xlab="Xvar", ylab="Yvar")
abline(exreg1, lwd=5) 

We can see the black line does a good job of describing the relationship between Xvals and Yvals. Also note that the blue points are relatively equidistant from the regression line; this is something we care a lot about in regression analysis. The distance from each point (blue) to the model (black line) is known as the residual; we will collect these together (the residuals) to understand our model’s performance (below).

Before analyzing model residuals, let’s quickly examine the regression summary:

library(stargazer)
stargazer(exreg1, type = "html",style="all", title="well-behaved regression example 1",notes="", notes.append=FALSE,star.char="", omit.table.layout = "na")
well-behaved regression example 1
Dependent variable:
Yvals
Xvals 2.071
(0.080)
t = 25.906
p = 0.000
Constant 9.786
(0.489)
t = 20.019
p = 0.000
Observations 200
R2 0.772
Adjusted R2 0.771
Residual Std. Error 2.911 (df = 198)
F Statistic 671.096 (df = 1; 198) (p = 0.000)
# summary(exreg1)

Let’s refresh what these terms mean before we proceed:

  • P-value: The p-value derives from testing the null hypothesis that the coefficient = 0. If the p-value is below a certain threshold (usually 0.05) then we reject the null hypothesis and call the result “significant”. Note that significance in statistics means something like “reliable”. So a coefficient can be very small but still statistically significant (reliable).

  • Coefficient: Each coefficient estimates the change in the mean response per unit increase in X when all other predictors are held constant.

  • Constant: The intercept term in our equation.

  • Observations: The number of rows in our dataframe.

  • \(R^2\) R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (or the coefficient of multiple determination for multiple regression). R-squared = Explained variation / Total variation. R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots. R-squared also does not indicate whether a regression model is adequate: you can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

  • Adjusted R2 The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.

  • F statistic Does the addition of variables improve the regression model overall? The F-statistic reports on the null hypothesis that our current model is indistinguishable from an intercept-only model. Rejection of this hypothesis (p < 0.05) suggests that the inclusion of variables is helpful.

We can see that we nicely recover the intercept and coefficient terms (plus some confidence intervals proportional to our error term):

  • Intercept = 9.79
  • \(\beta\) = 2.07
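
A minimal sketch of pulling these estimates (and their confidence intervals) straight from the fitted model object, rather than reading them off the table:

coef(exreg1)                   #intercept and slope estimates
confint(exreg1)                #95% confidence intervals for both
summary(exreg1)$r.squared      #R-squared
summary(exreg1)$adj.r.squared  #adjusted R-squared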

Next, let us examine the diagnostic plots to evaluate how the model fulfills the assumptions we stated above:

par(mfrow=c(2,2))
plot(exreg1)

Let’s take these plots one by one and diagnose them:

plot(exreg1, which=1)

  • This plot shows the model residuals vs fitted values. Ideally, we want the residuals to be evenly distributed around the dotted black line at 0. The red line indicates the mean residual value, and ideally would lie on the 0.0 black line

  • A non-flat red line indicates that our data may be non-linear (think through why this is, using the “Xvals vs Yvals” graph above).

  • Points with un-even or non-symmetrical scatter around the 0.0 line represent either heteroskedasticity or non-linearity (again - think through why this is so).

  • A nice summary can be seen here.

plot(exreg1, which=2)

  • This is a QQplot of residuals. The presence of trend deviations from the diagonal line (which represents perfect normality) suggest non-normality in our residuals (see AMP Lesson 1 for more information). We don’t see any trends in our residuals here (the scatter at high and low X is minor and unalarming as it does not constitute a trend).
plot(exreg1, which=3)

  • The scale-location plot shows the square root of the standardized residuals as a function of the fitted values. This is similar to our residuals vs fitted plot above, and should be interpreted similarly.
plot(exreg1, which=5)

  • Finally, the leverage plot measures the impact of each point on our final output vs the residual of that point. Essentially, this graph will aid us in spotting data points that have an outsize influence on our model. If some points have high leverage and a high (absolute) residual value then they will be skewing our model more than other data points. Cook’s distance (the dotted lines) is a measurement combining residual size and leverage.

  • Note that we don’t see any Cook’s distance lines on the above as our Cook’s values are too small (this is a good thing!)

Overall, these plots look reasonable, and I am happy that we have satisfied the linear model requirements. Note that we could also use the Shapiro test on the QQplot of residuals to test for normality:

resi1 <- shapiro.test( rstandard(exreg1)) #standardised residuals. i.e. divided by stdev
resi1

    Shapiro-Wilk normality test

data:  rstandard(exreg1)
W = 0.99115, p-value = 0.2613

Our Shapiro test indicates normal residuals.

In summary, this regression looks healthy:

  • The residuals vs fitted plot shows a flat, straight line and no obvious pattern in the residuals

  • The QQplot of residuals likewise looks reasonable. Whilst we can see a few deviations from the dashed-line, these are not what I would call “trend deviations”. We can use a Shapiro test on the residuals to confirm this (p= 0.2613), though in general I recommend simply examining the QQplot rather than running repeated hypothesis tests.

  • Our scale-location plot likewise is flat and without pattern

  • Our residuals vs leverage plot does not reveal any clear outliers.

3.3 Theory: Non-linear relationships

Let’s now turn to a poorly behaved regression, specifically, we will examine the diagnostic plots produced from a non-linear relationship and compare them to the plots above:

ex2 <- data.frame("Xvals"=runif(1:200, min=1,max=10) )
ex2$Yvals = 10 + ex2$Xvals**3 + rnorm(200,0,15)



plot(ex2$Xvals, ex2$Yvals,  pch = 16, cex = 1.3, col = "blue", main = "non-linear data", xlab="Xvar", ylab="Yvar")
abline(lm(Yvals~ Xvals ,data=ex2), lwd=5) 

Hmm, clearly here we have tried to fit a linear model to a non-linear relationship! This model is therefore useless (remember our definition of a good model from the start of this section). Note that whilst we can clearly see our model is terrible (i.e. it would do a bad job of predicting Ys given a set of Xs), we are still able to achieve a great R-squared of 0.856. Do not blindly follow R-squareds into oblivion!

Let’s look at the diagnostic plots:

exreg2 <- lm(Yvals~Xvals, data=ex2)



par(mfrow=c(2,2))
plot(exreg2)

Compare these plots to the well-behaved regression plots to understand how they vary under non-ideal conditions. Likewise, spend some time trying to understand how these plots relate to the simple X-Y plot above.

For example:

  • We can note that our residuals cross zero twice in the residuals vs fitted plot, just as our regression line crosses the non-linear data trend twice in the X-Y plot above.

  • Our scale-location plot shows a similar pattern.

  • Note also that in our residuals vs leverage plot we clearly see a trend towards higher leverage and higher residuals (as our data points get further and further from our trend line)

If we examine our QQplot more closely:

plot(exreg2, which=2)

  • We’ll note that our residuals follow the trend we would expect from the X-Y plot (that is, the points in the X-Y plot are far from the line, get closer, and then move further away again).

Spend some time examining the relationships within these plots to improve your ability to quickly diagnose which linear-model assumption you are violating.
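
One common remedy, not pursued further in this lesson, is to model the non-linearity directly. A minimal sketch, assuming we know (or suspect) the cubic form used to simulate the data:

exreg2_cubic <- lm(Yvals ~ I(Xvals^3), data=ex2)  #regress on X cubed instead of X
par(mfrow=c(2,2))
plot(exreg2_cubic)   #the diagnostics should now look much closer to the well-behaved case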

3.4 Theory: Poorly-behaved regression, heteroskedastic data

Finally, let’s look at what extreme heteroskedasticity looks like (remember: heteroskedasticity is when our error term depends on X):

ex3 <- data.frame("Xvals"=runif(1:200, min=1,max=10) )

ex3$Yvals = 10 + 3*ex3$Xvals + rnorm(200,1,1+4*ex3$Xvals)


plot(ex3$Xvals, ex3$Yvals,  pch = 16, cex = 1.3, col = "blue", main = "heteroskedastic-data ", xlab="Xvar", ylab="Yvar")
abline(lm(Yvals~ Xvals ,data=ex3), lwd=5) 

Note the cone-shaped data indicating heteroskedasticity.

exreg3 <- lm(Yvals~Xvals, data=ex3)

par(mfrow=c(2,2))
plot(exreg3)

Again, compare this to our well-behaved regression to understand how heteroskedastic data appears in regression plots (hint: examine the residuals vs fitted plots, and the scale-location plots closely).
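
If you want a formal check alongside the plots, one option (not used elsewhere in this lesson) is the Breusch-Pagan test from the lmtest package, which tests the null hypothesis of constant error variance:

library(lmtest, quietly = TRUE)
bptest(exreg3)   #a small p-value is evidence of heteroskedasticity
bptest(exreg1)   #the well-behaved regression should not trigger it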

Remember - in a real regression analysis with multiple factors you will not have a simple Xvar-Yvar regression plot to easily diagnose your model, so you will need to be comfortable using these diagnostics alone!

Finally, I want to introduce a plot that isn’t automatically produced by R, but that is incredibly useful in regression diagnostics. In this graph, we plot our fitted values against our real (observed) values to see how the model performs; let’s produce this graph for each of the regression examples above:

g1 = ggplot() +geom_point(aes(x=predict(exreg1), y=ex1$Yvals)) + xlab("Fitted") + ylab("Actual") + ggtitle("Well-behaved regression")
g2 = ggplot() +geom_point(aes(x=predict(exreg2), y=ex2$Yvals)) + xlab("Fitted") + ylab("Actual") + ggtitle("Non-linear regression")
g3 = ggplot() +geom_point(aes(x=predict(exreg3), y=ex3$Yvals)) + xlab("Fitted") + ylab("Actual") + ggtitle("Heterosked. regression")



multiplot(g1,g2,g3, cols=2)

Spend some time examining and understanding these plots, and how these models compare against the criteria we outlined above:

  • model is capable of explaining both low and high Y values

  • model can accurately provide confidence intervals to inform our understanding of variability (and significance)

  • model is not distorted by outliers

  • given a set of Xs, our model can accurately predict Ys

3.5 Summary:

  • Regressions have assumptions that must be met to produce meaningful results

  • Diagnostic plots will allow you to assess the extent to which your model is meeting these assumptions

  • Always present diagnostic plots for any model you create! You can keep these in an appendix at the end of the document/presentation.

4 Theory: data transformations

One of the approaches to improving diagnostic plots for a regression is transforming data. Generally, either X or Y variables will be transformed. We will focus on Y-variable transformations here, as they are much more commonly practiced at One Acre Fund than X transformations.

Common y transformations include:

  • \(\ln\) or \(\log_{10}\) transformations

  • Simple power transformations (square root or squared)

It is commonly assumed that when y-variables are transformed we can merely back-transform (that is, apply the inverse function to) the model coefficients to make them interpretable.

This is incorrect! Transforming a y-variable will alter the coefficients in more complex ways!

Generally, back-transforming coefficients for y transformed data will require solving complex differential equations (there is a notable exception here, which is covered later in the lesson), and is not as simple as a back transformation. Let’s demonstrate this by returning to our well-behaved regression:

ex4 <- data.frame("Xvals"=runif(1:200, min=1,max=10) )
ex4$Yvals = 10 + 2*ex4$Xvals + rnorm(200,0,5) # y = \beta . X +c +e

exreg4 <- lm(Yvals ~ Xvals, data=ex4)

stargazer(exreg4, type = "html",style="all")
Dependent variable:
Yvals
Xvals 2.209***
(0.148)
t = 14.907
p = 0.000
Constant 8.683***
(0.951)
t = 9.135
p = 0.000
Observations 200
R2 0.529
Adjusted R2 0.526
Residual Std. Error 5.360 (df = 198)
F Statistic 222.225*** (df = 1; 198) (p = 0.000)
Note: p<0.1; p<0.05; p<0.01

Remember that we calculated Y so that \(Y = 10 + 2X + e\) (where \(e\) is our error term; i.e. the intercept is 10 and \(\beta = 2\)). We can see this regression nicely returns these values to us.

Let’s instead regress against the square root of Y; we can see our diagnostic plots still look reasonable:

ex4$sqrtY = sqrt(ex4$Yvals)

par(mfrow=c(2,2))
plot(lm(sqrtY ~ Xvals, data=ex4) )

And our coefficients:

stargazer( lm(sqrtY ~ Xvals, data=ex4) , type="html")
Dependent variable:
sqrtY
Xvals 0.253***
(0.018)
Constant 3.079***
(0.112)
Observations 200
R2 0.513
Adjusted R2 0.511
Residual Std. Error 0.634 (df = 198)
F Statistic 208.572*** (df = 1; 198)
Note: p<0.1; p<0.05; p<0.01

This time we get an intercept of 3.0789229 and \(\beta\) of 0.253. If we square these (square being the inverse of square root) then we get:

  • Intercept = 9.48
  • Xvals = 0.064

Clearly, these do not relate to either the original formula or the original regression. Do not make the mistake of blindly back-transforming coefficients!

If transforming variables is necessary, then it is wise to pick a transformation that retains some physical meaning, for example taking the square root of an area (which would yield a length). An important danger with transformations is introducing new censoring in the data; for example, squaring “change in farmer expenditure (USD)” would make all values positive, and so make model interpretation impossible.

One notable exception to these transformations is the natural log (in R: log()). When Y data is log-transformed, a coefficient (multiplied by 100) can be interpreted as the approximate % change expected in Y for a one-unit increase in X.
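
A minimal sketch of this exception, using simulated data where each additional unit of X multiplies Y by roughly 5% (the data frame and variable names here are made up for the example):

set.seed(222)
ex5 <- data.frame("Xvals"=runif(200, min=1, max=10))
ex5$Yvals <- exp(2 + 0.05*ex5$Xvals + rnorm(200, 0, 0.1))    #each unit of X raises Y by ~5%
coef(lm(log(Yvals) ~ Xvals, data=ex5))["Xvals"] * 100        #recovers ~5 (% change per unit of X)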

4.1 Summary

  • Do not blindly transform and back-transform data!

  • Transformed data often yields un-interpretable coefficients (without solving the appropriate equations)

  • Natural-log transformations are an important exception to this rule: an ln transformation yields coefficients where a one-unit increase in X corresponds to approximately a \(\beta \times 100\) percent change in Y.

5 Practical: non-compulsory maize regression

Let’s now turn to our real-life One Acre Fund data example, and begin with a really simple regression testing TotalCredit (y) against Maize.adoption (x). This is asking “what correlation is there between Maize.adoption and TotalCredit?”.

bad_reg <- lm(TotalCredit~ Maize.adoption ,data=sdat)


#this library just makes pretty summaries
library(stargazer, quietly = TRUE)
#I like this layout for regression summaries, feel free to explore the many, many options in stargazer. Alternatively, feel free to just use my line of code here (and swap out "bad_reg" for your regression model variable)
stargazer(bad_reg, type = "html",style="all", title="Simple regression example 1",notes="", notes.append=FALSE,star.char="", omit.table.layout = "na")
Simple regression example 1
Dependent variable:
TotalCredit
Maize.adoption 4,262.198
(1,186.132)
t = 3.593
p = 0.0005
Constant 5,554.371
(1,137.532)
t = 4.883
p = 0.00001
Observations 133
R2 0.090
Adjusted R2 0.083
Residual Std. Error 1,046.634 (df = 131)
F Statistic 12.912 (df = 1; 131) (p = 0.0005)

We can see that we have a large coefficient and a very small p-value, so Maize adoption seems to have a large effect on total credit. A solid finding!

… Or perhaps not, let’s dive a little deeper. As we only have two variables in this model (X and Y), we can easily visualize what our regression looks like:

5.1 Plotting simple relationships and diagnosing our regression

plot(sdat$Maize.adoption, sdat$TotalCredit,  pch = 16, cex = 1.3, col = "blue", main = "Plot of linear regression", xlab="Maize adoption fraction (0-1)", ylab="Total Credit (Kenya shillings)")
abline(lm(TotalCredit~ Maize.adoption ,data=sdat), lwd=5) 

The straight black line represents our model fit, whilst the points represent the actual data values. Hopefully you can already see that this isn’t looking too healthy (the most obvious flaw from this graph is that most of our points can be found at 1.0 (100%) Maize adoption, i.e. our data is censored). It also looks like our line isn’t quite equidistant from all our points (i.e. our residuals look unhealthy).

Remember the key requirements of a good model include:

  • model is capable of explaining both low and high Y values

  • model can accurately provide confidence intervals to inform our understanding of variability

  • model is not distorted by outliers

  • given a set of Xs, our model can accurately predict Ys

Let’s examine the diagnostic plots for this model to understand to what extent we are meeting these requirements:

par(mfrow=c(2,2))
plot(bad_reg)

Let’s take these plots one by one and diagnose them:

plot(bad_reg, which=1)

  • For our plot above, the red line is reasonably flat (excluding the left-most point, which is driving the red line’s trend), but we might have a small relationship between the fitted values and the residuals (i.e. our data might be slightly heteroskedastic).
plot(bad_reg, which=2)

  • For the plot above, we can see trend deviations away from the perfect normal at both high and low quantiles, which suggests our residuals are not normally distributed.
plot(bad_reg, which=3)

  • Once again, it looks like our residuals depend somewhat on our fitted value (note the cone-like shape of points at the right side of the plot)
plot(bad_reg, which=5)

  • Most points are low leverage (and within the Cook’s distance boundaries marked on the plot).

  • Point # 99 is a clear outlier with high leverage (although not alarming residuals), a single outlier isn’t too bad (we’ll see how to deal with these below).

Summarizing the findings from these plots we can say that our current regression suffers from non-normal residuals and slight heteroskedasticity (the leverage plot doesn’t look too bad, as our only outlier has a small residual).

5.2 Practical: Interpreting our regression

Let’s now plot a fitted vs actual graph to gain some insight into model performance:

par(mfrow=c(1,1))
#first, lets add a column of fitted values like this
sdat$fitted <- predict(bad_reg)

#let's XY plot them, and then add a new regression line for fitted vs real
plot(sdat$TotalCredit, sdat$fitted,  pch = 16, cex = 1.3, col = "blue", main = "Fitted vs real", xlab="Real", ylab="Fitted")
abline(lm(fitted~ TotalCredit ,data=sdat), lwd=5) 

This plot pretty quickly makes clear that our regression model is useless! Our fitted values occupy a far, far smaller range than our real values, and the model generally fails to predict the “true” values well.
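
A quick numeric check of that claim (a small sketch using the fitted column we just added):

range(sdat$fitted)       #the model's predictions span a narrow band
range(sdat$TotalCredit)  #the real loan sizes span a much wider range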

Now let’s revisit our regression results from earlier:

stargazer(bad_reg, type = "html",style="all", title="Simple regression example 1",notes="", notes.append=FALSE,star.char="", omit.table.layout = "na")
Simple regression example 1
Dependent variable:
TotalCredit
Maize.adoption 4,262.198
(1,186.132)
t = 3.593
p = 0.0005
Constant 5,554.371
(1,137.532)
t = 4.883
p = 0.00001
Observations 133
R2 0.090
Adjusted R2 0.083
Residual Std. Error 1,046.634 (df = 131)
F Statistic 12.912 (df = 1; 131) (p = 0.0005)

How much do you trust these values now? How willing would you be to stake farmer welfare on these results?

… Maybe we shouldn’t have been so excited after all!

5.3 Summary

  • Regressions require skill and time! The more regressions you run, the faster you will be at diagnosing (and fixing) them.

  • Make sure to check the diagnostic plots for model quality (the 4 from R and the fitted vs actual plot)

  • It is incredibly important that you start to provide diagnostic plots as standard practice when sharing/publishing any and all One Acre Fund results! Without these plots I (and future analysts) cannot understand the robustness of your work. Providing all the diagnostic plots (and decisions) will ensure that your work can be used years down the line!

  • 95% of One Acre Fund analyses I have seen fail to report any diagnostic plots. Such analyses are useless for decision making, and represent a significant investment in time by an analyst that is subsequently wasted.

  • Also note, it is a common false assumption at One Acre Fund that your data needs to be normal for a linear regression. This is false: your residuals need to be normal; the shape of your data doesn’t matter (except for how it influences your residuals)! A tiny demonstration of this point follows.
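
A tiny sketch of that last point, with simulated (and deliberately skewed) data; the variable names are made up for the example:

set.seed(7)
skewX <- rexp(200, rate=0.5)               #a heavily skewed, very non-normal predictor
skewY <- 5 + 2*skewX + rnorm(200, 0, 1)    #Y inherits the skew, but the errors are normal
shapiro.test(rstandard(lm(skewY ~ skewX))) #it is the residuals we check, and they look fine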

6 Practical: Improving our NCM regression model

Hopefully it is now clear that a usable regression requires some “artful science”. Let’s see how to do this.

So, what could we have done to improve our regression? We generally have 2 options:

  • Add covariates - adding more factors will help to define the relationship and may improve our diagnostics. Adding factors will also change the interpretation of the model and so should be done thoughtfully. Note that sometimes OAF trials will have a pre-analysis plan outlining the models to be tested, we can build up to these models or alter them (reasonably) to achieve a trust-worthy model.

  • Transform the data - this is a tricky one. If there is a rational interpretation of the transformed variable then go ahead (e.g. taking the sqrt of a farmer’s acreage yields a reasonable metric - a length); if there is not a rational interpretation (e.g. \(TotalCredit^{-1.3}\)) then think twice. Once a y-variable has been transformed, you cannot merely back-transform the coefficient to make it interpretable; this is not how the underlying maths works.

Following the first strategy, let’s add just Treatment status as a covariate and re-evaluate the model:

creditreg <- lm(TotalCredit~ Maize.adoption + Treatment ,data=sdat)

par(mfrow=c(2,2))
plot(creditreg)

This looks better, but not great, specifically note the cone shaped residuals in the residuals vs fitted plot and the substantial trend deviations in the QQplot. Let’s test if our residuals are normal:

shapiro.test(resid(creditreg) )

    Shapiro-Wilk normality test

data:  resid(creditreg)
W = 0.97653, p-value = 0.02111

P-value is below 0.1, indicating non-normality. Let’s add another few factors which we rationally expect to influence TotalCredit:

creditreg2 <- lm(TotalCredit~ Maize.adoption + Maize.acres + Treatment + DistrictName + TotalEnrolledSeasons , data=sdat)

par(mfrow=c(2,2))
plot(creditreg2)

These are looking much healthier; note the Residuals vs Fitted and Scale-Location graphs in particular. However, our QQplot still looks a little weak:

plot(creditreg2, which=2)

I mentioned above that % (or fraction) data is a little tricky; this is because it is often censored at the lower or upper bound (i.e. it’s impossible to get a % lower than 0 or higher than 100 for maize adoption). We could run a Tobit regression here (which will be covered in the AMP FAQ); alternatively, we could transform our data.

sdat$logCredit = log(sdat$TotalCredit)


creditreg3 <- lm(logCredit~ Maize.adoption + Maize.acres + Treatment + DistrictName + TotalEnrolledSeasons , data=sdat)

par(mfrow=c(2,2))
plot(creditreg3)

shapiro.test(resid(creditreg3))

    Shapiro-Wilk normality test

data:  resid(creditreg3)
W = 0.98966, p-value = 0.4262

Taking a \(ln(TotalCredit)\) transformation clears up our remaining issues nicely and our QQplot is now close enough to normal (observe the reasonable Shapiro result).

However, remember our point from the theory section:

transforming the y-variable has also changed the interpretation of our coefficients.

The only transformation whose coefficients you can reasonably interpret (without further algebra) is the natural-log transformation (ln). In the case of a ln(Y) variable, coefficient * 100 gives the approximate percentage increase in Y from a one-unit increase in X; the exact change is \(100 \times (e^{\beta} - 1)\)%, and the approximation is only close for small coefficients.
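To see why this is only an approximation, compare the rule of thumb against the exact expression for a small and a large coefficient (the two values are taken from the regression table below):

# Approximate vs exact % change in Y for a one-unit change in X when the outcome is ln(Y).
b <- c(0.027, 0.604)                      # TotalEnrolledSeasons and Maize.acres coefficients
data.frame(coef       = b,
           approx_pct = 100 * b,              # the "coefficient * 100" rule of thumb
           exact_pct  = 100 * (exp(b) - 1))   # exact percentage change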

6.1 log(TotalCredit) regression

Let’s examine the results from our log transformed regression model:

stargazer(creditreg3, type = "html",style="all", title="ln(Y) regression model",notes="", notes.append=FALSE,star.char="*", omit.table.layout = "na")
ln(Y) regression model (dependent variable: logCredit)

Variable                       Coefficient   Std. Error   t-value    p-value
Maize.adoption                 0.091         (0.100)      0.910      0.365
Maize.acres                    0.604***      (0.082)      7.402      0.000
TreatmentT                     -0.011        (0.012)      -0.891     0.375
DistrictNameKakamega (South)   -0.095***     (0.018)      -5.161     0.00000
DistrictNameMigori             0.106***      (0.016)      6.460      0.000
DistrictNameTinderet           0.009         (0.035)      0.250      0.803
TotalEnrolledSeasons           0.027*        (0.014)      1.957      0.053
Constant                       8.711***      (0.075)      116.165    0.000

Observations: 133; R2: 0.710; Adjusted R2: 0.693; Residual Std. Error: 0.063 (df = 125); F Statistic: 43.644*** (df = 7; 125) (p = 0.000)

So interpreting the results above:

  • Non-significant factors (p > 0.05) include maize adoption and treatment

  • Significant factors: Maize acres, district, and (borderline) total enrolled seasons
  • Maize acres: for every additional acre, total credit taken increases by roughly 60% (0.604 * 100); because the coefficient is large, the exact figure, \(100 \times (e^{0.604} - 1) \approx 83\%\), is noticeably higher
  • Being in Migori district is associated with an ~11% increase in total credit (0.106 * 100)

6.2 Practical: Robust regressions

Finally, I want to quickly demonstrate the value of a robust regression. A robust regression is used when we have a few high-leverage points (remember point #99?) which might adversely affect our outputs (p-values, coefficients, etc.). A robust regression assigns a lower weight to these points (vs. other points), ensuring that the model isn't overly dependent on outliers. A robust regression isn't required for the TotalCredit regression (as we have no points with a high Cook's distance), but let us run through it as an example anyway.

Rather than calling lm from base R, we will call lmrob (from the robustbase family of packages, loaded here via robust; see details here).

library(robust, quietly = TRUE)
creditrob <-lmrob(formula = logCredit~ Maize.adoption + Maize.acres + Treatment + DistrictName + TotalEnrolledSeasons , data=sdat)

We will interpret the diagnostics similarly to the standard linear models above:

par(mfrow=c(2,2))
plot(creditrob)

These plots indicate a model performing as expected; note that the plot method for lmrob objects includes a fitted vs actual panel for us (and so it is included above).

6.3 Practical: Comparing models

A common issue in statistics is over-fitting. Briefly, over-fitting refers to the fact that adding parameters will always improve a model's in-sample fit by some amount, whether or not those parameters reflect a real relationship. In statistics we can ask the question: “Does the addition of parameter X result in a model that is significantly different to our original model?”. There are several metrics for comparing the quality of models with different numbers of parameters; in this lesson we will cover the ANOVA test, AIC and adjusted R-squared, all of which penalize models for additional parameters.
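As an aside, AIC is easy to compute for our lm fits. Note that AIC is only comparable between models fitted to the same response, so the log models cannot be compared against the untransformed models this way:

# Lower AIC indicates a better fit/complexity trade-off; both models below share
# the same response (TotalCredit), so the comparison is legitimate.
AIC(creditreg, creditreg2)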

Let's compare the models we have built thus far with the stargazer() command:

stargazer(creditreg, creditreg2, creditreg3, creditrob,type="html", title="Comparison of models", column.labels = c("Simple model","complex model", "log model","Robust log model"), dep.var.labels.include = FALSE, dep.var.caption = "Dependent variable: Total Credit ", digits=2)
Comparison of models (dependent variable: Total Credit; standard errors in parentheses)

                               Simple model    Complex model   Log model      Robust log model
                               (1) OLS         (2) OLS         (3) OLS        (4) MM-type linear
Maize.adoption                 5,007.96***     92.35           0.09           0.11
                               (1,282.17)      (977.58)        (0.10)         (0.10)
Maize.acres                                    6,082.33***     0.60***        0.60***
                                               (793.88)        (0.08)         (0.09)
TreatmentT                     -292.61         -59.28          -0.01          -0.02
                               (196.21)        (121.60)        (0.01)         (0.01)
DistrictNameKakamega (South)                   -936.62***      -0.09***       -0.09***
                                               (178.24)        (0.02)         (0.02)
DistrictNameMigori                             1,009.89***     0.11***        0.11***
                                               (159.29)        (0.02)         (0.01)
DistrictNameTinderet                           42.45           0.01           0.01
                                               (337.05)        (0.03)         (0.04)
TotalEnrolledSeasons                           258.07*         0.03*          0.03*
                                               (133.00)        (0.01)         (0.01)
Constant                       4,988.85***     5,849.94***     8.71***        8.69***
                               (1,194.07)      (729.89)        (0.07)         (0.07)
Observations                   133             133             133            133
R2                             0.11            0.70            0.71           0.74
Adjusted R2                    0.09            0.69            0.69           0.72
Residual Std. Error            1,041.78        610.34          0.06           0.06
                               (df = 130)      (df = 125)      (df = 125)     (df = 125)
F Statistic                    7.63***         42.60***        43.64***
                               (df = 2; 130)   (df = 7; 125)   (df = 7; 125)
Note: *p<0.1; **p<0.05; ***p<0.01

This is a large table filled with information, but note that it is just providing the information we have seen before side-by-side. We can deduce a few things from this table:

  • First, look at the Adjusted R2 and df (degrees of freedom). We want the highest R2 with the fewest df (we will test this more robustly below). We can see that our right-most three models have a large gain in R2 with just a small loss of df. Of these three, the dfs are equal, so we can simply pick the model with the highest R2: the robust log regression.

  • Secondly, note that the coefficients between the log model and robust log model are very similar. This is good (as we previously noted that a robust regression is likely unnecessary).

  • Thirdly, note the broad agreement in significance between the three right-most models (this is also reassuring). Also note that the left-most model has wildly different results to the other models; this is because (as we know) that model violates many assumptions of linear modelling.

Once again, if we had only run the left-most model and stopped there, we would have reached very different conclusions from the other (far better) models.

We can directly compare two models with different degrees of freedom (i.e. different numbers of parameters) using the ANOVA command to understand whether the change in degrees of freedom (dfs) warrants the gain in accuracy. For example when comparing models 1 and 2 (the far left models):

anova(creditreg,creditreg2)
Analysis of Variance Table

Model 1: TotalCredit ~ Maize.adoption + Treatment
Model 2: TotalCredit ~ Maize.adoption + Maize.acres + Treatment + DistrictName + 
    TotalEnrolledSeasons
  Res.Df       RSS Df Sum of Sq     F    Pr(>F)    
1    130 141089401                                 
2    125  46564353  5  94525049 50.75 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that model 2 is significantly better than model 1, and thus our addition of predictors is justified (models 2-4 all have equal dfs, so we can simply examine the unadjusted R-squared values alone). See the FAQ for more information on evaluating model fit.

The final inspection we can make is to compare the fitted vs. actual values for each of our models (comparing like ys). This will enable us to see any limitations in the models as well as make a judgement about the relative gain in accuracy for a model vs. its interpretability.

Comparing the two left-most models (which both have y=TotalCredit) shows a very clear difference in quality:

x1 = predict(creditreg)
x2 = predict(creditreg2)
x3= predict(creditreg3)
x4=predict(creditrob)

ggplot() + geom_point(aes(x=x1,y=sdat$TotalCredit, color="simple")) + geom_point(aes(x=x2,y=sdat$TotalCredit, color="complex")) + ylab("Actual") + xlab("fitted")

Comparing the two right-most models (y=ln(TotalCredit)) reveals very (very) subtle differences (note the largest shifts are on points furthest from the central mass):

ggplot() + geom_point(aes(x=x3,y=sdat$logCredit, color="log")) + geom_point(aes(x=x4,y=sdat$logCredit, color="robust log")) + ylab("Actual") + xlab("fitted")

Finally, let's compare our complex TotalCredit model with our complex log(TotalCredit) model (i.e. models 2 and 3). To do this we will exponentiate the fitted values produced by model 3:

x2 = predict(creditreg2)
x3= exp(predict(creditreg3))


ggplot() + geom_point(aes(x=x2,y=sdat$TotalCredit, color="Model2")) + geom_point(aes(x=x3,y=sdat$TotalCredit, color="Model3")) + ylab("Actual") + xlab("fitted")

r2_lin = summary(creditreg2)$r.squared

We can see that these models also produce highly similar fits.

Overall, the best model is Model 4 (the robust ln model). However, given the similarity of the fits, we could choose to use model 2 (the untransformed model) for its improved interpretability. The decision (and its justification) is up to the analyst.

We have now seen how to analyse clustered data using the summarization method outlined in AMP lesson 2. The next section will cover the analysis of clustered data at the individual level.

6.4 Summary

  • Regression is an artful science.

  • R-squared, p-value and coefficient aren’t the only regression outputs you should examine! Look at the F-statistics, the diagnostic plots, and the graphs of actual vs fitted to understand what your regression model is doing and where its limitations might be!

  • It’s entirely possible to have a good R-squared but a terrible model!

  • Interpreting coefficients from transformed y variables isn’t as simple as performing a back-transform!

  • Directly compare model statistics side-by-side with the stargazer function.

  • Compare model additions with the ANOVA test for goodness-of-fit.

  • The analyst must make a trade-off between model goodness-of-fit and interpretability. I would expect to see a section in every write up outlining, quantifying and justifying this trade off.

7 Regression analysis on clustered data - mixed effect models

In AMP lesson 2 we posited two different solutions to the problem of clustering. The first solution involved summarizing the data so that the unit of analysis and unit of inference match (see above). The second approach involves examining the raw (un-summarized) data itself.

Whilst the overall approach for un-summarized data will be similar to that of summarized data, there is one key difference which motivates us to use slightly different tools:

the clustering in our data means that each farmer is no longer a truly independent data point, and consequently our error estimates will be mis-specified.

As these error estimates are subsequently used in hypothesis testing, incorrect errors can lead to inaccurate p-values and false significance.

7.1 Enter the mixed effect model

To correct the effect of clustering on our errors, we will introduce some new terms into our linear equation. Two key concepts going forward will be:

  • Fixed effects: cluster-invariant effects - think of these as effects constant across individuals

  • Random effects: cluster-variant effects - think of this as compensating for the non-independence in clustered data. Random effects can include data measured at the individual level but randomized at a higher level (as here), but can also include repeated measurements from a single individual (think time-series data, or measuring the strength of someone’s left and right hands).

The key insight here is that in a linear model we had a fairly boring term \(e\):

  • \(Y = C + \beta . X + e\)

Which represented our error, and which we believed to be normally and independently (think randomly) distributed. In mixed modelling we will make our \(e\) term slightly more interesting as we are expecting it to correlate with cluster. In doing this we will allow our mixed model to fit a separate intercept for each site (whilst ensuring that the coefficients are the same). This is analogous to saying that whilst the intercept can differ, the overall rate of change is the same across sites. It is possible to also allow the coefficient to vary by site, but we will not cover that here.

So note that our original linear model above becomes slightly more interesting:

  • \(Y_{nj}= C + \beta . X + U_n + W_{nj} + e\)

We’ve now introduced terms to account for cluster-specific effects (\(U_n\)) and individual-within-cluster effects (\(W_{nj}\)) to yield our estimate for the individual-within-cluster response \(Y_{nj}\). Note that \(\beta . X\) is our unchanged fixed-effect, and \(e\) is still our i.i.d. error term. (This will hopefully become clearer when we look at some model outputs below.)

In a mixed model, we will retain many of the assumptions of linear modelling, e.g.:

  • Normal residuals
  • uncorrelated predictor variables
  • linear relationships

and add the assumption that:

  • random-effects are also normally distributed

The rest of this lesson will examine how to use and diagnose mixed models, first with some simple simulated data, and then with our NCM data at the farmer level.

In summary, the mixed effect model will essentially fit a series of parallel lines with different intercept terms across our data. We will use this distribution of lines to calculate a more accurate standard error of the mean which will yield more accurate p-values.

7.2 Summary

  • Mixed effects models rely on your understanding of the experiment set-up to identify random effects

  • Mixed effects operate under similar assumptions to linear regressions, with the added assumptions around random-effects

  • In a mixed model we retain our i.i.d. \(e\) error, and gain terms for between cluster variation

  • Mixed models will also be used in cases where measurements of a single individual are repeated

8 Theory: Understanding mixed modelling through simulation

Let’s demonstrate how these concepts work and how we can examine the assumptions with some simulation data before proceeding to the more complex NCM data.

set.seed(123)
library(knitr)    
 
indivs <- 12 #aka J
clusters <- 75 #aka N
 
tdf <- data.frame( Cluster = sort(rep(c(1:clusters),indivs)),        Indiv = rep(c(1:indivs),clusters) , x = rnorm(n = indivs*clusters,0,5) )

#we now have X

# lets get intercept diff
inte <- rnorm(clusters,0,50)
ntdf = c()
intey = 25 + inte

#assign a distinct intercept deviation to each cluster
for(i in seq(1:length(unique(tdf$Cluster))) ) {
  
  cc = unique(tdf$Cluster)[i]
  val = inte[i]
  #print(paste(i,cc,val))
  
  tx = subset(tdf, Cluster==cc)
  tx$inte = val
  
  ntdf = rbind(tx,ntdf)
  
  
}

library(knitr)        

#we now have an intercept deviation shared within clusters but not between


#make Y

# y = population intercept (25) + cluster intercept deviation + x * beta + iid noise
ntdf$y <- 25 + ntdf$inte + (ntdf$x * rnorm(clusters*indivs,3,0.5)) + rnorm(clusters*indivs,0,15)
kable(rbind(head(ntdf,3),tail(ntdf,3)), row.names = FALSE)
Cluster Indiv x inte y
75 1 -3.823030 49.30291 80.8144479
75 2 1.976479 49.30291 115.6467368
75 3 -4.952538 49.30291 29.7727725
1 10 -2.228310 -50.70571 -32.2330721
1 11 6.120409 -50.70571 0.6225679
1 12 1.799069 -50.70571 -25.8276100
d = ntdf

In the data.frame above we can see that we have a number of individuals grouped by cluster. Each individual has a unique X (normally distributed) and a beta (normally distributed), as well as a cluster-level intercept term which will result in normal deviations from the intercept. Our aim is to recover our real intercept (and other values):

  • \(intercept\) = 25 (this is the true population-level intercept)
  • \(X\) = a normal distribution centered on 0 (\(\sigma = 5\))
  • the cluster-level intercept deviations are drawn from a normal distribution (\(\mu = 0, \sigma = 50\))
  • \(\beta\) is a fixed effect (normally distributed, \(\mu = 3, \sigma = 0.5\))
  • \(e\) is our independent error term (i.e. uncorrelated with cluster) (\(\mu = 0, \sigma = 15\))

To model random and fixed effects together we will use the lme function from package nlme, which requires the following arguments:

  • fixed effects: as a formula of the type y ~ a + b
  • data: the dataframe associated with y,a,b.
  • random: the random effects we want to model:
  • to model only random intercepts we will use random=~1|colname
  • for multiple random intercepts we will specify them as a list, e.g. random=list(~1|colnameA, ~1|colnameB)

Additional arguments include:

  • weights: this argument can be used to specify heteroskedastic errors (not covered here)

  • method: the method to find a solution, we will use maximum likelihood ("ML")

  • na.action - what should be done with any NAs in the data (set to na.exclude)

So for our simulation above, our R code will be:

  • lme(fixed=y~ x ,data=d, random=~1|Cluster,na.action=na.exclude,method="ML")
library(nlme)

mixmo <- lme(fixed=y~ x ,data=d, random=~1|Cluster,na.action=na.exclude,method="ML")




#stargazer(mixmo, type = "html", title="Simple mixed model",notes="",  omit.table.layout = "na")
summary(mixmo)
Linear mixed-effects model fit by maximum likelihood
 Data: d 
       AIC      BIC    logLik
  7795.572 7814.782 -3893.786

Random effects:
 Formula: ~1 | Cluster
        (Intercept) Residual
StdDev:    53.03307 14.84361

Fixed effects: y ~ x 
                Value Std.Error  DF   t-value p-value
(Intercept) 23.629657  6.150534 824  3.841887   1e-04
x            3.185321  0.103734 824 30.706484   0e+00
 Correlation: 
  (Intr)
x -0.002

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.97175154 -0.65712377  0.01927532  0.62663978  3.05934749 

Number of Observations: 900
Number of Groups: 75 
x = 0.5/sqrt(dim(d)[1])

Looking at the output, we can ask how our results compare to the equation we specified:

  • Equation specification:
    • Intercept = 25
    • \(\beta\) = 3
    • i.i.d error = \(15\)
    • our random effect is defined as ntdf$inte - which was rnorm(Clusters,0,50).
    • Please make sure you understand the components of the rnorm() command!
  • Model output:
    • random-effect (intercept) StdDev = 53.03 (true value: 50)
    • Unexplained residual StdDev = 14.84 (true value: 15)
    • \(\beta\) = 3.185, SE = 0.104 (true value: 3)
    • Intercept = 23.63, SE = 6.15 (true value: 25)
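Rather than reading these values off the printed summary, they can be pulled out of the fitted object directly; a minimal sketch using standard nlme accessors:

fixef(mixmo)      # fixed-effect estimates (Intercept and x)
VarCorr(mixmo)    # random-intercept and residual standard deviations
intervals(mixmo)  # approximate 95% confidence intervals for all of the above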

We can also use the scoring metrics supplied to evaluate this model against simpler or more complex models:

  • AIC is a score that trades-off goodness-of-fit vs. model complexity. This makes it a good, simple, tool to compare model quality.
  • BIC is closely related to the AIC and used similarly.
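For reference, both criteria combine the maximized log-likelihood \(\hat{L}\) with a penalty for the number of estimated parameters \(k\) (and, for BIC, the number of observations \(n\)):

  • \(AIC = 2k - 2\ln(\hat{L})\)

  • \(BIC = k\ln(n) - 2\ln(\hat{L})\)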

Both BIC and AIC are measures of model quality, where a lower value represents a better model. Let’s use AIC scoring now to compare our mixed model with a fixed-effect only model:

fmo <-  gls(y ~ x, method="ML", d)

f1 = AIC(mixmo,fmo)
rownames(f1) <- c("Mixed.model","Fixed.model")
kable(f1)
df AIC
Mixed.model 4 7795.572
Fixed.model 3 9774.385

We can see that the AIC score for our mixed model is substantially lower than our fixed model, with only a single extra degree of freedom. This suggests we are warranted in adding random effects to our model.

We can run an ANOVA (using anova.lme from nlme) to directly test whether the mixed model is a significant improvement on the fixed-effects-only model (a likelihood-ratio test):

#formatting for prettiness
an <- anova.lme(fmo,mixmo)
row.names( an ) <- c("Fixed model","Mixed model")
an$call = NULL
options(knitr.kable.NA = '')
an$`p-value` = format(an$`p-value`, scientific=TRUE, na.encode=TRUE)
an$`p-value`[1] <- NA
kable(an, caption = "Comparison of models", digits=c(0,0,0,0,0))
Comparison of models
Model df AIC BIC logLik Test L.Ratio p-value
Fixed model 1 3 9774 9789 -4884
Mixed model 2 4 7796 7815 -3894 1 vs 2 1981 0e+00

The small p-value indicates that the mixed model fits significantly better than the fixed-effects-only model (note that the p-value corresponds to the “Test”/L.Ratio columns).

Let’s now take a closer look at the outputs from mixed-models and fixed-models to understand the implications of using an incorrect modeling approach:

library(knitr)
tab <- data.frame("Model" = c("Fixed","Mixed"), "Intercept"=c(coef(summary(fmo))[1],coef(summary(mixmo))[1]),"Intercept error"=c(coef(summary(fmo))[3],coef(summary(mixmo))[3]), "X"=c(coef(summary(fmo))[2],coef(summary(mixmo))[2]), "X error"=c(coef(summary(fmo))[4], coef(summary(mixmo))[4]))

kable(tab, digits=2)
Model Intercept Intercept.error X X.error
Fixed 23.59 1.84 3.60 0.37
Mixed 23.63 6.15 3.19 0.10

This table makes it clear that, had we used a simple linear model, we would have under-estimated our intercept standard error roughly three-fold! Additionally, note:

  • Our \(\beta\) coefficient for X is mostly unchanged between models, this is expected as we defined that our clusters would differ at the intercept term (not the \(\beta\) term).

  • Our intercept error is hugely underestimated by a simple linear model, this would have knock-on implications for our p-value testing, and thus our recommendations.

Now that we understand the mixed-model outputs, let's turn to evaluating the quality of our model (against the assumptions outlined at the start of this section).

8.1 Mixed models; diagnostics

There is no built-in R function that will plot a suite of diagnostic graphs for an lme object (as plot() does for lm); instead, I have included my own function plot_lme which will automatically plot diagnostics from the model. plot_lme takes a few arguments:

  • fob: the lme object
  • realdata: a vector of real data (i.e. the real Y)
  • df: the dataframe used
  • namey: the names of any predictor variables included in the equation.
# load lib
library(nlme)


# A function to plot a set of lme diagnostics, analogous to plot() on an lm object.
#   fob      - the fitted lme object
#   realdata - the vector of actual y values (currently unused by the plots below)
#   df       - the data.frame the model was fitted on
#   namey    - names of the (numeric) predictors, used for the leverage estimate
plot_lme = function(fob, realdata, df, namey){

  # standardized residuals
  stdr = residuals(fob) / sd(residuals(fob))

  par(mfrow = c(2, 3))

  # 1. QQplot of the standardized residuals
  qqnorm(stdr, ylab = 'Standardized residuals', main = "residuals QQplot")
  qqline(stdr)

  # 2. random effects (one intercept per cluster) vs cluster index
  ran = ranef(fob)
  plot(ran$`(Intercept)`, seq(1:length(ran$`(Intercept)`)),
       main = "random effects", ylab = "Site number", xlab = "Random effect")

  # 3. fitted vs standardized residuals
  plot(fitted(fob), stdr, main = "fitted vs residuals",
       xlab = "Fitted", ylab = "standardized residuals")

  # 4. estimated leverage: the diagonal of the hat matrix X (X'X)^-1 X',
  #    built from the numeric predictors named in namey only
  ma = as.matrix(df[, c(namey)])
  hatty = ma %*% solve(t(ma) %*% ma) %*% t(ma)
  plot(diag(hatty), stdr, main = "Leverage-residuals",
       xlab = "est.Leverage", ylab = "residuals")

  # 5. residuals vs observation index
  res_lme = residuals(fob)
  plot(res_lme, main = "Residuals vs index", ylab = "resid")
  abline(a = 0, b = 0, lwd = 4, col = "red")

  # 6. QQplot of the random effects
  qqnorm(ranef(fob)$`(Intercept)`, main = "Random effects QQplot")
  qqline(ranef(fob)$`(Intercept)`)

}

plot_lme will then produce the following graphs:

plot_lme(mixmo, d$y, d, c("x"))

From top-left to bottom right:

  • QQplot of residuals: This should be interpreted as in a linear model

  • Scatter plot of random effects: This is a plot of the random effects vs cluster ID. We are hoping to see a random scatter of data points indicating that the random effects are independently distributed

  • fitted vs residuals: This should be interpreted as in a linear model

  • (estimated) Leverage vs residuals: This should be interpreted as in a linear model

  • residuals vs index: This should be interpreted as in a linear model

  • QQplot random effects: This plot should be used to determine if the random effects are normally distributed, in conjunction with the random-effects plot.

Finally we can plot the fitted vs actual values for our mixed model and our LM model:

d$fitt = fitted(mixmo)
d$fmol = fmo$fitted
g1 = ggplot(d) + geom_point(aes(x=fitt, y=d$y), color="blue") + xlab("Fitted") + ylab("Actual") + ggtitle("Fitted vs actuals - mixed model") + xlim(-25,75) + ylim(-25,75) + geom_smooth(aes(x=fitt,y=d$y),method="lm")

g2 = ggplot(d) + geom_point(aes(x=fmol, y=d$y), color="red") + xlab("Fitted") + ylab("Actual") + ggtitle("Fitted vs actuals - LM model") + xlim(-25,75) + ylim(-25,75) + geom_smooth(aes(x=fmol,y=d$y),method="lm")

multiplot(g1,g2,cols=2)

This looks great, the mixed-model performs well at predicting all values of Y. Clearly our mixed model has done a good job of separating our fixed and random effects!

This is a very simple example of a mixed model where only our intercept is random, and where there is no relationship between our intercept and slope. Real data will often be less forgiving!
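For reference, allowing the slope to vary by cluster as well only changes the random argument. A minimal sketch is below; note that our simulated data contains no cluster-level slope variation, so this richer model should not fit meaningfully better (and may not converge cleanly):

# Sketch: random intercept AND random slope for x within each Cluster.
mixmo_slope <- lme(fixed = y ~ x, data = d, random = ~ x | Cluster,
                   na.action = na.exclude, method = "ML")
anova(mixmo, mixmo_slope)  # does the extra random term earn its keep?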

8.1.1 Summary

  • Mixed effect models contain both random and fixed effects

  • In this context, random effects refers to the result of randomizing at the cluster level but measuring at an individual level

  • Similar techniques are also applicable to data where multiple readings are taken from single subjects (this will be covered in the upcoming AMP FAQ)

  • We can specify random slopes, or intercepts, or both. We can F-test models to understand which additions satisfy the trade-off between goodness-of-fit and complexity.

  • As well as the diagnostic plots we saw in lm modelling, we additionally need to check the distribution of random effects from lme.

  • You can see some more advanced simulations here.

9 Practical: mixed effect modelling NCM

We now have a much more complicated data set to model: the non-compulsory maize trial results. Remember that these data were collected at the farmer level but randomization occurred at the Site level.

Let’s recall what our un-summarized NCM data looks like:

str(dat)
'data.frame':   22396 obs. of  13 variables:
 $ OAFID               : int  43141 42509 44278 44284 44280 44283 44282 44271 44268 44272 ...
 $ solar               : num  0 1 1 1 1 1 1 1 0 1 ...
 $ GroupName           : Factor w/ 2290 levels "Abalibayo","Abeingo",..: 1484 2032 1583 1583 1583 1583 1583 1484 1484 1484 ...
 $ X..Repaid           : num  27.1 3.5 49.3 27.9 16.8 26.1 20.4 19.4 26.1 10.8 ...
 $ TotalEnrolledSeasons: int  2 2 1 1 1 1 1 1 1 1 ...
 $ NewMember           : Factor w/ 2 levels "False","True": 1 1 2 2 2 2 2 2 2 2 ...
 $ Treatment           : Factor w/ 2 levels "C","T": 2 2 2 2 2 2 2 2 2 2 ...
 $ DistrictName        : Factor w/ 4 levels "Kabondo","Kakamega (South)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SiteName            : Factor w/ 133 levels "Achuth","Ainapngetuny",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Maize.acres         : num  0.25 0.5 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 ...
 $ Cookstove.qty       : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Maize.adoption      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ TotalCredit         : int  7520 14065 8775 10215 13795 7870 11260 14895 5920 14650 ...

We can see that we have 22396 rows of data, representing one row per farmer. Let’s run an lme model, using the same fixed effects as our final lm model in the first half of this lesson.

ctrl<-lmeControl(maxIter=400,msMaxIter=400, niterEM=400)
#vf1<-varIdent(form= ~1|DistrictName)
#vf2= varPower(form = ~ TotalEnrolledSeason) 
cf<-formula(TotalCredit~  Maize.acres + TotalEnrolledSeasons  + Treatment + DistrictName  + Maize.adoption + solar )

model_lme <- lme(cf,data=dat, random=~1|SiteName,control=ctrl,na.action=na.exclude,method="ML")

small <- lme(formula(TotalCredit~  Maize.acres + TotalEnrolledSeasons  + Treatment + DistrictName   + Maize.adoption ),data=dat, random=~1|SiteName,control=ctrl,na.action=na.exclude,method="ML")



m2 <- aov(TotalCredit~  Maize.acres + TotalEnrolledSeasons + NewMember + Treatment + DistrictName   , data = dat)

 


# xl = lm(fitted(model_lme) ~dat$Maize.acres)
# plot(dat$Maize.acres,fitted(model_lme) )
# abline(xl)

Before we turn to the model outputs, let’s look at our diagnostic plots:

plot_lme(model_lme, dat$TotalCredit, dat, c('TotalEnrolledSeasons', 'Maize.adoption','Maize.acres'))

# library(zoo)
# ggplot() + geom_line(aes(y=rollmean(fitted(model_lme)-dat$TotalCredit,100), x= seq(1:length(rollmean(fitted(model_lme)-dat$TotalCredit,100)))) )
# plot(rollmean(fitted(model_lme)-dat$TotalCredit,100))

The random effects QQplot shows that our random effects are approximately normally distributed (with a few notable outliers). However, the most concerning plots here are the residuals QQplot and the fitted vs residuals plot, both of which show that our model residuals are non-normal. Let’s examine this in more depth with a fitted vs actuals plot:

d = data.frame(fitt=as.numeric(fitted(model_lme)) , y=as.numeric(dat$TotalCredit))




ggplot(d) +geom_point(aes(x=fitt, y=d$y), color="blue")+ xlab("Fitted") + ylab("Actual") + ggtitle("Fitted vs actuals") + geom_smooth(aes(x=fitt,y=d$y),method="lm") +xlim(0,30000) + ylim(0,30000)

xl = lm(fitted(model_lme) ~dat$TotalCredit)
r2_mixmo = summary(xl)$r.squared

The plot above represents a quick summary of our model performance. Note that the regression line appears to be off center from the mass of points. Let’s try looking at this another way for clarity:

d$resi = (d$y-d$fitt)

ggplot(d) + geom_point(aes(x=y,y=resi), color="#56B4E9", alpha=0.8) + ylab("Residual") + xlab("Total Credit (actual)") + geom_smooth(aes(x=y,y=resi),method="lm", se=TRUE, n=100) 

This plot shows some important issues with our model:

  • Our model is biased towards lower Total Credit values (where there are more points, and where minimization of residuals is more effective)

  • There is some heteroskedasticity in our model; however, this is mostly generated by the model under-estimating higher TotalCredit values.

This suggests our model is performing poorly at medium and high TotalCredit values (this, in essence, is also what our diagnostics above show). Note that simply examining the r-squared of the model (0.726) would be misleading. It is possible to have a good r-squared but a terrible model.

Let’s use some creative plotting to show these trends with more clarity. To do this, we will plot another “fitted vs actual” and “fitted vs residuals” plot, but with summarized data to make the graphs easier to read. The code below will plot the average fitted values for each actual TotalCredit value:

library(zoo)
new = data.frame("actual"=round(dat$TotalCredit,-3), "fitted"=as.numeric(fitted(model_lme)) )

new$diff = dat$TotalCredit - as.numeric(fitted(model_lme))
xm = new %>%
   group_by(actual) %>%
   summarise(yy=mean(fitted), ydiff=mean(diff), ylen=length(fitted))

xm$ylen = (xm$ylen / sum(xm$ylen) )*100


g1 =ggplot() + geom_point(aes(xm$actual, xm$yy),color="blue") + xlab("Actual (rounded)") + ylab("mean(fitted)") + xlim(0,30000) + ylim(0,30000) + geom_abline(size=1.5) + annotate("text", x = 26000, y = 30000, label = "abline")

g2 = ggplot() + geom_point(aes(xm$yy, xm$ydiff),color="blue") + xlab("Mean(fitted)") + ylab("Mean(residuals)")

p<-ggplot(data=xm, aes(x=actual, y=ylen)) +  geom_bar(stat="identity")+ xlab("Actual (rounded)") + ylab("Frequency (%)")


multiplot(g1,g2,cols=2)

multiplot(p,cols=2)

These graphs let us see three things easily:

  • Our fits for high TotalCredit scores are biased down, leading to under-estimates of high Total Credit values

  • Our residuals increase at higher predictions

  • We have a much larger mass of data points at low TotalCredit than high TotalCredit

Together, these graphs suggest that we have a large mass of data at low TotalCredit values that is biasing our model, causing it to under-estimate high Total Credit values.

We can see this in the following graph:

d <- subset(dat, TotalCredit > 15000)

model_lmesub <- lme(cf,data=d, random=~1|SiteName,control=ctrl,na.action=na.exclude,method="ML")


da = data.frame(fitt=as.numeric(fitted(model_lmesub)) , y=as.numeric(d$TotalCredit))




ggplot(da) +geom_point(aes(y=fitt, x=da$y), color="blue")+ ylab("Fitted") + xlab("Actual") + ggtitle("Fitted vs actuals") + geom_smooth(aes(y=fitt,x=da$y),method="lm") +xlim(15000,30000) + ylim(15000,30000)

Observe that around \(y=17000\) and \(y=19000\) we see two parallel clusters of points. These clusters remain stubbornly at \(y=17000\) and \(y=19000\), whilst our actual TotalCredit varies from (approximately) 15,000 to 20,000. I suspect it is these clusters that are causing the behavior examined above.

At this point, we would want to explore any additional variables that might help to further discriminate between farmers with lower TotalCredit values. However, our data set does not contain additional variables for us to include. Alternatively, we could weight points with higher TotalCredit more heavily than points with low TotalCredit; this would improve performance at higher TotalCredit values at the expense of lower TotalCredit values (where the majority of our farmer data points actually lie).

Weighting our data isn't attractive, and we have no additional data to add to our model, so we are unable to improve the model any further. We will proceed by interpreting our model with the caveats outlined above in mind (for reference, a sketch of how such a weighting could be expressed is shown below).
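Here is one hypothetical way such a weighting could be expressed through the weights argument of lme. This is a sketch only and is not used in this analysis; among other things, zero values of TotalCredit would need special handling:

# Hypothetical sketch: a fixed negative variance power makes the residual variance
# shrink as TotalCredit grows, i.e. it up-weights high-TotalCredit observations.
model_wt <- lme(cf, data = dat, random = ~1 | SiteName,
                weights = varPower(form = ~ TotalCredit, fixed = -0.5),
                control = ctrl, na.action = na.exclude, method = "ML")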

It is worthwhile to note at this point that all models are flawed, but it is important to investigate exactly where these flaws are and how they might change your interpretation.

Our mixed model is clearly capable of explaining a large amount of the variance in our data; however, it will systematically under-estimate larger TotalCredit values.

Before we conclude the mixed-modelling portion of the lesson, let us look at one more fundamental question: was the inclusion of random effects statistically warranted, i.e. do we have evidence for group-level variation in transaction size?

9.1 Evaluating mixed effects

To understand whether the inclusion of random effects is statistically warranted in our model, we will compare the predictions and AIC scores of our mixed model, with a simpler fixed-effect only model.

First, let’s examine the fits of our mixed model and our fixed model (we’ll first plot all the data, and then some random subsets for clarity):

fmod <-  gls(cf, method="ML", dat)



df = data.frame(real=dat$TotalCredit, fix=fitted(fmod), mix=fitted(model_lme))


ggplot(sample_frac(df,1)) + geom_point(aes(x=real,y=fix,colour="fixed")) + geom_point(aes(x=real, y=mix,colour="mixed")) + xlab("Actual") + ylab("Fit")

set.seed(111)
g1 = ggplot(sample_frac(df,0.01)) + geom_point(aes(x=real,y=fix,colour="fixed")) + geom_point(aes(x=real, y=mix,colour="mixed")) + xlab("Actual") + ylab("Fit")

set.seed(222)
g2 = ggplot(sample_frac(df,0.05)) + geom_point(aes(x=real,y=fix,colour="fixed")) + geom_point(aes(x=real, y=mix,colour="mixed")) + xlab("Actual") + ylab("Fit")

multiplot(g1,g2,cols=2)

Hmm, it looks like there are some differences in fit quality. Let’s test model AICs with an ANOVA:

an2 = anova.lme(fmod,model_lme) 


row.names( an2 ) <- c("Fixed model","Mixed model")
an2$call = NULL
options(knitr.kable.NA = '')
#an2$`p-value` = format(an$`p-value`, scientific=TRUE, na.encode=TRUE)
#an2$`p-value`[1] <- NA
kable(an2, caption = "Comparison of models", digits=c(0,0,0,0,0))
Comparison of models
Model df AIC BIC logLik Test L.Ratio p-value
Fixed model 1 10 408062 408142 -204021
Mixed model 2 11 407515 407603 -203746 1 vs 2 549 0

We see a substantially lower AIC for our mixed model (and the improvement is statistically significant at the p < 0.05 level). This is a good indicator that we do indeed see group-level random effects (in this context, it means we have evidence for Site-level deviations from the population mean). Our model output suggests that random effects account for ~400 KES of the variation in our data (this information can be accessed via the summary() command, as stargazer doesn't support random-effect reporting).
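Rather than scanning the full summary() print-out, the Site-level standard deviation (the ~400 KES figure above) can be pulled out directly with standard nlme accessors:

VarCorr(model_lme)         # StdDev of the Site-level intercept and of the residual
summary(model_lme)$tTable  # fixed-effect coefficients, SEs, t- and p-values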

NB. Please note that AIC should not be minimized at all costs; rather, as with our LM model choice, it should be weighed against other aspects of model quality (diagnostics, interpretability).

Overall, then, we can say that our model does a reasonable job of explaining TotalCredit variance and that adding random effects is warranted, but that the model will tend to underestimate higher TotalCredit values. Let's now look at our model summary:

stargazer(model_lme, type = "html",style="all", title="Mixed model regression")
Mixed model regression (dependent variable: TotalCredit)

Variable                       Coefficient     Std. Error   t-value    p-value
Maize.acres                    9,452.531***    (54.362)     173.880    0.000
TotalEnrolledSeasons           360.691***      (11.158)     32.326     0.000
TreatmentT                     80.684          (81.150)     0.994      0.322
DistrictNameKakamega (South)   -609.307***     (107.360)    -5.675     0.00000
DistrictNameMigori             394.658***      (109.596)    3.601      0.0005
DistrictNameTinderet           -168.517        (142.908)    -1.179     0.241
Maize.adoption                 -192.767**      (82.832)     -2.327     0.020
solar                          3,721.025***    (27.571)     134.962    0.000
Constant                       2,427.642***    (117.315)    20.693     0.000

Observations: 22,396; Log Likelihood: -203,746.400; Akaike Inf. Crit.: 407,514.800; Bayesian Inf. Crit.: 407,603.000
Note: *p<0.1; **p<0.05; ***p<0.01

This model suggests that almost all effects are statistically significant, except for treatment! Note the large p-value for treatment status too.

Let's summarize our mixed-model learnings before the final part of this lesson, model selection.

9.2 Mixed effect models Summary

  • Mixed effect models are useful when we suspect clustering in our data (either by repeated measurement or group-level effects)

  • We can assess our data for clustering either through the ICC method (seen in AMP lesson 2) or through ANOVA testing fixed and mixed effect models

  • We have to check our models for the standard linear model assumptions (normal residuals, etc) as well as for normality of random effects.

  • We haven't dealt with correlated variables in mixed (or fixed) models yet; there is an excellent paper here, and this topic will be dealt with in AMP lesson 6

  • There is also a more advanced lesson on mixed models here, for those with better R and mathematical skills.

10 Model selection and results interpretation

At the conclusion of this lesson we have three sets of results:

  • Our hypothesis tests

  • Our summarized model, with an r-squared of 0.705

  • Our mixed model, with a (pseudo) r-squared of 0.726

All three of these results suggest that the intervention (non-compulsory maize) had no significant effect on TotalCredit (within the framework of limited power and MDEs discussed above).

However, we might want to understand some of the other factors that affect transaction size (TotalCredit) from our regression models:

#stargazer(model_lme, creditreg2)

stargazer(creditreg2, model_lme, type="html", title="Comparison of models", column.labels = c("Summarised model","Mixed model") , dep.var.labels.include = FALSE, dep.var.caption = "Dependent variable: Total Credit ", digits=2 , omit = c("solar","NewMember")) 
Comparison of models (dependent variable: Total Credit; standard errors in parentheses)

                               Summarised model        Mixed model
                               (1) OLS                 (2) linear mixed effects
Maize.adoption                 92.35                   -192.77**
                               (977.58)                (82.83)
Maize.acres                    6,082.33***             9,452.53***
                               (793.88)                (54.36)
TreatmentT                     -59.28                  80.68
                               (121.60)                (81.15)
DistrictNameKakamega (South)   -936.62***              -609.31***
                               (178.24)                (107.36)
DistrictNameMigori             1,009.89***             394.66***
                               (159.29)                (109.60)
DistrictNameTinderet           42.45                   -168.52
                               (337.05)                (142.91)
TotalEnrolledSeasons           258.07*                 360.69***
                               (133.00)                (11.16)
Constant                       5,849.94***             2,427.64***
                               (729.89)                (117.32)
Observations                   133                     22,396
R2                             0.70
Adjusted R2                    0.69
Log Likelihood                                         -203,746.40
Akaike Inf. Crit.                                      407,514.80
Bayesian Inf. Crit.                                    407,603.00
Residual Std. Error            610.34 (df = 125)
F Statistic                    42.60*** (df = 7; 125)
Note: *p<0.1; **p<0.05; ***p<0.01

We can note from the above that either formulation of the model results in treatment status being non-significant with regards to Total Credit. We can also see broad agreement in the significance of Maize.acres and District (and some agreement in TotalEnrolledSeasons). The most significant disagreement between these models is the effect of Maize.adoption on TotalCredit.

Contrasting the mixed model and summarized model, we can see that the summarized model yields a lower R-squared. However, our summarized model produces healthier diagnostics and does not show signs of bias (vs. our mixed model). When comparing models this graphic comes in handy:

knitr::include_graphics("./var_bias.png")

This diagram neatly sums up the common trade-offs we make when selecting models. Take a moment to look at the difference between bias and variance to understand what trade-off we are weighing up between our summarized model and mixed model.

In this case, I would be more willing to trust the results from our summarized model than our mixed-effect model, that is, I would choose a model with higher variance and lower bias.

Note, though, that whilst there is a difference in significance between the summarized model and the mixed-effect model, the overall magnitude of the Maize.adoption coefficient is very small (~200 KES, or 2 USD). Therefore the importance of this finding, one way or another, is minimal, and we can dismiss it as an unimportant driver of transaction size (regardless of its p-value).

Overall, the summarized model produces healthier diagnostics and is easily interpretable. Therefore this is our model of choice for this analysis:

stargazer(creditreg2, type="html", title="Model of choice" , dep.var.labels.include = FALSE, dep.var.caption = "Dependent variable: Total Credit ", digits=2 , omit = c("solar","NewMember")) 
Model of choice (dependent variable: Total Credit)

Variable                       Coefficient    Std. Error
Maize.adoption                 92.35          (977.58)
Maize.acres                    6,082.33***    (793.88)
TreatmentT                     -59.28         (121.60)
DistrictNameKakamega (South)   -936.62***     (178.24)
DistrictNameMigori             1,009.89***    (159.29)
DistrictNameTinderet           42.45          (337.05)
TotalEnrolledSeasons           258.07*        (133.00)
Constant                       5,849.94***    (729.89)

Observations: 133; R2: 0.70; Adjusted R2: 0.69; Residual Std. Error: 610.34 (df = 125); F Statistic: 42.60*** (df = 7; 125)
Note: *p<0.1; **p<0.05; ***p<0.01

11 Lesson summary

This has been a large lesson, but the key take-homes are:

  • Power calculations and MDEs will enable you to understand the limitations of your study, both for hypothesis testing and regression analysis

  • Regression assumptions and diagnostics are incredibly important. Remember to check your regression for:
    • Normal residuals
    • Linear relationships
    • Lack of high-leverage points
    • Normal random effects (if they’re included)
  • We define a good model as one that:
    • Explains the full range of the outcome variable Y
    • Is unbiased
    • Produces robust confidence intervals
  • Present all relevant materials (diagnostics, justifications) in your reports; you can place these in the appendix.

12 Analysis check-list

When conducting an analysis, be sure to check and include the following:

  • Check hypotheses are robustly defined
  • Check power and MDEs for each key hypothesis
  • Hypothesis test each hypothesis with the correct test!
  • Check the pre-analysis plan for guidance on whether the analysis is to be carried out by summarization or mixed modelling (most likely, it will be the former)
  • Carry out regression analysis, testing assumptions with model-plots as you go
  • Keep these tests/plots in an appendix for future readers!
  • Test model quality through ANOVA testing to ensure model additions are warranted.
  • Clearly state any model caveats or limitations in the main body of the text!
  • Present any justifications for analysis or model choices (even if it seems obvious) in the appendix.