1 AMP lesson: RCT analysis part 2

1.1 Lesson objectives:

We previously covered the key parts of RCT analysis in AMP lesson 3, where we used hypothesis testing and linear regression to understand the effects of a 2016 Kenya program innovation trial.

We will now turn to a second major set of analyses that are common at One Acre Fund: the analysis of historical data. Historical data refers to any non-RCT data, and could include analysing country-wide repayment data to find key correlations, or analysing loan components to understand product adoption trends. At One Acre Fund this is often called “driver analysis”; however, that name is misleading, as “driver” implies an understanding of causality, which we cannot get from non-RCT data.

Whilst historical data cannot inform us about causality, it is incredibly useful for generating new hypotheses, and for assessing our understanding of the relationships between variables.

In this lesson we will analyse synthetic OAF data using two new tools: logistic regression and mixed-effects logistic regression. Logistic regression is useful when your outcome variable is binary (e.g. anything with a true/false or yes/no outcome). Whilst the context for this lesson will be non-RCT data, the tools are readily applied to RCT data.

The aim of this lesson, therefore, is to both introduce some new tools, and to demonstrate how our interpretation of results changes in a non-RCT vs. an RCT setting.

As usual, I will explain the steps we are taking and the logic behind them; after this lesson you should be fully able to analyse a wide variety of One Acre Fund data sets.

1.2 Glossary and key terms:

  • Continuous, categorical, ordinal variables - types of variables common in data analysis (see link)

  • P-value: The probability of observing a result at least as extreme as the one seen, assuming there is no real difference between populations (i.e. assuming the null hypothesis is true)

  • Power: The probability that we detect a real difference between two populations, given a certain sample size, distribution and error rate. Note that power also affects how robust our results are.

  • Effect size: The standardized difference between 2 groups, often expressed in units of standard deviations; an effect size of 1 is equal to a difference of 1 standard deviation.

  • Hypothesis test: a statistical test with two parts: a null hypothesis (often called H0), which says that the two populations are the same (i.e. there is 0 difference), and an alternative hypothesis (often called H1), which says that there is a difference between the populations.

  • Odds-ratio (OR): The OR represents the odds that an outcome will occur given a particular variable, compared to the odds of the outcome occurring in the absence of that variable

  • Logistic regression: Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
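
To make the odds ratio concrete, here is a small worked example in R (the adoption probabilities are made up purely for illustration):

p_f <- 0.6   # hypothetical adoption probability for group f
p_m <- 0.2   # hypothetical adoption probability for group m
odds_f <- p_f / (1 - p_f)   # 1.5
odds_m <- p_m / (1 - p_m)   # 0.25
odds_f / odds_m             # OR = 6: group f has 6 times the odds of group m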

2 Theory: logistic regression

2.1 Inspecting the data

We will introduce logistic regression using a synthetic data set that neatly resembles real OAF data. In this scenario, we have rolled out a new solar home system across four districts in Kenya, and now want to understand the relationships between adoption of solar and other variables of interest. Note that as this is not an RCT, we have to be very careful with our causality statements; essentially, we will be able to say that A correlates with B, but not that A causes B.

Let’s look at the data set:

dat <- read.csv("AMP4_data.csv")
str(dat)
'data.frame':   891 obs. of  7 variables:
 $ OAFID      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Solar      : Factor w/ 2 levels "N","Y": 1 2 2 2 1 1 1 1 2 2 ...
 $ Acres      : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Gender     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Expenditure: num  4.55 2.63 3.85 2.86 2.86 ...
 $ Repayment  : num  97.8 84.1 96.1 98.6 96 ...
 $ District   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
  • OAFID - a unique ID for each and every farmer
  • Solar - Y/N for whether the farmer purchased a solar home system in 2017
  • Acres - the combined (rented and owned) total land size available for planting in 2017
  • Gender - the gender of the farmer
  • Expenditure - weekly expenditure (US dollars) on lighting
  • Repayment - Loan repayment percentage at season end
  • District - Single letter identifier for district

We want to understand the factors associated with solar adoption; this makes Solar our outcome variable. The first thing we will want to check is how balanced our data is. Severely unbalanced data (e.g. most events are “Y” rather than “N”) can be fatal to a logistic regression, so this is a key assumption we will want to explore before we proceed:

table(dat$Solar)

  N   Y 
549 342 

We can see that whilst our data is somewhat unbalanced, it is not severely so. Approximately 40% of our farmers in the data set adopted solar. If we were to see a 20/80 or 30/70 split, then we would want to explore alternative tools for analysis.
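
To see the split as proportions rather than counts, we can wrap the table in prop.table:

prop.table(table(dat$Solar))

        N         Y 
0.6161616 0.3838384 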

As in AMP lesson 1, we will want to start by exploring the data, beginning with some box plots for the continuous variables:

library(ggplot2)
library(Rmisc)   # provides multiplot(); the Cookbook for R helper of the same name also works

# boxplot of Expenditure by solar adoption; remove the redundant legend
g2 <- ggplot(dat, aes(x=Solar, y=Expenditure, fill=Solar)) + geom_boxplot() +
    guides(fill=FALSE) + ylab("Expenditure")

# boxplot of Repayment by solar adoption; remove the redundant legend
g3 <- ggplot(dat, aes(x=Solar, y=Repayment, fill=Solar)) + geom_boxplot() +
    guides(fill=FALSE) + ylab("Repayment")

multiplot(g2, g3, cols=2)

There are some subtle differences here: it looks like there may be a weak relationship between expenditure and solar adoption, whilst it’s difficult to see any relationship between solar and repayment percentage.

We can also explore factor-factor relationships with a cross-table:

xtabs(~Solar  + Gender, data = dat)
     Gender
Solar female male
    N     81  468
    Y    233  109

This table suggests that female farmers are much more likely than male farmers to adopt solar.

Remember that whilst it is important to explore the underlying data (for interesting trends, outliers etc), only a regression will be able to pick apart relationships and so give estimates for a variable in isolation.

2.2 Theory: Running a logistic regression

Now we have explored the data, let us begin building a rudimentary logistic regression model.

Logistic regression works by using a logistic distribution to describe the probability of an event occurring:
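
If you want to generate a graph like the one described below, here is a sketch using simulated (entirely made-up) xvar/yvar data, fitting a logistic curve with ggplot2:

# simulate hypothetical data: yvar is almost always 1 at low xvar and 0 at high xvar
set.seed(42)
xvar <- runif(300, 0, 300)
yvar <- rbinom(300, 1, plogis((125 - xvar) / 12))
sim  <- data.frame(xvar, yvar)

ggplot(sim, aes(x = xvar, y = yvar)) +
    geom_point(shape = 1) +                      # open circles = data points
    stat_smooth(method = "glm", method.args = list(family = "binomial"),
                se = FALSE, colour = "green")    # green line = fitted class probability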

In the above graph, yvar is our categorical variable (e.g. a Y/N, True/False, Healthy/Sick event that takes on coded binary values). This variable can only take values of 0 or 1. We can see that xvar shows some relationship to yvar, and we have fitted a logistic probability distribution to the variables.

From the graph above, note:

  • The open circles represent data points, and the green line represents the class probability at a given value of ‘xvar’
    • for xvar>200, we are pretty sure yvar=0
    • for xvar<60, we are pretty sure yvar=1
    • between xvar=100 and xvar=150 we have a very steep slope where our probability rapidly changes

Unlike a linear regression, which outputs fitted values, our logistic regression will output the probability (i.e. the green line above) that each observation belongs to a certain class (Y/N, healthy/sick etc). We can see in the above graph that there is a fairly strong relationship between xvar and yvar; that is, there is a substantial difference in the green line at low vs. high xvar values.

Let’s now look at a weaker relationship, specifically solar ~ Expenditure from our synthetic OAF data - remember that we earlier hypothesized that Expenditure might correlate with solar based on the box plots we saw:
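
We can produce this plot directly from our data set; a minimal sketch, recoding Solar to 0/1 on the fly:

ggplot(dat, aes(x = Expenditure, y = as.numeric(Solar == "Y"))) +
    geom_point(shape = 1) +
    ylab("P(Solar = Y)") +
    stat_smooth(method = "glm", method.args = list(family = "binomial"),
                se = FALSE, colour = "green")    # a near-flat fitted probability curve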

Notice the stark difference between the two graphs! The relative flatness of this curve suggests there is little to no relationship between Solar and Expenditure. We can therefore expect that this variable will not be statistically significant (i.e. we have little evidence that the slope of the green line above is non-zero).

We can visualize the above relationships because we only have one dependent and one independent variable. When we start adding more variables it becomes more difficult to clearly visualize each relationship (hopefully this concept is familiar from linear regression, AMP lesson 3). Remember that logistic regression (like linear regression) gives us the effect of variable X1 when all other variables are held constant. It is this ability to separate out effects which makes regression incredibly useful for picking out trends and relationships.

To run our first logistic regression we will call the glm function from base R. The glm function expects several arguments:

  • formula: A formula for the relationship you want to test, e.g. Solar ~ Gender + Expenditure
  • family: The error distribution (and link) for the model, in this case binomial
  • data: The dataframe containing your data

Our first model will look only at Gender and Expenditure variables, and will be called as follows:

lg1 <- glm(Solar ~ Gender + Expenditure, data = dat, family = "binomial")

As with the linear models in AMP lesson 3, we will want to check our model’s performance to understand whether (and to what extent) we can trust the results. Unlike a linear regression, the output from a logistic regression requires a little more processing.

2.3 Logistic regression, classes and probabilities

As previously mentioned, when examining the logistic regression outputs, we are actually examining the probabilities that each row belongs to a certain class. In R, glm models the probability of the second factor level (which here is Solar=Y).
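
You can confirm which class the probabilities refer to by checking the factor levels; glm treats the first level as “failure” and the second as “success”:

levels(dat$Solar)         # "N" "Y" - so the fitted values are P(Solar = "Y")
head(lg1$fitted.values)   # one adoption probability per farmer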

Let’s examine a density plot of these raw probabilities:

ggplot() + geom_density(aes(lg1$fitted.values),alpha=.3, bw=0.02, fill="grey") + xlab("Class probability")

This is a good looking density plot! Notice that we have two, very clearly separated, humps of probabilities around ~0.2 and ~0.7. The gulf between these humps suggests that our model is doing a good job of discriminating between our classes.

To convert these probabilities into predicted classes (e.g. “Y” or “N”) we will need to draw a vertical line somewhere on that graph and call everything below that line “N” and everything above it “Y”. This is known as optimizing the cutoff and is a common challenge when regressing categorical variables.

If we set our boundary at 0.5 and convert our probabilities into classes we get the following cross table:

ndat <- dat

# save the fitted class probabilities
ndat$probs <- lg1$fitted.values

# convert probabilities into class labels using a 0.5 cutoff
ndat$pred_solar <- ifelse(ndat$probs >= 0.5, "Y", "N")

xtabs(~Solar + pred_solar, data=ndat)
     pred_solar
Solar   N   Y
    N 468  81
    Y 109 233

The table above is known as a confusion matrix: we have the actual classes on the left, and the predicted classes on the top. Our aim is to match up as many “N”s and “Y”s as possible and minimize the mistakes (i.e. when a solar “N” is predicted as a “Y”).

Note that each quadrant of our confusion matrix has a special name:

knitr::include_graphics("./confusion_matrix_1.png")

We can see that 468 Ns and 233 Ys are correctly matched up, so we could crudely say that our overall accuracy is (468+233)/891 = 78.7%. Not bad! (Remember that, for balanced classes, a 50% accuracy would be the same as a pure guess.) One issue with this accuracy metric, however, becomes clear when we have unbalanced data. For example, if we had a 95/5 % split in classes, then our model could reach a 95% accuracy simply by predicting every observation to be the majority class. We will examine alternative accuracy measures later.
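
Rather than computing this by hand, we can read the crude accuracy straight off the confusion matrix, since the correct predictions sit on its diagonal:

tab <- xtabs(~Solar + pred_solar, data = ndat)
sum(diag(tab)) / sum(tab)   # (468 + 233) / 891, approx. 0.787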

From the confusion matrix, we can say that we have 81 false positives, and 109 false negatives. So using the language in the figure above, we can say the aim of a good logistic regression is to maximize true positives and true negatives, and minimize false positives and false negatives.

Now that we have a useful language to describe our results, let’s revisit that density plot; this time, we will split the fitted probabilities up by the actual class:

g1 = ggplot(subset(ndat, Solar=="Y"), aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02,fill="deepskyblue") + xlab("Probabilities") + ggtitle("Solar = Y")
g2 = ggplot(subset(ndat, Solar=="N"), aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02, fill="coral1") + xlab("Probabilities")  + ggtitle("Solar = N")

multiplot(g1,g2,cols=2)

Hmm, not so good. This plot allows us to visualize where our errors are (e.g. the blue hump at ~0.2 represents false negatives, as the model is predicting that group of solar=Y farmers to be solar=N farmers). Therefore, whilst the model does reasonably well (witness the high accuracy score above), it is certainly not perfect. The most worrying part is the false confidence of the model, as our false probabilities line up neatly with our true probabilities. If we overlay the plots from above, we can see this clearly:

ggplot(ndat, aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02) + xlab("Probabilities")

The fact that our errors sit so snugly within our true values suggests that it is difficult to optimize this model to enable further class discrimination. You can attempt to do this manually: try placing a vertical line on the x-axis that maximizes our true negatives/positives and minimizes our false negatives/positives; you’ll see you very quickly have to start making trade-offs.

We will look at how to improve this model a little later, but first, let’s look at other methods to evaluate model performance.

2.4 Better measures of performance

As mentioned above, the crude \((TP + TN) / N\) accuracy suffers from a few flaws. We will therefore also introduce the ROC curve.

Rather than calculating the false positive and false negative rate for each and every model, we can use a ROC curve to quickly understand the relationships between predicted probabilities and accuracy:

library(pROC)


rocr_ <- roc(ndat$Solar, ndat$probs)
plot.roc(rocr_, xlim=c(1,0), ylim=c(0,1), xlab="True negative rate", ylab="True positive rate")

A good ROC curve will have a very steep line that quickly plateaus, indicating a high TP rate, and a high TN rate (examine the graph above to understand why this is). Note also that the diagonal line indicates a model with an accuracy equal to random guessing.

Finally, we can take the area under the graph above to quickly summarize the accuracy of our model. This area is known as the AUROC, or Area Under the ROC curve, and runs from 0.5 (random guessing) to 1.0 (perfect accuracy).

We can calculate the AUROC for our model above as follows:

rocr_$auc
>> Area under the curve: 0.7735

An AUROC of 0.77 is reasonably good (an AUROC of 0.5 indicates a model that is as good as random guessing, regardless of class balance).

We now have a quick method to score our logistic regression, this allows us to iteratively add features and test model performance.

2.5 Improving the logistic regression

We now have a basic logistic regression working. Before we examine the results from our logistic model, let’s try improving it (i.e. increasing the AUROC score).

As with a linear regression (see AMP lesson 3), there are two simple ways to improve a fit:

  • Add variables
  • Transform data

Let’s begin by including more variables that we think might improve our fit.

lg2 <- glm(Solar ~ Gender + Expenditure + Acres + District + Repayment , data = dat, family = "binomial")

Let’s also add an interaction term for gender and expenditure, based on the hypothesis that the effect of expenditure is modified by farmer gender.

lg3 <- glm(Solar ~ Gender + Expenditure + Acres + District + Repayment + Gender*Expenditure , data = dat, family = "binomial")

Now, before we proceed to model scoring, let’s use an ANOVA test to understand whether these additional terms are warranted. We can do this in two ways.

  1. We could test models directly against each other:
anova(lg1,lg2, lg3,test="Chisq")
>> Analysis of Deviance Table
>> 
>> Model 1: Solar ~ Gender + Expenditure
>> Model 2: Solar ~ Gender + Expenditure + Acres + District + Repayment
>> Model 3: Solar ~ Gender + Expenditure + Acres + District + Repayment + 
>>     Gender * Expenditure
>>   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
>> 1       888     916.06                          
>> 2       883     817.19  5   98.862 < 2.2e-16 ***
>> 3       882     809.94  1    7.250  0.007092 ** 
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that our model 2 is significantly better than model 1, but our model 3 (with the interaction term) is even better. The improved performance of model 3 therefore warrants the change in degrees of freedom (see AMP lesson 3 for a refresher).

  2. We could sequentially test each variable to see whether it significantly improves model fit:
anova(lg3, test="Chisq")
>> Analysis of Deviance Table
>> 
>> Model: binomial, link: logit
>> 
>> Response: Solar
>> 
>> Terms added sequentially (first to last)
>> 
>> 
>>                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
>> NULL                                 890    1186.66              
>> Gender              1  268.851       889     917.80 < 2.2e-16 ***
>> Expenditure         1    1.748       888     916.06  0.186148    
>> Acres               1   92.960       887     823.10 < 2.2e-16 ***
>> District            3    5.899       884     817.20  0.116615    
>> Repayment           1    0.003       883     817.19  0.952867    
>> Gender:Expenditure  1    7.250       882     809.94  0.007092 ** 
>> ---
>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA test suggests that only Gender, Acres and the interaction term Gender:Expenditure significantly improve our fit. At this stage we could remove all other variables; however, we will keep them in for now, as they will be helpful when explaining coefficient interpretation.

Note that a significant interaction term indicates that the effect of one predictor variable on the response variable differs across values of the other predictor variable; i.e. the association between Expenditure and Solar is different for males and females.
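
A quick way to see this from the fitted model is to combine the coefficients into gender-specific Expenditure slopes (on the log-odds scale):

cf <- coef(lg3)
cf["Expenditure"]                                  # slope for female farmers, approx. -0.055
cf["Expenditure"] + cf["Gendermale:Expenditure"]   # slope for male farmers, approx. 0.152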

Let’s now view the confusion matrix from the third model:

ndat <- dat

# save the fitted class probabilities
ndat$probs <- lg3$fitted.values
# convert probabilities into class labels using a 0.5 cutoff
ndat$pred_solar <- ifelse(ndat$probs >= 0.5, "Y", "N")

xtabs(~Solar + pred_solar, data=ndat)
     pred_solar
Solar   N   Y
    N 452  97
    Y  95 247

Using a crude measure of accuracy ((TP+TN)/N) and the same 0.5 cutoff, the model scores (452+247)/891 ≈ 78.5% - essentially unchanged from before. As we will see, though, this model has more room for improvement once we optimize the cutoff.

Let’s also examine the distribution of probabilities from our third model:

ggplot() + geom_density(aes(lg3$fitted.values),alpha=.3, bw=0.05) + xlab("Class probability")

Wow! Very different to before. We will need to optimize the cutoff here, rather than just using our default 0.5 value. We can optimize it as follows:

library(ROCR)

# build a ROCR prediction object from the fitted probabilities and true classes
pred <- prediction(ndat$probs, ndat$Solar)

# "cost" evaluates the misclassification cost at every candidate cutoff;
# we keep the cutoff that minimizes it
cost.perf <- performance(pred, "cost")
cut <- pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]

Let’s recompute our confusion matrix with the new cutoff:

ndat <- dat

# save the fitted class probabilities
ndat$probs <- lg3$fitted.values
# convert probabilities into class labels using the optimized cutoff
ndat$pred_solar <- ifelse(ndat$probs >= cut, "Y", "N")

xtabs(~Solar + pred_solar, data=ndat)
     pred_solar
Solar   N   Y
    N 521  28
    Y 141 201

Using a crude measure of accuracy, we now score 81%.

Let’s now check the probability distributions of our data:

ggplot(ndat, aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02) + xlab("Probabilities")

This graph is harder to read than our previous one, but we can clearly see that the density of solar=N classes decreases at higher probabilities, whilst the separation of our solar=Y class still seems pretty poor. Our new model is therefore better at discriminating true/false negatives than true/false positives (but is improved overall). We can quickly look at the improvement in our true negative / true positive trade-off with the AUROC score we introduced earlier:

rocr_ <- roc(ndat$Solar, ndat$probs)
plot.roc(rocr_, xlim=c(1,0), ylim=c(0,1), xlab="True negative rate", ylab="True positive rate")

rocr_$auc
>> Area under the curve: 0.8446

This score represents a clear improvement over the ~0.77 achieved with our simple model.

Note that the assumptions underlying a logistic model are generally more lax than a linear model. For example we haven’t looked at residual plots or leverage plots here. Whilst residuals and leverage can matter in a logistic context, their interpretation is much more complicated than in a linear model. We will therefore not cover them here, and instead focus on simpler methods to evaluate model fits.

2.6 Interpreting logistic results

We have now examined the scoring and diagnosing of a logistic model. At this point we are confident that our model is accurately capturing a substantial amount of variation in the underlying data (as it is able to achieve an 80% plus accuracy).

Let’s now look at the coefficients and other regression terms from our model:

library(stargazer)
stargazer(lg3, type = "html",style="all", title="model 3",notes="", notes.append=FALSE, omit.table.layout = "na")
model 3
=====================================================================
                          Dependent variable: Solar
---------------------------------------------------------------------
Gendermale                -3.630***   (0.432)    t = -8.396  p = 0.000
Expenditure               -0.055      (0.061)    t = -0.907  p = 0.365
Acres                     -1.014***   (0.120)    t = -8.480  p = 0.000
DistrictC                 -11.588     (624.143)  t = -0.019  p = 0.986
DistrictQ                 -11.756     (624.143)  t = -0.019  p = 0.985
DistrictS                 -12.145     (624.143)  t = -0.019  p = 0.985
Repayment                 0.003       (0.018)    t = 0.144   p = 0.886
Gendermale:Expenditure    0.207***    (0.077)    t = 2.696   p = 0.008
Constant                  15.462      (624.145)  t = 0.025   p = 0.981
---------------------------------------------------------------------
Observations              891
Log Likelihood            -404.972
Akaike Inf. Crit.         827.944
Residual Deviance         809.944 (df = 882)
Null Deviance             1,186.655 (df = 890)
=====================================================================

Let’s refresh these terms, starting with the top part of the table:

  • The coefficients are the unbracketed numbers in line with the variable name, coefficients are given as log-odds (see below)
  • bracketed numbers refer to standard errors (also in log-odds)
  • t refers to the t-statistic (\(coefficient / std.err\))
  • p refers to the p-value
  • The constant term at the bottom is our intercept

The bottom part of the table contains information that allows goodness-of-fit to be assessed:

  • Observations - the number of data points in the model
  • Log likelihood - a scoring metric that can be used to compare models fitted to the same number of observations
  • AIC - a score of model fit (see AMP lesson 3)
  • Residual deviance - reflects the difference between this model and a “perfect” (saturated) model; smaller numbers indicate a closer fit
  • Null deviance - the deviance of an intercept-only model; the gap between the null and residual deviances shows how much our predictors improve the fit

In short, we can use the AIC to compare between models we have built, and we can use the residual and null deviances to understand the effect of including additional variables on our fit. Note that these terms should all be used relatively, i.e. there is no arbitrary AIC value to beat; model AICs should instead be compared to one another.
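
For example, we can pull the AICs of our three candidate models into one table for a side-by-side (relative) comparison:

# lower AIC = better trade-off between fit and complexity
AIC(lg1, lg2, lg3)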

2.7 Logistic coefficients

Our interpretation of coefficients is slightly different than in a linear model. We interpret the raw coefficients from a logistic model as:

Given a 1 unit increase in X, I expect a coeff increase in the log-odds of the event occurring

Where here our “event” is the adoption of solar and coeff is the coefficient value.

For our interaction term, we can interpret the coefficient as:

Given a 1 unit increase in X1 (for a given X2), we expect a coeff increase in the log-odds of the event occurring.

In this case, that translates as:

Given a 1 unit increase in Expenditure, when the farmer is male I expect a 0.2 increase in the log-odds of solar adoption

2.8 What are log odds?

At this point, you might be asking what “log-odds” means. The odds ratio from a logistic regression tells you how the odds of an event change given a one-unit change in a predictor; the log-odds is simply the logarithm of this term. To simplify things, let’s convert our log-odds into simple odds ratios (OR), and add some confidence intervals:

ORs = exp(cbind(OR = coef(lg3), confint(lg3)))
library(knitr)

kable(ORs, digits=3)
                                 OR     2.5 %         97.5 %
(Intercept)             5191149.074     0.000             NA
Gendermale                    0.027     0.011          0.061
Expenditure                   0.946     0.839          1.067
Acres                         0.363     0.286          0.457
DistrictC                     0.000        NA   2.926192e+37
DistrictQ                     0.000        NA   2.265986e+37
DistrictS                     0.000        NA   1.745059e+37
Repayment                     1.003     0.967          1.040
Gendermale:Expenditure        1.230     1.058          1.431

By taking the exponent, we have converted our log-odds into simple odds-ratios. We can now more easily interpret the model results:

We have an odds ratio of 1.003 for Repayment, so a 1 unit increase in Repayment is associated with a factor 1.003 increase in the odds of adopting solar. Put differently, a 1 unit increase in Repayment is associated with a 0.3% increase in the odds of solar adoption (with everything else held constant).

So far, so good. Where it might get tricky is when we have an odds ratio of < 1.0. For example, how do we interpret GenderMale with its 0.027 coefficient?

The simplest way to do this is to take the reciprocal of the coefficient and flip the class. So an OR of 0.027 for Gendermale becomes roughly 38 for Genderfemale; that is, female farmers have around 38 times the odds of adopting solar (when other factors are held constant).
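
As a quick sketch of that flip in code:

or_male <- exp(coef(lg3)["Gendermale"])   # approx. 0.027: odds for male relative to female
1 / or_male                               # approx. 38: odds for female relative to male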

Therefore, by combining the reported significance of variables from the regression table with the exponentiated log-odds ratios, we can say that the following factors are significant:

  • Gender, female OR = 38
  • Acres, OR = 0.363
  • Gender:Expenditure = 1.23

Before we summarise our findings from this model, it is worth visiting one last assumption that applies to logistic and linear models.

2.9 Assumption: Correlated predictors

Let’s take a quick detour to test an assumption that applies to both linear and logistic regressions:

  • Predictor variables are not significantly correlated with each other (aka. no co-linearity)

Having highly correlated predictor variables makes it difficult to extract accurate coefficient errors. We can check for correlations within our predictors by calculating the variance inflation factor, or VIF. The VIF reports the estimated inflation in a coefficient’s variance due to correlated predictors, so a higher VIF indicates inflated errors and possible issues with robust coefficient interpretation. Let’s calculate the VIF for our model:

library(car)
vif(lg2)
##                 GVIF Df GVIF^(1/(2*Df))
## Gender      1.089564  1        1.043822
## Expenditure 1.196321  1        1.093764
## Acres       1.311830  1        1.145352
## District    1.153796  3        1.024129
## Repayment   1.003053  1        1.001525

Our VIF scores are all quite low; as a rule of thumb, VIF scores over 2.0 are a cause for concern. The highest VIF score seen here is 1.3 for Acres, indicating that the variance of the Acres coefficient is inflated by a factor of 1.3. It is important that the analyst communicates when a potentially interesting finding is afflicted with co-linearity (high VIF).

As we are not seeing any large VIF values, we can proceed to interpreting the results of our analysis.

2.10 Interpreting non-RCT results

We have now run a logistic regression on historical OAF data and checked the key assumptions. Overall we are satisfied that:

  • We do not have severely correlated predictor variables (low VIFs)
  • We have a model that accurately (80% plus) matches our fitted values to our real values (confusion matrix)
  • Our data set is balanced, and our FP and FN are likewise balanced (confusion matrix)
  • We are not over-fitting (p < 0.05 ANOVA results)

We are now in a place to interpret our results. Possibly the most interesting finding is that repayment is non-significant, i.e. solar adoption is not associated with a change in end-of-season repayment percentage.

Our data also suggests that female farmers are far more likely to adopt solar, and that their decision is unrelated to weekly expenditure on lighting. In contrast, weekly expenditure on lighting is significantly correlated with solar adoption for male farmers; every extra US dollar spent on lighting per week is associated with a factor 1.23 increase in the odds of adoption. We have therefore managed to pick out some interesting gender-specific effects.

It is worth re-iterating here that all we can find when analyzing historical data is correlations and not causations. This subtly changes our interpretation: for example, from the above coefficient table we can see that season loan repayment is not correlated with solar adoption. Given the benefits of solar as both a quality of life product and an expense-offsetting product, this might suggest a wider roll out of solar is impact-positive and finance-neutral for One Acre Fund. However, as this is not an RCT, we cannot be sure that there is not an unobservable factor which causes farmers to both adopt solar and repay their loans on time. For example, it could be that only wealthier farmers adopt solar, and therefore they are able to repay their loans regardless, and that pushing solar adoption to a wider demographic of farmers would lead to poorer repayment as less-wealthy farmers begin to adopt solar systems.

Likewise, smaller land sizes are associated with higher solar adoption. Perhaps this indicates that poorer farmers are more likely to adopt solar, or perhaps purchasing an expensive solar home system causes a reduction in planted acres (due to total loan size constraints). In a non RCT scenario, we have no way to understand causality, and so no way to assess these statements.

Be very careful when interpreting and explaining results from non-RCT analyses; as an analyst, it is your responsibility to make sure that decision-makers in One Acre Fund are aware of the caveats attached to your findings.

3 Logistic mixed models

We have now seen how to run a simple logistic regression on farmer-level data; however, we have not yet accounted for random effects (see AMP lesson 3 for a refresher). We will now turn to a mixed-effects model to understand the role random effects play in our regression results. Please ensure you are familiar with the packages and concepts introduced in AMP lesson 3 before proceeding, as this lesson assumes you have worked through the AMP in sequence.

3.1 Running a mixed logistic model

For logistic mixed effects modeling we will use the lme4 package (rather than the nlme package used in AMP lesson 3).

The arguments are slightly different to the nlme package. For lme4 we want to specify:

  • Formula, as in the logistic regression above
  • Random effects: we add these directly to the formula as + (1|District), which allows a separate intercept for each District
  • family which is binomial as above

We will ignore the other terms for now; see the lme4 page for more advanced functionality.

Let’s run a mixed model using the findings from our simple model to inform on variable choice:

library(lme4)

mix1 <- glmer(Solar ~ Acres + Gender + Expenditure + Gender*Expenditure  +
    (1 | District), data = dat, family = binomial)
summary(mix1)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: Solar ~ Acres + Gender + Expenditure + Gender * Expenditure +  
    (1 | District)
   Data: dat

     AIC      BIC   logLik deviance df.resid 
   827.5    856.2   -407.7    815.5      885 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.1220 -0.4524 -0.2954  0.4925  3.9327 

Random effects:
 Groups   Name        Variance Std.Dev.
 District (Intercept) 0.03872  0.1968  
Number of obs: 891, groups:  District, 4

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             3.80367    0.44737   8.502  < 2e-16 ***
Acres                  -1.02775    0.11772  -8.731  < 2e-16 ***
Gendermale             -3.60501    0.43069  -8.370  < 2e-16 ***
Expenditure            -0.04587    0.06126  -0.749  0.45406    
Gendermale:Expenditure  0.20002    0.07668   2.609  0.00909 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) Acres  Gndrml Expndt
Acres       -0.517                     
Gendermale  -0.710  0.165              
Expenditure -0.625 -0.193  0.676       
Gndrml:Expn  0.600 -0.060 -0.902 -0.750

This output should look somewhat familiar (AMP lesson 3). Breaking down the sections:

  • We start with a reminder of the formula called, and some criteria for evaluating model fits (AIC, BIC etc)
  • We then see a summary of model residuals
  • Random effect coefficients are then summarized
  • Our fixed effects can be interpreted as the logistic model above
  • Finally we see a non-VIF method for calculating predictor variable correlation

Remember that our coefficients are all in logit space, so they need to be exponentiated to convert them to simple ORs:

#compute stderrs
sers <- sqrt(diag(vcov(mix1)))
# table of estimates with 95% CI
tabf <- cbind(OR = fixef(mix1), LowerCI = fixef(mix1) - 1.96 * sers, UpperCI = fixef(mix1) + 1.96 *
    sers)

tabr <- ranef(mix1)$District
tar = data.frame("Mean intercept"=mean(tabr$`(Intercept)`), "Stdev"=sd(tabr$`(Intercept)`))
rownames(tar) = "District random effects"

#exponent
kable(exp(tar))
                        Mean.intercept    Stdev
District random effects       1.000301 1.155549

An OR of 1.0 indicates that the variable does not influence solar adoption at all. We can see that District random effects, therefore, have a negligible effect on our outcome, suggesting that ICC is minimal. Let’s double check this with our ICC calculator from AMP lesson 2:

ICC_CI <- function(cluster_level, outcomevar, dataf){
  # load library
  require(ICC)
  set.seed(123)
  si <- round(dim(dataf)[1] * 0.66)
  values_f <- c()
  # estimate the ICC on 50 random 66% subsamples of the data
  for(i in 1:50){
    samp_f <- dataf[sample(nrow(dataf), si), ]
    x_f <- ICCbare(cluster_level, outcomevar, samp_f)
    values_f <- c(values_f, x_f)
  }
  # note that 1.96 StDevs = 95% confidence interval bounds in a normal dist.
  ret <- data.frame("Mean ICC" = round(mean(values_f, na.rm=TRUE), 3),
                    "CI" = round(1.96 * sd(values_f, na.rm=TRUE), 3))
  ret$Significant <- ifelse(ret$Mean.ICC > ret$CI, "Y", "N")
  return(ret)
}
# ICCbare needs a numeric outcome, so create a 0/1 recode of Solar
# (the "sol" column used below; assuming Y = 1, N = 0)
dat$sol <- as.numeric(dat$Solar == "Y")

stored_ICC <- ICC_CI("District", "sol", dat)

stored_ICC
  Mean.ICC    CI Significant
1    0.066 0.035           Y

The ICC calculation affirms that our ICC is minimal, at 0.066. We can therefore confidently say that District level heterogeneity does not significantly perturb our understanding of solar adoption (for the Districts in this data set).
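
As a further rough cross-check, we can approximate the ICC from the model's own random-effect variance. For a logistic mixed model, the latent-scale ICC is often computed as sigma^2 / (sigma^2 + pi^2/3), where pi^2/3 is the variance of the standard logistic distribution:

sigma2 <- 0.03872             # District intercept variance from the glmer summary above
sigma2 / (sigma2 + pi^2 / 3)  # approx. 0.012, again suggesting minimal district clustering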

Note that a fair critique here would be to ask how confident we can be that district-level random effects are minimal when we have only four districts; it is generally difficult to accurately specify a distribution from only four data points. It’s therefore possible that we merely have too small a sample of districts to detect significant random effects. Answering this question requires the application of some more advanced mathematics, and in keeping with the math-lite spirit of the AMP it will not be answered here. However, it is always good practice to be thinking about statistical power and sample size (AMP lesson 1).

Lastly, let’s compare the fits of our simple logistic regression to our mixed model:

anova(mix1,lg3,test="Chisq")
## Data: dat
## Models:
## mix1: Solar ~ Acres + Gender + Expenditure + Gender * Expenditure + 
## mix1:     (1 | District)
## lg3: Solar ~ Gender + Expenditure + Acres + District + Repayment + 
## lg3:     Gender * Expenditure
##      Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
## mix1  6 827.46 856.21 -407.73   815.46                         
## lg3   9 827.94 871.08 -404.97   809.94 5.5118      3     0.1379

We can see that our Chi-squared test indicates that there is no significant difference in model performance (note the 0.14 p-value).

This means that we expect our models to yield similar coefficient results:

stargazer(lg3, mix1, type="html", title="Comparison of models")
Comparison of models
====================================================================
                                 Dependent variable: Solar
                          ------------------------------------------
                             logistic          generalized linear
                                                 mixed-effects
                                (1)                   (2)
--------------------------------------------------------------------
Gendermale                   -3.630***             -3.605***
                             (0.432)               (0.431)
Expenditure                  -0.055                -0.046
                             (0.061)               (0.061)
Acres                        -1.014***             -1.028***
                             (0.120)               (0.118)
DistrictC                    -11.588
                             (624.143)
DistrictQ                    -11.756
                             (624.143)
DistrictS                    -12.145
                             (624.143)
Repayment                    0.003
                             (0.018)
Gendermale:Expenditure       0.207***              0.200***
                             (0.077)               (0.077)
Constant                     15.462                3.804***
                             (624.145)             (0.447)
--------------------------------------------------------------------
Observations                 891                   891
Log Likelihood               -404.972              -407.728
Akaike Inf. Crit.            827.944               827.456
Bayesian Inf. Crit.                                856.210
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Both models are in broad agreement about the significance and magnitude of variables. We could therefore use either model output to communicate our findings. My preferred method is the mixed model:

kable(exp(tabf))
                                OR     LowerCI      UpperCI
(Intercept)             44.8653693  18.6680801  107.8258367
Acres                    0.3578103   0.2840873    0.4506651
Gendermale               0.0271873   0.0116884    0.0632377
Expenditure              0.9551696   0.8470928    1.0770354
Gendermale:Expenditure   1.2214262   1.0509899    1.4195018

As previously stated, we see that Gender, Acres and Gender:Expenditure correlate significantly with solar adoption.

This concludes lesson 4. After working through lessons 1-4 you should now be able to confidently analyse a wide range of data using the most appropriate tool.

4 Lesson summary

This lesson should be used in conjunction with AMP lesson 3 to analyse RCT or non-RCT data:

  • Power calculations and MDEs will enable you to understand the limitations of your study, both for hypothesis testing and regression analysis

  • Regression assumptions and diagnostics are incredibly important. Remember to check your linear regression for:
    • Normal residuals
    • Linear relationships
    • Lack of high-leverage points
    • Normal random effects (if they’re included)
  • Regression assumptions and diagnostics are incredibly important. Remember to check your logistic regression for:
    • AUC scoring
    • confusion matrices
    • Random effects
  • We define a good model as one that:
    • Explains/predicts the full range of variable Y
    • Is unbiased
    • Produces robust confidence intervals
    • Trades off simplicity and accuracy (hence ANOVA testing)
  • Present all relevant materials (diagnostics, justifications) in your reports; you can place these in the appendix.

  • Ensure reports are clear and reproducible; I highly recommend learning how to use R-markdown to produce reports combining text, figures and code.

5 Analysis check-list

When conducting an analysis, be sure to check and include the following:

  • Check hypotheses are robustly defined
  • Check power and MDEs for each key hypothesis
  • Hypothesis test each hypothesis with the correct test!
  • Check the pre-analysis plan for guidance on whether the analysis is to be carried out by summarization or mixed modelling (most likely, it will be the former)
  • Carry out regression analysis, testing assumptions with model-plots as you go
  • Keep these tests/plots in an appendix for future readers!
  • Test model quality through ANOVA testing to ensure model additions are warranted
  • Clearly state any model caveats or limitations in the main body of the text!
  • Present any justifications for analysis or model choices (even if it seems obvious) in the appendix.