We previously covered the key parts of RCT analysis in AMP lesson 3, where we used hypothesis testing and linear regression to understand the effects of a 2016 Kenya program innovation trial.
We will now turn to a second major set of analyses that are common at One Acre Fund: the analysis of historical data. Historical data refers to any non-RCT data, and could include analysing country-wide repayment data to find key correlations, or analysing loan components to understand product adoption trends. At One Acre Fund this is often called "driver analysis"; however, this label is misleading, as "driver" implies an understanding of causality, which we cannot get from non-RCT data.
Whilst historical data cannot inform us about causality, it is incredibly useful for finding new hypotheses, and for assessing our understanding of the relationships between variables.
In this lesson we will analyse synthetic OAF data using two new tools: logistic regression and mixed-effect logistic regression. Logistic regression is useful when your outcome variable is binary (e.g. anything with a true/false, yes/no outcome). Whilst the context for this lesson will be non-RCT data, the tools are readily applied to RCT data.
The aim of this lesson, therefore, is to both introduce some new tools, and to demonstrate how our interpretation of results changes in a non-RCT vs. an RCT setting.
As usual, I will explain the steps we are taking and the logic behind them; after this lesson you should be fully able to analyse a wide variety of One Acre Fund data sets.
Continuous, categorical, ordinal variables: types of variables common in data analysis (see link)
P-value: The probability of observing a result at least as extreme as ours purely by chance, i.e. under the assumption that there is no real difference between the populations
Power: The probability that we are able to detect a difference between two populations, given a certain sample size, distribution and error rate. Note that power also affects how robust our results are.
Hypothesis test: a statistical test. Each hypothesis test has two parts: a null hypothesis, which says that the two populations are the same (i.e. there is 0 difference), and an alternative hypothesis, which says that there is a difference between the populations. Often the null hypothesis is called H0 and the alternative hypothesis is called H1.
Odds ratio (OR): The OR represents the odds that an outcome will occur given a particular variable, compared to the odds of the outcome occurring in the absence of that variable.
Logistic regression: Logistic regression is used to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
We will introduce logistic regression using a synthetic data set that neatly resembles real OAF data. In this scenario, we have rolled out a new solar home system across four districts in Kenya, and now want to understand the relationships between adoption of solar and other variables of interest. Note that as this is not an RCT, we have to be very careful with our causality statements; essentially, we will be able to say that A correlates with B, but not that A causes B.
Let's look at the data set:
dat <- read.csv("AMP4_data.csv")
str(dat)
'data.frame': 891 obs. of 7 variables:
$ OAFID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Solar : Factor w/ 2 levels "N","Y": 1 2 2 2 1 1 1 1 2 2 ...
$ Acres : int 3 1 3 1 3 3 1 3 3 2 ...
$ Gender : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Expenditure: num 4.55 2.63 3.85 2.86 2.86 ...
$ Repayment : num 97.8 84.1 96.1 98.6 96 ...
$ District : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
We want to understand the factors associated with solar adoption; this makes Solar our outcome variable. The first thing we will want to check is how balanced our data is. Severely unbalanced data (e.g. most events being "Y" rather than "N") can be fatal to a logistic regression, so this is a key assumption we will want to explore before we proceed:
table(dat$Solar)
N Y
549 342
We can see that whilst our data is somewhat unbalanced, it is not severely so. Approximately 40% of our farmers in the data set adopted solar. If we were to see a 20/80 or 30/70 split, then we would want to explore alternative tools for analysis.
As in AMP lesson 1, we will want to begin by exploring the data. We can begin with some box plots for continuous variables:
library(ggplot2)
# boxplot and remove redundant legend.
g2 <- ggplot(dat, aes(x=Solar, y=Expenditure, fill=Solar)) + geom_boxplot() +
guides(fill=FALSE) + ylab("Expenditure")
# boxplot and remove redundant legend.
g3 <- ggplot(dat, aes(x=Solar, y=Repayment, fill=Solar)) + geom_boxplot() +
guides(fill=FALSE) + ylab("Repayment")
multiplot(g2,g3,cols=2)
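Note that multiplot() is not part of base R or ggplot2; it is a helper function popularised by the Cookbook for R. If you do not have it defined, the gridExtra package (an assumption on our part, but a standard alternative) gives an equivalent side-by-side layout:

# equivalent layout without the multiplot() helper
library(gridExtra)
grid.arrange(g2, g3, ncol = 2)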
There are some subtle differences here: it looks like there may be a weak relationship between expenditure and solar adoption, whilst it's difficult to see any relationship between solar and repayment percentage.
We can also explore factor-factor relationships with a cross-table:
xtabs(~Solar + Gender, data = dat)
Gender
Solar female male
N 81 468
Y 233 109
This table suggests that female farmers are much more likely than male farmers to adopt solar.
Remember that whilst it is important to explore the underlying data (for interesting trends, outliers etc), only a regression will be able to pick apart relationships and so give estimates for a variable in isolation.
Now we have explored the data, let us begin building a rudimentary logistic regression model.
Logistic regression works by using a logistic distribution to describe the probability of an event occurring:
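For reference, the curve being fitted is the logistic function, which maps any value of x onto a probability between 0 and 1:

\(P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\)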
In the above graph, yvar is our categorical variable (e.g. a Y/N, True/False, Healthy/Sick event that takes on coded binary values), and can only take values of 0 or 1. We can see that xvar shows some relationship to yvar, and we have fitted a logistic probability distribution to the variables.
From the graph above, note:

- when xvar > 200, we are pretty sure that yvar = 0
- when xvar < 60, we are pretty sure that yvar = 1
- between xvar = 100 and xvar = 150, there is a very steep slope where our probability rapidly changes

Unlike a linear regression, which outputs fitted values, our logistic regression will output the probability (i.e. the green line above) that each observation belongs to a certain class (Y/N, healthy/sick etc). We can see in the above graph that we are observing a fairly strong relationship between xvar and yvar; that is, there is a substantial difference in the green line at low vs. high xvar values.
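The original figure is not reproduced here, but a minimal sketch with simulated data (our own xvar and yvar, chosen to mimic the description above) produces the same shape:

library(ggplot2)
set.seed(42)
# simulate a binary outcome whose probability falls as xvar rises,
# with the steep transition between roughly xvar = 100 and xvar = 150
xvar <- runif(500, 0, 250)
yvar <- rbinom(500, 1, plogis((125 - xvar) / 20))
sim <- data.frame(xvar, yvar)
# fit and plot a logistic curve through the points
ggplot(sim, aes(x = xvar, y = yvar)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE, colour = "green4")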
Let's now look at a weaker relationship, specifically Solar ~ Expenditure from our synthetic OAF data - remember that we earlier hypothesized that Expenditure might correlate with Solar based on the box plots we saw:
Notice the stark difference between the two graphs above! The relative flatness of this curve indicates that there is little to no relationship between Solar and Expenditure. We can therefore expect that this variable will not be statistically significant (i.e. there is a low probability that the slope of the green line above is meaningfully non-zero).
We can visualize the above relationships because we only have one dependent and one independent variable. When we start adding more variables, it becomes more difficult to clearly visualize each relationship (hopefully this concept is familiar from linear regression, AMP lesson 3). Remember that logistic regression (like linear regression) gives us the effect of variable X1 when all other variables are held constant. It is this ability to separate out effects which makes regression incredibly useful for picking out trends and relationships.
To run our first logistic regression we will call the glm function from base R. The glm function expects several arguments:

- formula: a formula for the relationship you want to test, e.g. Solar ~ Gender + Expenditure
- family: the error distribution (and link function) for the model, in this case binomial
- data: the dataframe containing your data

Our first model will look only at the Gender and Expenditure variables, and will be called as follows:
lg1 <- glm(Solar ~ Gender + Expenditure, data = dat, family = "binomial")
As with the linear models in AMP lesson 3, we will want to check our model’s performance to understand whether (and to what extent) we can trust the results. Unlike a linear regression, the output from a logistic regression requires a little more processing.
As previously mentioned, when examining the logistic regression outputs, we are actually examining the probabilities that each row belongs to a certain class. In R, this is the probability of belonging to the second factor level (which here is Solar=Y).
Let’s examine a density plot of these raw probabilities:
ggplot() + geom_density(aes(lg1$fitted.values),alpha=.3, bw=0.02, fill="grey") + xlab("Class probability")
This is a good looking density plot! Notice that we have two, very clearly separated, humps of probabilities around ~0.2 and ~0.7. The gulf between these humps suggests that our model is doing a good job of discriminating between our classes.
To convert these probabilities into predicted classes (e.g. "Y" or "N") we will need to draw a vertical line somewhere on that graph and call everything below that cutoff "N" and everything above it "Y". This is known as optimizing the cutoff and is a common challenge when regressing categorical variables.
If we set our boundary at 0.5 and convert our probabilities into classes we get the following cross table:
ndat <- dat
#save the probs
ndat$probs <- lg1$fitted.values
# convert probabilities into class labels using a 0.5 cutoff
ndat$pred_solar <- ifelse(ndat$probs >= 0.5, "Y", "N")
xtabs(~Solar + pred_solar, data=ndat)
pred_solar
Solar N Y
N 468 81
Y 109 233
The table above is known as a confusion matrix: we have the actual data classes on the left, and the predicted classes on the top. Our aim is to match up as many "N"s and "Y"s as possible and minimize the mistakes (i.e. when a solar "N" is predicted as a "Y").
Note that each quadrant of our confusion matrix has a special name:
knitr::include_graphics("./confusion_matrix_1.png")
We can see that 468 Ns and 233 Ys are correctly matched up, so we could crudely say that our overall accuracy is (468+233)/891 = 78.7%. Not bad! (Remember that a 50% accuracy would be the same as a pure guess.) One issue with this accuracy metric, however, becomes clear when we have unbalanced data. For example, if we had a 95/5 split in classes, then our model could reach a 95% accuracy simply by predicting every observation to be the majority class. We will examine alternative accuracy measures later.
From the confusion matrix, we can say that we have 81 false positives, and 109 false negatives. So using the language in the figure above, we can say the aim of a good logistic regression is to maximize true positives and true negatives, and minimize false positives and false negatives.
Now that we have a useful language to describe our results, let's revisit that density plot, but this time we will split these fitted probabilities up by the actual class:
g1 = ggplot(subset(ndat, Solar=="Y"), aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02,fill="deepskyblue") + xlab("Probabilities") + ggtitle("Solar = Y")
g2 = ggplot(subset(ndat, Solar=="N"), aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02, fill="coral1") + xlab("Probabilities") + ggtitle("Solar = N")
multiplot(g1,g2,cols=2)
Hmm, not so good. This plot allows us to visualize where our errors are (e.g. the blue hump at ~0.2 indicates false negatives, as the model is predicting that hump of Solar=Y farmers to be Solar=N farmers). Therefore, whilst the model does reasonably well (witness the high accuracy score above), it is certainly not perfect. The most worrying part of this is the false confidence of the model, as our false probabilities line up neatly with our true probabilities. If we overlay the plots from above, we can see this clearly:
ggplot(ndat, aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02) + xlab("Probabilities")
The fact that our errors sit so snugly within our true values suggests that it is difficult to optimize this model to enable further class discrimination. You can attempt to do this manually: try placing a vertical line on the x-axis that maximizes our true negatives/positives and minimizes our false negatives/positives; you'll see you very quickly have to start making trade-offs.
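For example, you can overlay a candidate cutoff on the combined density plot and see the trade-off directly (0.5 here is just an arbitrary starting point):

# overlay a candidate cutoff on the class-split density plot
ggplot(ndat, aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02) +
  geom_vline(xintercept = 0.5, linetype = "dashed") + xlab("Probabilities")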
We will look at how to improve this model a little later, but first, let’s look at other methods to evaluate model performance.
As mentioned above, the crude \((TP + TN) / N\) accuracy suffers from a few flaws. We will therefore also introduce the ROC curve.
Rather than calculating the false positive and false negative rate for each and every model, we can use a ROC curve to quickly understand the relationships between predicted probabilities and accuracy:
library(pROC)
rocr_ <- roc(ndat$Solar, ndat$probs)
plot.roc(rocr_, xlim=c(1,0), ylim=c(0,1), xlab="True negative rate", ylab="True positive rate")
A good ROC curve will have a very steep line that quickly plateaus, indicating a high TP rate, and a high TN rate (examine the graph above to understand why this is). Note also that the diagonal line indicates a model with an accuracy equal to random guessing.
Finally, we can compute the area under the graph above to quickly summarize the accuracy of our model. This area is known as the AUROC, or Area Under the ROC curve, and runs from 0.5 (random guessing) to 1.0 (perfect accuracy).
We can calculate the AUROC for our model above as follows:
rocr_$auc
>> Area under the curve: 0.7735
An AUROC of 0.77 is reasonably good (an AUROC of 0.5 indicates a model that is as good as random guessing, regardless of class balance).
We now have a quick method to score our logistic regression; this allows us to iteratively add features and test model performance.
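For instance, a small helper of our own (the function name is ours, not from any package) makes this iteration easy - fit a formula, then score it:

library(pROC)
# convenience helper: fit a logistic regression and return the in-sample AUROC
# (assumes a binary Solar outcome, as in this lesson)
auroc_for <- function(formula, data) {
  m <- glm(formula, data = data, family = "binomial")
  as.numeric(roc(data$Solar, m$fitted.values)$auc)
}
auroc_for(Solar ~ Gender + Expenditure, dat)  # ~0.77, matching our first model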
We now have a basic logistic regression working. Before we examine the results from our logistic model, let's try improving it (i.e. increasing the AUROC score).
As with a linear regression (see AMP lesson 3), there are two simple ways to improve a fit: including more variables, and adding interaction terms.
Let’s begin by including more variables that we think might improve our fit.
lg2 <- glm(Solar ~ Gender + Expenditure + Acres + District + Repayment , data = dat, family = "binomial")
Let’s also add an interaction term for gender and expenditure, based on the hypothesis that expenditure is modified by farmer gender.
# note: in R, Gender*Expenditure expands to Gender + Expenditure + Gender:Expenditure
lg3 <- glm(Solar ~ Gender + Expenditure + Acres + District + Repayment + Gender*Expenditure , data = dat, family = "binomial")
Now, before we proceed to model scoring, let’s use an ANOVA test to understand whether these additional terms are warranted. We can do this in two ways.
anova(lg1,lg2, lg3,test="Chisq")
>> Analysis of Deviance Table
>>
>> Model 1: Solar ~ Gender + Expenditure
>> Model 2: Solar ~ Gender + Expenditure + Acres + District + Repayment
>> Model 3: Solar ~ Gender + Expenditure + Acres + District + Repayment +
>> Gender * Expenditure
>> Resid. Df Resid. Dev Df Deviance Pr(>Chi)
>> 1 888 916.06
>> 2 883 817.19 5 98.862 < 2.2e-16 ***
>> 3 882 809.94 1 7.250 0.007092 **
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that our model 2 is significantly better than model 1, but our model 3 (with the interaction term) is even better. The improved performance of model 3 therefore warrants the change in degrees of freedom (see AMP lesson 3 for a refresher).
anova(lg3, test="Chisq")
>> Analysis of Deviance Table
>>
>> Model: binomial, link: logit
>>
>> Response: Solar
>>
>> Terms added sequentially (first to last)
>>
>>
>> Df Deviance Resid. Df Resid. Dev Pr(>Chi)
>> NULL 890 1186.66
>> Gender 1 268.851 889 917.80 < 2.2e-16 ***
>> Expenditure 1 1.748 888 916.06 0.186148
>> Acres 1 92.960 887 823.10 < 2.2e-16 ***
>> District 3 5.899 884 817.20 0.116615
>> Repayment 1 0.003 883 817.19 0.952867
>> Gender:Expenditure 1 7.250 882 809.94 0.007092 **
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA test suggests that only Gender, Acres and the interaction term Gender:Expenditure improve our fit significantly. At this stage we could remove all other variables; however, we will keep them in for now as they will be helpful in explaining coefficient interpretation.
Note that finding a significant interaction term indicates that the effect of one predictor variable on the response variable is different for different values of the other predictor variable; i.e. the association of Expenditure and Solar is different for male and female farmers.
Let’s now view the confusion matrix from the third model:
ndat <- dat
#save the probs
ndat$probs <- lg3$fitted.values
# convert probabilities into class labels using a 0.5 cutoff
ndat$pred_solar <- ifelse(ndat$probs >= 0.5, "Y", "N")
xtabs(~Solar + pred_solar, data=ndat)
pred_solar
Solar N Y
N 452 97
Y 95 247
We can see that, using the same 0.5 cutoff, the crude measure of accuracy ((TP+TN)/N = (452+247)/891) gives 78.5%, essentially unchanged from before. The real improvement will come from choosing a better cutoff.
Let’s also examine the distribution of probabilities from our third model:
ggplot() + geom_density(aes(lg3$fitted.values),alpha=.3, bw=0.05) + xlab("Class probability")
Wow! Very different to before. We will need to optimize the cutoff here, rather than just using our default 0.5 value. We can optimize it as follows:
library(ROCR)
# create a prediction object from the fitted probabilities and true classes
pred <- prediction(ndat$probs,ndat$Solar)
# compute the misclassification cost at every possible cutoff
cost.perf = performance(pred, "cost")
# select the cutoff that minimizes that cost
cut = pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
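As a cross-check, pROC can also suggest a threshold (this uses Youden's J statistic rather than ROCR's cost measure, so the two cutoffs need not agree exactly):

# alternative cutoff via pROC's "best" coordinate (Youden's J)
library(pROC)
roc3 <- roc(ndat$Solar, ndat$probs)
coords(roc3, "best", ret = "threshold")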
Let’s recompute our confusion matrix with the new cutoff:
ndat <- dat
#save the probs
ndat$probs <- lg3$fitted.values
# convert probabilities into class labels using the optimized cutoff
ndat$pred_solar <- ifelse(ndat$probs >= cut, "Y", "N")
xtabs(~Solar + pred_solar, data=ndat)
pred_solar
Solar N Y
N 521 28
Y 141 201
Using a crude measure of accuracy, we now score 81%.
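If you find yourself computing this repeatedly, a one-line helper (our own convenience function, not from any package) keeps things tidy:

# crude accuracy: the proportion of predictions that match the actual class
accuracy <- function(actual, predicted) mean(actual == predicted)
accuracy(ndat$Solar, ndat$pred_solar)  # ~0.81 with the optimized cutoff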
Let’s now check the probability distributions of our data:
ggplot(ndat, aes(x=probs, fill=Solar)) + geom_density(alpha=.3, bw=0.02) + xlab("Probabilities")
This graph is harder to read than our previous one, but we can clearly see that the density of Solar=N classes decreases at higher probabilities, whilst the separation of our Solar=Y class still seems pretty poor. Our new model is therefore better at discriminating true/false negatives than true/false positives (but is improved overall). We can quickly look at the improvement in our true negative / true positive trade-off with the AUROC score we introduced earlier:
rocr_ <- roc(ndat$Solar, ndat$probs)
plot.roc(rocr_, xlim=c(1,0), ylim=c(0,1), xlab="True negative rate", ylab="True positive rate")
rocr_$auc
>> Area under the curve: 0.8446
This score represents a clear improvement over the ~0.77 achieved with our simple model.
Note that the assumptions underlying a logistic model are generally more lax than a linear model. For example we haven’t looked at residual plots or leverage plots here. Whilst residuals and leverage can matter in a logistic context, their interpretation is much more complicated than in a linear model. We will therefore not cover them here, and instead focus on simpler methods to evaluate model fits.
We have now examined the scoring and diagnosing of a logistic model. At this point we are confident that our model is accurately capturing a substantial amount of variation in the underlying data (as it is able to achieve an 80% plus accuracy).
Let’s now look at the coefficients and other regression terms from our model:
library(stargazer)
stargazer(lg3, type = "html",style="all", title="model 3",notes="", notes.append=FALSE, omit.table.layout = "na")
Model 3. Dependent variable: Solar.

 | Coefficient (SE) | t | p |
---|---|---|---|
Gendermale | -3.630*** (0.432) | -8.396 | 0.000 |
Expenditure | -0.055 (0.061) | -0.907 | 0.365 |
Acres | -1.014*** (0.120) | -8.480 | 0.000 |
DistrictC | -11.588 (624.143) | -0.019 | 0.986 |
DistrictQ | -11.756 (624.143) | -0.019 | 0.985 |
DistrictS | -12.145 (624.143) | -0.019 | 0.985 |
Repayment | 0.003 (0.018) | 0.144 | 0.886 |
Gendermale:Expenditure | 0.207*** (0.077) | 2.696 | 0.008 |
Constant | 15.462 (624.145) | 0.025 | 0.981 |
Observations | 891 | | |
Log Likelihood | -404.972 | | |
Akaike Inf. Crit. | 827.944 | | |
Residual Deviance | 809.944 (df = 882) | | |
Null Deviance | 1,186.655 (df = 890) | | |
Let's refresh these terms, starting with the top part of the table:

- t refers to the t-statistic (\(coefficient / std.err\))

The bottom part of the table contains information that allows goodness-of-fit to be assessed. In short, we can use the AIC to compare between models we have built, and we can use the residual and null deviances to understand the effect of including additional variables on our fit. Note that these terms should all be used relatively, i.e. there is no arbitrary AIC value to beat; model AICs should instead be compared to one another.
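For example, we can pull the AICs of our three models directly with base R's AIC function:

# compare the AICs of our three models (lower is better)
AIC(lg1, lg2, lg3)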
Our interpretation of coefficients is slightly different than in a linear model. We interpret the raw coefficients from a logistic model as:

Given a 1 unit increase in X, I expect a coeff increase in the log-odds of the event occurring

where here our "event" is the adoption of solar and coeff is the coefficient value.
For our interaction term, we can interpret the coefficient as:

Given a 1 unit increase in X1 (for a given X2), we expect a coeff increase in the log-odds of the event occurring.

In this case, that translates as:

Given a 1 unit increase in Expenditure, when the farmer is male I expect a 0.2 increase in the log-odds of solar adoption.
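To make this concrete, here is a minimal sketch using predict() to compare adoption probabilities for female vs male farmers at low and high expenditure; the values chosen for Acres, District and Repayment are arbitrary illustrative choices of ours, not from the lesson:

# predicted adoption probabilities at low vs high Expenditure, by Gender,
# holding the other variables at arbitrary but plausible values
newd <- expand.grid(Gender = c("female", "male"),
                    Expenditure = c(2, 6),
                    Acres = 2, District = "S", Repayment = 95)
newd$prob <- predict(lg3, newdata = newd, type = "response")
newd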
At this point, you might be asking what "log-odds" means. The odds ratio from a logistic regression tells you the multiplicative change in the odds of an event occurring; the log-odds is simply the logarithm of this term. To simplify things, let's convert our log-odds into simple odds ratios (ORs), and add some confidence intervals:
ORs = exp(cbind(OR = coef(lg3), confint(lg3)))
library(knitr)
kable(ORs, digits=3)
 | OR | 2.5 % | 97.5 % |
---|---|---|---|
(Intercept) | 5191149.074 | 0.000 | NA |
Gendermale | 0.027 | 0.011 | 0.061 |
Expenditure | 0.946 | 0.839 | 1.067 |
Acres | 0.363 | 0.286 | 0.457 |
DistrictC | 0.000 | NA | 2.93e+37 |
DistrictQ | 0.000 | NA | 2.27e+37 |
DistrictS | 0.000 | NA | 1.75e+37 |
Repayment | 1.003 | 0.967 | 1.040 |
Gendermale:Expenditure | 1.230 | 1.058 | 1.431 |
By taking the exponent, we have converted our log-odds into simple odds-ratios. We can now more easily interpret the model results:
We have an odds ratio of 1.003 for Repayment. So a 1 unit increase in Repayment is associated with a 1.003-fold increase in the odds of adopting solar; that is, a 0.3% increase in the odds (with everything else held constant).
So far, so good. Where it might get tricky is when we have an odds ratio of < 1.0. For example, how do we interpret Gendermale with its 0.027 odds ratio? The simplest way is to take the reciprocal and flip the class. So an OR of 0.027 for Gendermale becomes roughly 38 for Genderfemale (using the unrounded coefficient). In other words, female farmers are associated with a 38-fold increase in the odds of solar adoption (when other factors are held constant).
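You can compute this flip directly from the fitted model:

# flipping an odds ratio to the reference class: take the reciprocal
1 / exp(coef(lg3)["Gendermale"])  # ~38: odds multiplier for female vs male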
Therefore, by combining the reported significance of variables from the regression table with the exponentiated log-odds ratios, we can say that the following factors are significant:

- Gender: female OR = 38
- Acres: OR = 0.363
- Gender:Expenditure: OR = 1.23

Before we summarise our findings from this model, it is worth visiting one last assumption that applies to logistic and linear models.
We have now run a logistic regression on historical OAF data and checked the key assumptions; overall we are satisfied that the model is a reasonable fit for the data.
We are now in a place to interpret our results. Possibly the most interesting finding is that Repayment is non-significant, i.e. solar adoption is not associated with a change in end-of-season repayment percentage.
Our data also suggests that female farmers are far more likely to adopt solar, and that their decision is unrelated to weekly expenditure on lighting. In contrast, weekly expenditure on lighting is significantly correlated with adopting solar for male farmers. So much so, that every extra US dollar spent on lighting per week is associated with a 1.23-fold increase in the odds of solar adoption. We have therefore managed to pick out some interesting gender-specific effects.
It is worth re-iterating here that all we can find when analyzing historical data is correlations and not causations. This subtly changes our interpretation: for example, from the above coefficient table we can see that season loan repayment is not correlated with solar adoption. Given the benefits of solar as both a quality of life product and an expense-offsetting product, this might suggest a wider roll out of solar is impact-positive and finance-neutral for One Acre Fund. However, as this is not an RCT, we cannot be sure that there is not an unobservable factor which causes farmers to both adopt solar and repay their loans on time. For example, it could be that only wealthier farmers adopt solar, and therefore they are able to repay their loans regardless, and that pushing solar adoption to a wider demographic of farmers would lead to poorer repayment as less-wealthy farmers begin to adopt solar systems.
Likewise, smaller land sizes are associated with higher solar adoption. Perhaps this indicates that poorer farmers are more likely to adopt solar, or perhaps purchasing an expensive solar home system causes a reduction in planted acres (due to total loan size constraints). In a non RCT scenario, we have no way to understand causality, and so no way to assess these statements.
Be very careful when interpreting and explaining results from non-RCT analyses; as an analyst, it is your responsibility to make sure that decision-makers in One Acre Fund are aware of the caveats of your findings.
We have now seen how to run a simple logistic regression on farmer-level data; however, we have not yet accounted for random effects (see AMP lesson 3 for a refresher). We will now turn to our mixed effects model to understand the role that random effects have on our regression results. Please ensure you are familiar with the packages and concepts introduced in AMP lesson 3 before proceeding, as this lesson assumes you have worked through the AMP in sequence.
For logistic mixed effects modeling we will use the lme4 package (rather than the nlme package used in AMP lesson 3). The arguments are slightly different to the nlme package. For lme4 we want to specify:

- formula: as in the logistic regression above
- random effects: added directly to the formula as + (1|District), which will allow a separate intercept for each District
- family: binomial, as above

We will ignore the other terms for now; see the lme4 page for more advanced functionality.
Let’s run a mixed model using the findings from our simple model to inform on variable choice:
library(lme4)
mix1 <- glmer(Solar ~ Acres + Gender + Expenditure + Gender*Expenditure +
(1 | District), data = dat, family = binomial)
summary(mix1)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: Solar ~ Acres + Gender + Expenditure + Gender * Expenditure +
(1 | District)
Data: dat
AIC BIC logLik deviance df.resid
827.5 856.2 -407.7 815.5 885
Scaled residuals:
Min 1Q Median 3Q Max
-4.1220 -0.4524 -0.2954 0.4925 3.9327
Random effects:
Groups Name Variance Std.Dev.
District (Intercept) 0.03872 0.1968
Number of obs: 891, groups: District, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.80367 0.44737 8.502 < 2e-16 ***
Acres -1.02775 0.11772 -8.731 < 2e-16 ***
Gendermale -3.60501 0.43069 -8.370 < 2e-16 ***
Expenditure -0.04587 0.06126 -0.749 0.45406
Gendermale:Expenditure 0.20002 0.07668 2.609 0.00909 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Acres Gndrml Expndt
Acres -0.517
Gendermale -0.710 0.165
Expenditure -0.625 -0.193 0.676
Gndrml:Expn 0.600 -0.060 -0.902 -0.750
This output should look somewhat familiar (AMP lesson 3). Breaking down the sections: the header reports fit statistics (AIC, BIC, log-likelihood and deviance), followed by the scaled residuals, the random effects (the estimated District-level intercept variance), the fixed effects (read as in our simple logistic regression) and, finally, the correlations between fixed effects. Remember that our coefficients are all in logit space, and so need to be exponentiated to convert them to simple ORs:
# compute standard errors of the fixed effects
sers <- sqrt(diag(vcov(mix1)))
# table of fixed-effect estimates with 95% CIs (still in logit space)
tabf <- cbind(OR = fixef(mix1),
              LowerCI = fixef(mix1) - 1.96 * sers,
              UpperCI = fixef(mix1) + 1.96 * sers)
# summarise the District random intercepts
tabr <- ranef(mix1)$District
tar = data.frame("Mean intercept"=mean(tabr$`(Intercept)`), "Stdev"=sd(tabr$`(Intercept)`))
rownames(tar) = "District random effects"
# exponentiate to convert from logit space to odds ratios
kable(exp(tar))
Mean.intercept | Stdev | |
---|---|---|
District random effects | 1.000301 | 1.155549 |
An OR of 1.0 indicates that the variable does not influence solar adoption at all. We can see that District random effects, therefore, have a negligible effect on our outcome, suggesting that ICC is minimal. Let’s double check this with our ICC calculator from AMP lesson 2:
ICC_CI <- function(cluster_level, outcomevar, dataf){
  # load library
  require(ICC)
  set.seed(123)
  # resample 66% of the data 50 times, computing the ICC each time
  si <- round(nrow(dataf) * 0.66)
  values_f <- c()
  for(i in 1:50){
    samp_f <- dataf[sample(nrow(dataf), si), ]
    x_f <- ICCbare(cluster_level, outcomevar, samp_f)
    values_f <- c(values_f, x_f)
  }
  # note that 1.96 StDevs = 95% confidence interval bounds in a normal dist.
  ret <- data.frame("Mean ICC" = round(mean(values_f, na.rm=TRUE),3),
                    "CI" = round(1.96*sd(values_f, na.rm=TRUE),3))
  ret$Significant <- ifelse(ret$Mean.ICC > ret$CI, "Y", "N")
  return(ret)
}
# compute the District-level ICC for the solar adoption outcome
stored_ICC <- ICC_CI("District", "Solar", dat)
stored_ICC
Mean.ICC CI Significant
1 0.066 0.035 Y
The ICC calculation affirms that our ICC is minimal, at 0.066. We can therefore confidently say that District level heterogeneity does not significantly perturb our understanding of solar adoption (for the Districts in this data set).
Note that a fair critique here would be to ask how confident we can be that District-level random effects are minimal when we have so few districts; in other words, it is generally difficult to accurately specify a distribution with only a handful of data points. It is therefore possible that we merely have too low a sample size to detect significant random effects. Answering this question requires the application of some more advanced mathematics, and in keeping with the math-lite spirit of the AMP it will not be answered here. However, it is always good practice to be thinking about statistical power and sample size (AMP lesson 1).
Lastly, let’s compare the fits of our simple logistic regression to our mixed model:
anova(mix1,lg3,test="Chisq")
## Data: dat
## Models:
## mix1: Solar ~ Acres + Gender + Expenditure + Gender * Expenditure +
## mix1: (1 | District)
## lg3: Solar ~ Gender + Expenditure + Acres + District + Repayment +
## lg3: Gender * Expenditure
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mix1 6 827.46 856.21 -407.73 815.46
## lg3 9 827.94 871.08 -404.97 809.94 5.5118 3 0.1379
We can see that our Chi-squared test indicates that there is no significant difference in model performance (note the 0.14 p-value).
This means that we expect our models to yield similar coefficient results:
stargazer(lg3, mix1, type="html", title="Comparison of models")
Comparison of models. Dependent variable: Solar.

 | logistic (1) | generalized linear mixed-effects (2) |
---|---|---|
Gendermale | -3.630*** (0.432) | -3.605*** (0.431) |
Expenditure | -0.055 (0.061) | -0.046 (0.061) |
Acres | -1.014*** (0.120) | -1.028*** (0.118) |
DistrictC | -11.588 (624.143) | |
DistrictQ | -11.756 (624.143) | |
DistrictS | -12.145 (624.143) | |
Repayment | 0.003 (0.018) | |
Gendermale:Expenditure | 0.207*** (0.077) | 0.200*** (0.077) |
Constant | 15.462 (624.145) | 3.804*** (0.447) |
Observations | 891 | 891 |
Log Likelihood | -404.972 | -407.728 |
Akaike Inf. Crit. | 827.944 | 827.456 |
Bayesian Inf. Crit. | | 856.210 |

Note: *p<0.1; **p<0.05; ***p<0.01
Both models are in broad agreement about the significance and magnitude of variables. We could therefore use either model output to communicate our findings. My preferred method is the mixed model:
kable(exp(tabf))
OR | LowerCI | UpperCI | |
---|---|---|---|
(Intercept) | 44.8653693 | 18.6680801 | 107.8258367 |
Acres | 0.3578103 | 0.2840873 | 0.4506651 |
Gendermale | 0.0271873 | 0.0116884 | 0.0632377 |
Expenditure | 0.9551696 | 0.8470928 | 1.0770354 |
Gendermale:Expenditure | 1.2214262 | 1.0509899 | 1.4195018 |
As previously stated, we see that Gender, Acres and Gender:Expenditure correlate significantly with solar adoption.
This concludes lesson 4. After working through lessons 1-4 you should now be able to confidently analyse a wide range of data using the most appropriate tool.
This lesson should be used in conjunction with AMP lesson 3 to analyse RCT or non-RCT data:

- Power calculations and MDEs will enable you to understand the limitations of your study, both for hypothesis testing and regression analysis
- Present all relevant materials (diagnostics, justifications) in your reports; you can place these in the appendix
- Ensure reports are clear and reproducible; I highly recommend learning how to use R Markdown to produce reports combining text, figures and code

When conducting an analysis, be sure to check and include the following: