viewof smoother_select = Inputs.select(['hidden','visible']);mutable show_code = smoother_select=='visible';html`<style>.code-fold { display: ${show_code ?'block':'none'};}.dropdown-container { display: inline-block; position: relative; line-height: 1; /* Adjust line-height to match the text */}select { border: none; border-bottom: 2px dotted grey; background: none; font-size: 1em; padding: 0 5px; -webkit-appearance: none; -moz-appearance: none; appearance: none; line-height: 1; /* Adjust line-height to match the text */ vertical-align: baseline; text-align: center;}select:focus { outline: none;}/* Add a small arrow for the dropdown */.dropdown-container::after { content: '▼'; font-size: 0.7em; position: absolute; right: 5px; top: 50%; transform: translateY(-50%); pointer-events: none;}/* Adjusting padding and margin to align with the text */.dropdown-container select { padding-right: 20px;}</style>`
Code
html`To learn more about how the output in this article was created you can choose to show the code I used. Currently the code in this article is <span class="dropdown-container" style="width: 5rem;">${viewof smoother_select}</span>.`
The seminal paper in the synthetic control literature, Abadie and Gardeazabal (2003), explored the impact of terrorism on the Basque Country. In this article I reproduce some of the key figures and tables from this paper using R. In particular, I use tidyverse for data manipulation and visualisation. All the code is available so that others may use this as inspiration for their own analyses.
The paper explored the economic impact of terrorism in the Basque Country by Euskadi Ta Askatasuna (ETA). Economic impact was measured using the Gross Domestic Product (GDP) per capita.
In the paper, the ETA’s activity was summarised using a table showing key events along with the number of killings and kidnappings over time. I have visualised the pattern of activity as an annotated bar chart in Figure 1, using the same event descriptions and dates.
Since we can’t know what would have happened to the Basque Country’s GDP in the absence of terrorism we need to find a sensible comparison unit. The most obvious approach might be to select a region with the closest GDP per capita or even take an average of regions with a similar GDP per capita. However, we can do better. The trajectory of a region’s GDP can already be partially predicted by current characteristics of that region. For example, variables such as investment ratio and population density predict some of the changes to a region’s GDP. Therefore, to improve the suitability of our comparison unit, we construct it such that these predictors also match well. This is the approach taken in Abadie and Gardeazabal (2003), with the comparison unit referred to as a synthetic control.
In the study they match the per capita GDP (average for 1960-1969), investment ratio (average for 1964-1969), population density (1969), sectoral shares (average for 1961-1969), and human capital (1964-1969). Sectoral shares is the percentage of production by various sectors. Human capital is the percentage of education levels in the working-age population.
We load the required data from the Synth package in R. I have visualised which data is available each year in Figure 2. We see that the matching variables are available in the required years, although sector values are biennial.
The data is already in a tidy format as illustrated in Figure 3. This makes it particularly easy to prepare the matching variables using tidyverse functions.
We can use the pipe in Figure 4 to prepare the sector values. Note that the sector values are already percentages which is why we don’t divide the result by total production. Similar sequences of dplyr and tidyr functions were used to prepare the other variables as well.
Next we compute the synthetic control according to the approach in Abadie and Gardeazabal (2003). An explanation of the inputs required to apply the synthetic control method is shown in Table 1.
Table 1: Explanation of the inputs for the synthetic control method.
Variable
Description
\(X_1\)
(number of predictors)\(\times 1\) matrix of predictors for the treated unit
\(X_0\)
(number of predictors)\(\times\)(number of controls) matrix of predictors for the control units
\(Y_1^{\mathrm{pre}}\)
(number of pre-treatment periods)\(\times 1\) matrix of outcomes for the treated unit over the pre-treatment period
\(Y_0^{\mathrm{pre}}\)
(number of pre-treatment periods)\(\times\)(number of controls) matrix of outcomes for the control units over the pre-treatment period
In Abadie and Gardeazabal (2003) the synthetic control weights were determined by a bilevel optimisation. The outer optimisation involves finding predictor importance weights. The outer optimisation is not necessary when we know predictor importances a priori, but that is rarely the case. The inner optimisation involves finding the synthetic control weights, conditional on predictor importances. In mathematical terms, the synthetic control weights \(W^*\) are given by the following optimisation problem, \[
\begin{split}
W^*&:=W(V^*)\\
V^*&:=\underset{V}{\arg \min} L(V)\\
L(V)&=(Y_1^{\mathrm{pre}}-Y_0^{\mathrm{pre}}W(V))^{'}(Y_1^{\mathrm{pre}}-Y_0^{\mathrm{pre}}W(V))\\
W(V)&=\underset{W}{\arg \min} (X_1-X_0W)^{'} V (X_1-X_0W).
\end{split}
\tag{1}\]
Code
variable_name_lookup <-list(real_per_capita_gdp ="Real per capita GDP",investment_ratio ="Investment ratio (percentage)",population_density ="Population density",sec.agriculture ="Agriculture, forestry, and fishing",sec.energy ="Energy and water",sec.industry ="Industry",sec.construction ="Construction and engineering",sec.services.venta ="Marketable services",sec.services.nonventa ="Nonmarketable services",human_capital.illiterates ="Illiterates",human_capital.primary_or_no_studies ="Primary or without studies",human_capital.high_school ="High school",human_capital.high_school_or_more ="More than high school")basque.X <-list( col_gdp_per_cap, col_human_capital_pct, col_invest_ratio_pct, col_pop_density, col_sectoral_share_pct) |>reduce(inner_join, by=join_by(regionname)) |>mutate(regionname =clean_regions(regionname)) |>pivot_longer(-regionname) |>pivot_wider(names_from=regionname) |>select(basque_country, everything()) |>column_to_rownames('name') |>as.matrix() |>`[`(value=_,names(variable_name_lookup),)region_name_lookup <- basque$regionname |>unique() |>sort() |>remove_bracket()names(region_name_lookup) <-clean_regions(region_name_lookup)basque.y <- basque |>select(year, regionname, gdpcap) |>mutate(regionname =clean_regions(regionname)) |>arrange(regionname) |>pivot_wider(names_from = regionname, values_from = gdpcap) |>select(year, basque_country, everything()) |>arrange(year) |>column_to_rownames('year') |>as.matrix()
To find the optimal weights for this optimisation problem I used the synth function from the Synth package. The results of the optimisation is shown in Table 2. The outer optimisation solution is shown in Table 2 (a). The solution to the inner optimisation problem, given \(V^*\), is shown in Table 2 (b). These are the synthetic control weights. Only two of the regions have non-zero weights, Cataluna (0.851) and Madrid (0.149).
Table 2: The solution to Equation 1 using the Basque Country data from Abadie and Gardeazabal (2003).
(a) The solution to the upper-level (outer) optimisation, \(V^*\).
Predictor
Weight
Real per capita GDP
0.202
Investment ratio (percentage)
0.006
Population density
0.339
Agriculture, forestry, and fishing
0.095
Energy and water
0.008
Industry
0.133
Construction and engineering
0.010
Marketable services
0.002
Nonmarketable services
0.106
Illiterates
0.014
Primary or without studies
0.005
High school
0.045
More than high school
0.035
(b) The solution to the corresponding lower-level (inner) optimisation, \(W^*\).
Region
Weight
Andalucia
0.000
Aragon
0.000
Baleares
0.000
Canarias
0.000
Cantabria
0.000
Castilla Y Leon
0.000
Castilla-La Mancha
0.000
Cataluna
0.851
Comunidad Valenciana
0.000
Extremadura
0.000
Galicia
0.000
Madrid
0.149
Murcia
0.000
Navarra
0.000
Principado De Asturias
0.000
Rioja
0.000
Spain
0.000
Code
conversion_list <-c("Andalucía"="Andalucia","Aragón"="Aragon","Cantabria"="Cantabria","Castilla-La Mancha"="Castilla-La Mancha","Castilla y León"="Castilla Y Leon","Cataluña"="Cataluna","Ceuta y Melilla"=NA, # No direct counterpart"Comunidad de Madrid"="Madrid (Comunidad De)","Comunidad Foral de Navarra"="Navarra (Comunidad Foral De)","Comunidad Valenciana"="Comunidad Valenciana","Extremadura"="Extremadura","Galicia"="Galicia","Islas Baleares"="Baleares (Islas)","Islas Canarias"="Canarias","La Rioja"="Rioja (La)","País Vasco"="Basque Country (Pais Vasco)","Principado de Asturias"="Principado De Asturias","Región de Murcia"="Murcia (Region de)")pca_obj <-princomp(t(basque.X),cor=T)pc1_loadings <-as.matrix(pca_obj$loadings[,1])pc1 <-as.data.frame(t(basque.X) %*% pc1_loadings) |>rownames_to_column('region') |>rename(PC1 = V1)spain_sf <- sf::read_sf('spain-map/gadm41_ESP_1.shp')basque_pc1 <- pc1 |>filter(region=='basque_country') |>pull(PC1)joined_sf <- spain_sf |>mutate(region_nice =remove_bracket(as.character(conversion_list[NAME_1]))) |>drop_na(region_nice) |>mutate(region =clean_regions(region_nice)) |>left_join(pc1, by =join_by(region)) |>rowwise() |>mutate(x = sf::st_centroid(geometry)[[1]][1],y = sf::st_centroid(geometry)[[1]][2]) |>ungroup() |>mutate(PC1_prop = (PC1-basque_pc1)/basque_pc1)
To understand why the chosen regions are suitable, consider the first principal component visualised in Figure 5. While a principal component analysis is a fundamentally different approach, it appears to suggest the same conclusion, that the Basque Country’s characteristics are approximated by the weighted average of Madrid and Cataluna. Madrid is the only region with a higher first principal component than the Basque Country. Cataluna has a lower first principal component than the Basque Country, but is closer than other regions.
How well the different predictors match using the synthetic control is visualised in Figure 6. It appears from this figure that Madrid is a useful inclusion in the synthetic Basque Country because its characteristics are often the minimum or maximum among the regions. This is consistent with the story told by the first principal component.
In Abadie and Gardeazabal (2003) the characteristics for the Basque Country, Spain, and the synthetic control are summarised in a table which I have reproduced in Table 3.
When considering how well characteristics are matched between the treated unit and its synthetic control it is important to remember that not all predictors are equally important. Based on the pre-intervention outcomes, the importance of each predictor is visualised in Figure 7.
As another diagnostic step we can also consider how well principal components match. The model doesn’t explicitly try to match principal components, but if the principal components approximately match then the predictors will as well. The principal components are also easier to visualise since only the first few are important.
The principal components of the regions and synthetic control are visualised in Figure 8. It appears that the first four principal components are reasonably similar for the Basque Country and its synthetic counterpart.
In this case study it is relatively easy to understand how the synthetic control is constructed since the weighted average only uses two regions. Over time the synthetic Basque Country is a line between the GDP per capita of Cataluna and Madrid, as shown in Figure 9.
Finally, in Figure 10 we compare the Basque Country to the synthetic Basque Country. The gap between these, shown in Figure 11, is the estimated impact of terrorism in the Basque Country on its GDP per capita. The gap between the synthetic and actual Basque Country appears to coincide with the increase in terrorism activity in the late 1970s.
I hope that my version of the analysis have given some ideas for how the synthetic control method can be visualised and presented using a tidyverse approach.
References
Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.”American Economic Review 93 (1): 113132. https://doi.org/10.1257/000282803321455188.
Wickham, Hadley, Mine C-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Second edition. Beijing ; Sebastopol, CA: O’Reilly.
Footnotes
This research is supported by an Australian Government Research Training Program (RTP) Scholarship.↩︎