Synthetic Control Dataset: The Basque Country

PhD
SCM
R
Data-visualisation
Author

Filip Reierson1

Published

July 15, 2024

How this was made

The seminal paper in the synthetic control literature, Abadie and Gardeazabal (2003), explored the impact of terrorism on the Basque Country. In this article I reproduce some of the key figures and tables from this paper using R. In particular, I use tidyverse for data manipulation and visualisation. All the code is available so that others may use this as inspiration for their own analyses.

The paper explored the economic impact of terrorism in the Basque Country by Euskadi Ta Askatasuna (ETA). Economic impact was measured using the Gross Domestic Product (GDP) per capita.

In the paper, the ETA’s activity was summarised using a table showing key events along with the number of killings and kidnappings over time. I have visualised the pattern of activity as an annotated bar chart in Figure 1, using the same event descriptions and dates.

Code
library(patchwork)
library(tidyverse)
eta_violence <- data.frame(
  Year = 1968:2000,
  Killings = c(2, 1, 0, 0, 1, 6, 19, 16, 17, 11, 67, 76, 92, 30, 37, 32,
               32, 37, 41, 52, 19, 19, 25, 46, 26, 14, 13, 15, 5, 13, 6, 0, 23),
  Kidnappings = c(0, 0, 1, 0, 1, 1, 0, 0, 4, 1, 6, 13, 13, 10, 8, 5, 0, 3,
                  3, 1, 1, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0)
)
eta_event <- data.frame(
  Year = c(1968, 1973, 1975, 1977, 1978, 1979, 1981, 1986, 1992, 1998, 1999),
  Event = c(
    "First victim of ETA",
    "ETA kills Franco’s Prime Minister Admiral Carrero-Blanco",
    "Dictator Franco dies",
    "First democratic elections in Spain after Franco’s death",
    "Spanish Constitution approved in referendum",
    "Regional Autonomy Statute for the Basque Country approved",
    "Attempted military coup. Spain joins NATO",
    "Spain joins European Community",
    "Barcelona hosts the Summer Olympic Games",
    "ETA declares indefinite cease-fire starting on September 18",
    "ETA announces the end of cease-fire on November 28"
  )
)
p1 <- eta_violence |>
  left_join(eta_event, by=join_by(Year)) |>
  pivot_longer(-c(Year, Event)) |>
  ggplot(aes(Year,value)) +
    scale_x_continuous(breaks=1968:2000, expand = expansion(0,0)) +
    scale_y_continuous(breaks=\(x)scales::breaks_pretty(n=5)(x)|>setdiff(0), 
                       expand = expansion(mult = 0.05)) +
    facet_wrap(~name,scales = 'free_y',ncol=1, strip.position="left") +
    geom_bar(stat='identity') +
    geom_text(aes(label=ifelse(value>0,value,''),
                  vjust=-0.3), size=3) +
    labs(x='',y='') +
    coord_cartesian(xlim = c(1968-1,2000), expand = T, clip='off') +
    theme_minimal() +
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.grid.minor.y = element_blank(),
          axis.ticks.x.bottom = element_line(),
          axis.ticks.y.left = element_line(),
          axis.title.x.bottom = element_blank(),
          axis.text.x.bottom = element_text(angle=90,vjust=0.5,margin=margin(t=2.5,b=2.5)),
          plot.margin = margin(0,0,0,0),
          strip.text.y.right = element_text(margin = margin(l=10)),
          strip.placement = "outside")
event_lookup <- eta_event$Event
names(event_lookup) <- as.numeric(eta_event$Year)
p2 <- eta_event |>
  rowid_to_column('num') |>
  ggplot(aes(Year, y=-num, label=scales::label_wrap(100)(Event))) +
    scale_x_discrete(breaks=1968:2000) +
    coord_cartesian(xlim = c(1968-1,2000),ylim=c(-nrow(eta_event)-1,1),
                    expand = F, clip='off') +
    geom_linerange(aes(ymax=1,ymin=-num)) +
    geom_label(aes(hjust=-0.05+(Year>1983)*1.1),size=3,vjust=0.5,
               label.size=NA,fill="white", alpha=.75, 
               label.r = unit(0, "pt"),
               label.padding = unit(2, "pt")) +
    geom_linerange(aes(xmin=Year,xmax=Year+0.25-(Year>1983)*0.5)) +
    theme_void() +
    theme(axis.title.x.bottom = element_blank(),
          axis.text.x.bottom = element_blank())
p1 + p2 + plot_layout(ncol = 1,heights = c(1,1))
Figure 1: A visualisation of the data shown in Table 1 of Abadie and Gardeazabal (2003).

Since we can’t know what would have happened to the Basque Country’s GDP in the absence of terrorism we need to find a sensible comparison unit. The most obvious approach might be to select a region with the closest GDP per capita or even take an average of regions with a similar GDP per capita. However, we can do better. The trajectory of a region’s GDP can already be partially predicted by current characteristics of that region. For example, variables such as investment ratio and population density predict some of the changes to a region’s GDP. Therefore, to improve the suitability of our comparison unit, we construct it such that these predictors also match well. This is the approach taken in Abadie and Gardeazabal (2003), with the comparison unit referred to as a synthetic control.

In the study they match the per capita GDP (average for 1960-1969), investment ratio (average for 1964-1969), population density (1969), sectoral shares (average for 1961-1969), and human capital (1964-1969). Sectoral shares is the percentage of production by various sectors. Human capital is the percentage of education levels in the working-age population.

We load the required data from the Synth package in R. I have visualised which data is available each year in Figure 2. We see that the matching variables are available in the required years, although sector values are biennial.

Code
data("basque", package = 'Synth')
remove_bracket <- function(region) {
  str_replace(region, "\\s*\\([^\\)]+\\)", "")
}
clean_regions <- function(region) {
  janitor::make_clean_names(remove_bracket(region), allow_dupes = T)
}
variable_names <- list(
  regionno = "Region Number",
  regionname = "Region Name",
  year = "Year",
  gdpcap = "GDP per Capita",
  sec.agriculture = "Agriculture (%)",
  sec.energy = "Energy (%)",
  sec.industry = "Industry (%)",
  sec.construction = "Construction (%)",
  sec.services.venta = "Market Services (%)",
  sec.services.nonventa = "Nonmarket Services (%)",
  school.illit = "Illiterate Persons",
  school.prim = "Primary Education",
  school.med = "Some High School",
  school.high = "High School Degree",
  school.post.high = "Tertiary Education",
  popdens = "Population Density",
  invest = "Investment Share"
)

basque |>
  group_by(year) |>
  summarise(across(everything(), compose(mean, is.na))) |>
  pivot_longer(-year) |>
  mutate(value = case_when(
    value == 0 ~ 'No',
    value == 1 ~ 'Yes',
    T ~ 'Some'
  )) |>
  mutate(name = factor(name,levels=colnames(basque))) |>
  ggplot(aes(year, fct_rev(name), fill=factor(value))) +
    geom_tile(color='white',lwd=.25) +
    scale_fill_manual(values = list('No'='darkgreen',
                                    'Yes'='lightgrey')) +
    coord_fixed(ratio = 1) +
    scale_x_continuous(breaks = scales::pretty_breaks(n=10), expand = expansion(add=0.1)) + 
    scale_y_discrete(labels=\(x)variable_names[x], breaks=colnames(basque), expand = expansion(add=0.75)) +
    labs(x='', y='', fill='Missing?') +
    theme_minimal() +
    theme(axis.text.x.bottom = element_text(),
          axis.ticks = element_line(),
          panel.grid = element_blank(),
          axis.text.y.left = element_text(size = 6),
          axis.title.x.bottom = element_blank(),
          axis.title.y.left = element_blank(),
          plot.margin = margin(),
          legend.margin = margin(),
          legend.box.margin = margin(),
          legend.box.spacing = unit(0.25, 'cm'),
          legend.key.size = unit(12,'pt'),
          legend.position = 'top')
Figure 2: For each year and variable combination either all or none of the regions were missing. This graph shows for which time periods each variable was available.

The data is already in a tidy format as illustrated in Figure 3. This makes it particularly easy to prepare the matching variables using tidyverse functions.

Code
table_plot <- basque |>
  head() |>
  mutate(regionname = remove_bracket(regionname)) |>
  select(regionname, year, gdpcap) |>
  mutate(gdpcap = round(gdpcap, 2)) |>
  rowid_to_column() |>
  pivot_longer(-rowid, values_transform=as.character) |>
  mutate(name = factor(name, levels=c('year', 'regionname', 'gdpcap'))) |>
  ggplot(aes(
    x=name,
    y=rowid,
    label=value
  )) +
  scale_x_discrete(position = "top") +
  scale_y_reverse() +
  geom_text(size=3.5) +
  theme_void() +
  annotate(geom='segment', x=0:3+0.5, xend=0:3+0.5,y=0.5,yend=6.5, 
           color='black') +
  annotate(geom='segment', x=0.5, xend=3.5,y=0:6+0.5,yend=0:6+0.5, 
           color='black') +
  theme(axis.text.x.top=element_text(margin=margin(b=5), size=8, face = 'bold'),
        plot.title = element_text(margin=margin(b=15), hjust=0.5, size=12))
table_plot_variables <- table_plot +
  annotate(
    geom = 'segment',
    x = 1:3,
    xend = 1:3,
    y = 1,
    yend = 6,
    lwd = 1,
    lineend = 'round',
    linejoin = 'mitre',
    arrow = arrow(
      length = unit(0.2, "cm"),
      ends = 'both'
    ),
    color = 'darkorange'
  ) +
  labs(title = 'Variables')
table_plot_obs <- table_plot +
  annotate(
    geom = 'segment',
    y = 1:6,
    yend = 1:6,
    x = 1,
    xend = 3,
    lwd = 1,
    lineend = 'round',
    linejoin = 'mitre',
    arrow = arrow(
      length = unit(0.2, "cm"),
      ends = 'both'
    ),
    color = 'darkorange'
  ) +
  labs(title = 'Observations')
table_plot_values <- table_plot +
  geom_point(
    aes(x = x, y = y),
    data = expand.grid(x = 1:3, y = 1:6),
    pch = 1,
    stroke = 2,
    size = 3,
    color = 'darkorange',
    inherit.aes = F
  ) +
  labs(title = 'Values')
table_plot_obs | table_plot_variables | table_plot_values
Figure 3: Data is considered tidy when each row is an observation, each column is a variable, and each cell contains a single value. Tidy data is a good starting point for most data preparation and visualisation tasks. The diagram is based Figure 5.1 in Wickham, C-Rundel, and Grolemund (2023).

We can use the pipe in Figure 4 to prepare the sector values. Note that the sector values are already percentages which is why we don’t divide the result by total production. Similar sequences of dplyr and tidyr functions were used to prepare the other variables as well.

graph TD
    A["basque"] --> B["filter(between(year, 1961, 1969))"]
    B --> C["select(regionname, starts_with('sec'))"]
    C --> D["pivot_longer(-regionname, names_to = 'sector')"]
    D --> E["group_by(regionname, sector)"]
    E --> F["summarise(mean = mean(value, na.rm = T), .groups='drop')"]
    F --> G["pivot_wider(names_from = sector, values_from = mean)"]
Figure 4: A pipe that prepares the sectoral share predictor variables.
Code
col_gdp_per_cap <- basque |> 
  filter(between(year,1960,1969)) |>
  group_by(regionname) |>
  summarise(real_per_capita_gdp = mean(gdpcap)*1000)

col_invest_ratio_pct <- basque |> 
  filter(between(year,1964,1969)) |>
  group_by(regionname) |>
  summarise(investment_ratio = mean(invest))

col_pop_density <- basque |> 
  filter(year==1969) |>
  group_by(regionname) |>
  summarise(population_density = mean(popdens))

col_sectoral_share_pct <- basque |>
  filter(between(year,1961,1969)) |>
  select(regionname, starts_with('sec')) |>
  pivot_longer(-regionname, names_to = 'sector') |>
  group_by(regionname, sector) |>
  summarise(mean = mean(value,na.rm = T), .groups='drop') |>
  pivot_wider(names_from = sector, values_from = mean)

col_human_capital_pct <- basque |>
  filter(between(year,1964,1969)) |>
  select(year, regionname, starts_with('school')) |>
  pivot_longer(-c(year, regionname)) |>
  group_by(year, regionname) |>
  mutate(value = value / sum(value) * 100) |>
  group_by(regionname, name) |>
  summarise(value = mean(value), .groups = 'drop') |>
  pivot_wider() |>
  rename(
    human_capital.illiterates = school.illit,
    human_capital.primary_or_no_studies = school.prim,
    human_capital.high_school = school.med
  ) |>
  mutate(human_capital.high_school_or_more = school.high + school.post.high) |>
  select(-starts_with('school'))

Next we compute the synthetic control according to the approach in Abadie and Gardeazabal (2003). An explanation of the inputs required to apply the synthetic control method is shown in Table 1.

Table 1: Explanation of the inputs for the synthetic control method.
Variable Description
\(X_1\) (number of predictors)\(\times 1\) matrix of predictors for the treated unit
\(X_0\) (number of predictors)\(\times\)(number of controls) matrix of predictors for the control units
\(Y_1^{\mathrm{pre}}\) (number of pre-treatment periods)\(\times 1\) matrix of outcomes for the treated unit over the pre-treatment period
\(Y_0^{\mathrm{pre}}\) (number of pre-treatment periods)\(\times\)(number of controls) matrix of outcomes for the control units over the pre-treatment period

In Abadie and Gardeazabal (2003) the synthetic control weights were determined by a bilevel optimisation. The outer optimisation involves finding predictor importance weights. The outer optimisation is not necessary when we know predictor importances a priori, but that is rarely the case. The inner optimisation involves finding the synthetic control weights, conditional on predictor importances. In mathematical terms, the synthetic control weights \(W^*\) are given by the following optimisation problem, \[ \begin{split} W^*&:=W(V^*)\\ V^*&:=\underset{V}{\arg \min} L(V)\\ L(V)&=(Y_1^{\mathrm{pre}}-Y_0^{\mathrm{pre}}W(V))^{'}(Y_1^{\mathrm{pre}}-Y_0^{\mathrm{pre}}W(V))\\ W(V)&=\underset{W}{\arg \min} (X_1-X_0W)^{'} V (X_1-X_0W). \end{split} \tag{1}\]

Code
variable_name_lookup <- list(
  real_per_capita_gdp = "Real per capita GDP",
  investment_ratio = "Investment ratio (percentage)",
  population_density = "Population density",
  sec.agriculture = "Agriculture, forestry, and fishing",
  sec.energy = "Energy and water",
  sec.industry = "Industry",
  sec.construction = "Construction and engineering",
  sec.services.venta = "Marketable services",
  sec.services.nonventa = "Nonmarketable services",
  human_capital.illiterates = "Illiterates",
  human_capital.primary_or_no_studies = "Primary or without studies",
  human_capital.high_school = "High school",
  human_capital.high_school_or_more = "More than high school"
)

basque.X <- list(
  col_gdp_per_cap,
  col_human_capital_pct,
  col_invest_ratio_pct,
  col_pop_density,
  col_sectoral_share_pct
) |>
  reduce(inner_join, by=join_by(regionname)) |>
  mutate(regionname = clean_regions(regionname)) |>
  pivot_longer(-regionname) |>
  pivot_wider(names_from=regionname) |>
  select(basque_country, everything()) |>
  column_to_rownames('name') |>
  as.matrix() |>
  `[`(value=_,names(variable_name_lookup),)

region_name_lookup <- basque$regionname |> unique() |> sort() |> remove_bracket()
names(region_name_lookup) <- clean_regions(region_name_lookup)

basque.y <- basque |>
  select(year, regionname, gdpcap) |>
  mutate(regionname = clean_regions(regionname)) |>
  arrange(regionname) |>
  pivot_wider(names_from = regionname, values_from = gdpcap) |>
  select(year, basque_country, everything()) |>
  arrange(year) |>
  column_to_rownames('year') |>
  as.matrix()
Code
pre_treatment <- between(as.numeric(rownames(basque.y)),1960,1969)
if(!file.exists('basque_synth.Rds')) {
  basque_synth <- Synth::synth(X1=basque.X[,1,drop=F], X0=basque.X[,-1,drop=F], 
             Z1=basque.y[pre_treatment,1,drop=F], Z0=basque.y[pre_treatment,-1,drop=F])
  saveRDS(basque_synth, 'basque_synth.Rds')
}
basque_synth <- readRDS('basque_synth.Rds')

To find the optimal weights for this optimisation problem I used the synth function from the Synth package. The results of the optimisation is shown in Table 2. The outer optimisation solution is shown in Table 2 (a). The solution to the inner optimisation problem, given \(V^*\), is shown in Table 2 (b). These are the synthetic control weights. Only two of the regions have non-zero weights, Cataluna (0.851) and Madrid (0.149).

Code
basque_synth$solution.v |>
  as.data.frame() |>
  pivot_longer(everything(), names_to = 'Predictor', values_to = 'Weight') |>
  mutate(Predictor=variable_name_lookup[Predictor]) |>
  knitr::kable(digits = 3)
Code
basque_synth$solution.w |>
  as.data.frame() |>
  rownames_to_column('Region') |>
  mutate(Region=region_name_lookup[Region]) |>
  rename('Weight'=w.weight) |>
  knitr::kable(digits = 3)
Table 2: The solution to Equation 1 using the Basque Country data from Abadie and Gardeazabal (2003).
(a) The solution to the upper-level (outer) optimisation, \(V^*\).
Predictor Weight
Real per capita GDP 0.202
Investment ratio (percentage) 0.006
Population density 0.339
Agriculture, forestry, and fishing 0.095
Energy and water 0.008
Industry 0.133
Construction and engineering 0.010
Marketable services 0.002
Nonmarketable services 0.106
Illiterates 0.014
Primary or without studies 0.005
High school 0.045
More than high school 0.035
(b) The solution to the corresponding lower-level (inner) optimisation, \(W^*\).
Region Weight
Andalucia 0.000
Aragon 0.000
Baleares 0.000
Canarias 0.000
Cantabria 0.000
Castilla Y Leon 0.000
Castilla-La Mancha 0.000
Cataluna 0.851
Comunidad Valenciana 0.000
Extremadura 0.000
Galicia 0.000
Madrid 0.149
Murcia 0.000
Navarra 0.000
Principado De Asturias 0.000
Rioja 0.000
Spain 0.000
Code
conversion_list <- c(
  "Andalucía" = "Andalucia",
  "Aragón" = "Aragon",
  "Cantabria" = "Cantabria",
  "Castilla-La Mancha" = "Castilla-La Mancha",
  "Castilla y León" = "Castilla Y Leon",
  "Cataluña" = "Cataluna",
  "Ceuta y Melilla" = NA,  # No direct counterpart
  "Comunidad de Madrid" = "Madrid (Comunidad De)",
  "Comunidad Foral de Navarra" = "Navarra (Comunidad Foral De)",
  "Comunidad Valenciana" = "Comunidad Valenciana",
  "Extremadura" = "Extremadura",
  "Galicia" = "Galicia",
  "Islas Baleares" = "Baleares (Islas)",
  "Islas Canarias" = "Canarias",
  "La Rioja" = "Rioja (La)",
  "País Vasco" = "Basque Country (Pais Vasco)",
  "Principado de Asturias" = "Principado De Asturias",
  "Región de Murcia" = "Murcia (Region de)"
)
pca_obj <- princomp(t(basque.X),cor=T)
pc1_loadings <- as.matrix(pca_obj$loadings[,1])
pc1 <- as.data.frame(t(basque.X) %*% pc1_loadings) |>
  rownames_to_column('region') |>
  rename(PC1 = V1)
spain_sf <- sf::read_sf('spain-map/gadm41_ESP_1.shp')
basque_pc1 <- pc1 |> filter(region=='basque_country') |> pull(PC1)
joined_sf <- spain_sf |>
  mutate(region_nice = remove_bracket(as.character(conversion_list[NAME_1]))) |>
  drop_na(region_nice) |>
  mutate(region = clean_regions(region_nice)) |>
  left_join(pc1, by = join_by(region)) |>
  rowwise() |>
  mutate(x = sf::st_centroid(geometry)[[1]][1],
         y = sf::st_centroid(geometry)[[1]][2]) |>
  ungroup() |>
  mutate(PC1_prop = (PC1-basque_pc1)/basque_pc1)

To understand why the chosen regions are suitable, consider the first principal component visualised in Figure 5. While a principal component analysis is a fundamentally different approach, it appears to suggest the same conclusion, that the Basque Country’s characteristics are approximated by the weighted average of Madrid and Cataluna. Madrid is the only region with a higher first principal component than the Basque Country. Cataluna has a lower first principal component than the Basque Country, but is closer than other regions.

Code
pc1_prop_range <- joined_sf$PC1_prop |> range()
highlight_regions <- joined_sf|>filter(region%in%c('madrid','cataluna','basque_country'))
canary_islands_sf <- joined_sf|>
  filter(region=='canarias')|>
  mutate(geometry = sf::st_set_crs(sf::st_geometry(geometry)+c(6,6), value="WGS84"),
         bb = list(sf::st_bbox(geometry)))
canary_islands_bb <- canary_islands_sf$bb[[1]] |> as.matrix() |> t() |> as.data.frame()
p_pc1_map <- joined_sf |>
  filter(region!='canarias') |>
  ggplot() +
    geom_sf(aes(fill=PC1_prop)) +
    geom_sf(aes(fill=PC1_prop), data=canary_islands_sf) +
    geom_rect(aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax), 
              data=canary_islands_bb,fill=NA,color='black') +
    geom_segment(aes(x,y,yend=y+0.5,xend=x-0.25),
              data=highlight_regions) +
    geom_label(aes(x,y,label=region_nice),
              data=highlight_regions,fontface='bold',color='black',nudge_y=.7,nudge_x=-0.2,
              hjust=1, label.r = unit(1,'mm'), label.padding = unit(.125,'lines'),
              fill='white') +
    coord_sf(clip = 'off') +
    expand_limits(x=-10,y=35) +
    scale_fill_gradient2(limits=pc1_prop_range,label=scales::label_percent(1)) +
    labs(fill='') +
    theme_void() +
    theme(legend.position = 'right')
p_pc1_map
Figure 5: The first principal component of the regions relative to the basque country. The formula is (PC1-(Basque Country’s PC1))/(Basque Country’s PC1). The Canary Islands are shown as an inset map.
Code
obs10 <- c("#4269d0", "#efb118", "#ff725c", "#6cc5b0", "#3ca951", "#ff8ab7", "#a463f2", "#97bbf5", "#9c6b4e", "#9498a0")

focus_regions <- c('madrid','cataluna','basque_country','synthetic_basque')
pca_synth <- t(basque_synth$solution.w)%*%pca_obj$scores[-1,] |>
  as.data.frame() |>
  mutate(region='synthetic_basque') |>
  tibble() |>
  relocate(region) |>
  mutate(region_nice = 'Synthetic Basque Country')
pca_df <- pca_obj$scores |>
  as.data.frame() |>
  rownames_to_column('region') |>
  mutate(region_nice = region_name_lookup[region]) |>
  bind_rows(pca_synth)
legend_labeller <- \(x)ifelse(x=='synthetic_basque', 'Synthetic Basque Country', region_name_lookup[x])
color_pal <- c(
      'madrid' = obs10[[1]],
      'cataluna' = obs10[[2]],
      'basque_country' = obs10[[3]],
      'synthetic_basque' = 'black',
      'synthetic_control' = 'black'
    )
get_scatter <- function(var1, var2, var1_title=var1, var2_title=var2) {
  plot_range <- range(c(pca_df[var1], pca_df[var2]))
  pca_df |>
    filter(!region %in% focus_regions) |>
    ggplot(aes(!!sym(var1), !!sym(var2))) +
    geom_point(pch=1) +
    geom_point(aes(shape = region, color = region),
               data = pca_df |>
                 filter(region %in% focus_regions),
               size=2.5) +
    coord_fixed(xlim = plot_range, ylim = plot_range) +
    scale_shape_manual(
      values = list(
        'madrid' = 16,
        'cataluna' = 17,
        'basque_country' = 15,
        'synthetic_basque' = 0
      ),
      label = legend_labeller
    ) +
    scale_color_manual(
      values = color_pal,
      label = legend_labeller
    ) +
    labs(color = '', shape = '',
         x=var1_title, y=var2_title) +
    theme_minimal() +
    theme(axis.ticks = element_line()) +
    theme(axis.title.x.bottom = element_text(margin=margin(t=5)),
          axis.title.y.left = element_text(margin=margin(r=5)),
          axis.title = element_text(size=8),
          legend.position = 'none')
}
p_pc1_pc2 <- get_scatter('Comp.1',
            'Comp.2',
            'First principal component',
            'Second principal component')
p_pc3_pc4 <- get_scatter('Comp.3',
              'Comp.4',
              'Third principal component',
              'Fourth principal component') 
p_pc5_pc6 <- get_scatter('Comp.5',
              'Comp.6',
              'Fifth principal component',
              'Sixth principal component') 
X0.W <- basque.X[,-1]%*%basque_synth$solution.w
colnames(X0.W) <- 'synthetic_basque'
predictors_long <- cbind(X0.W, basque.X) |>
  as.data.frame() |>
  rownames_to_column('predictor') |>
  pivot_longer(-predictor) |>
  mutate(predictor = unlist(variable_name_lookup[predictor]),
         region = region_name_lookup[name]) |>
  group_by(predictor) |>
  mutate(value = scale(value)) |>
  mutate(predictor = fct_inorder(predictor))
p_predictors <- predictors_long |>
  filter(!name%in%focus_regions) |>
  ggplot(aes(value, predictor)) +
  geom_point(aes(shape=name, color=name),
             data=predictors_long |>
               filter(name%in%focus_regions),
             size=3) +
  geom_point(shape = 1,alpha=.5) +
  scale_shape_manual(
    values = list(
      'madrid' = 16,
      'cataluna' = 17,
      'basque_country' = 15,
      'synthetic_basque' = 0
    ),
    label = legend_labeller
  ) +
  scale_color_manual(
    values = color_pal,
    label = legend_labeller
  ) +
  labs(
    y = '',
    x = 'Scaled predictor',
    color = '',
    shape = ''
  ) +
  theme_minimal() +
  theme(axis.ticks = element_line()) +
  theme(legend.position = 'top',
        axis.title.x.bottom = element_text(margin = margin(t = 10)))

How well the different predictors match using the synthetic control is visualised in Figure 6. It appears from this figure that Madrid is a useful inclusion in the synthetic Basque Country because its characteristics are often the minimum or maximum among the regions. This is consistent with the story told by the first principal component.

In Abadie and Gardeazabal (2003) the characteristics for the Basque Country, Spain, and the synthetic control are summarised in a table which I have reproduced in Table 3.

When considering how well characteristics are matched between the treated unit and its synthetic control it is important to remember that not all predictors are equally important. Based on the pre-intervention outcomes, the importance of each predictor is visualised in Figure 7.

Code
p_predictors
Figure 6: The value of predictors for the autonomous regions of Spain along with the corresponding values implied by the synthetic Basque Country.
Code
library(gt)
cbind(basque.X[,c('basque_country','spain')],
      basque.X[,-1,drop=F]%*%basque_synth$solution.w) |>
  as.data.frame() |>
  rownames_to_column('variable') |>
  rename('Synthetic control'=last_col()) |>
  relocate(last_col(), .after = 'spain') |>
  mutate(indent = str_detect(variable, '\\.')) |>
  gt(rowname_col = 'variable') |>
  tab_footnote(
    footnote = "1986 USD, average for 1960–1969.",
    locations = cells_stub('real_per_capita_gdp')
  ) |>
  tab_footnote(
    footnote = "Gross Total Investment/GDP, average for 1964–1969.",
    locations = cells_stub('investment_ratio')
  ) |>
  tab_footnote(
    footnote = "Persons per square kilometer, 1969.",
    locations = cells_stub('population_density')
  ) |>
  tab_header(
    title = "Pre-Terrorism Characteristics, 1960's"
  ) |>
  cols_label_with(fn=\(x)ifelse(x%in%names(region_name_lookup), region_name_lookup[x], x)) |>
  tab_row_group(rows = starts_with('sec.'), label = 'Sectoral shares (percentage)', id='sector') |>
  tab_row_group(rows = starts_with('human_capital.'), label = 'Human capital (percentage)', id='human_capital') |>
  row_group_order(groups = c(NA, 'sector', 'human_capital')) |>
  tab_footnote(
    footnote = "Percentages over total production, 1961–1969.",
    locations = cells_row_groups(groups='sector')
  ) |>
  tab_footnote(
    footnote = "Percentages over working-age population, 1964–1969.",
    locations = cells_row_groups(groups='human_capital')
  ) |>
  fmt_number(columns = -1) |>
  tab_style(
    style = cell_text(indent = px(20)),
    locations = cells_stub(
      rows = indent
    )
  ) |>
  tab_style(style = cell_borders(sides=c('top'),color = 'white',weight = px(2)), 
            locations = cells_row_groups(1)) |>
  cols_hide(indent) |>
  fmt(columns = 1, fns=\(x){as.character(variable_name_lookup[x])}) |>
  tab_options(footnotes.marks = 'letters')
Table 3: My reproduction of Table 3 from Abadie and Gardeazabal (2003) as a graph.
Pre-Terrorism Characteristics, 1960's
Basque Country Spain Synthetic control
Real per capita GDPa 5,285.47 3,633.25 5,270.79
Investment ratio (percentage)b 24.65 21.79 21.58
Population densityc 246.89 66.34 196.29
Sectoral shares (percentage)d
Agriculture, forestry, and fishing 6.84 16.34 6.18
Energy and water 4.11 4.32 2.76
Industry 45.08 26.60 37.64
Construction and engineering 6.15 7.25 6.95
Marketable services 33.75 38.53 41.10
Nonmarketable services 4.07 6.97 5.37
Human capital (percentage)e
Illiterates 3.32 11.66 7.65
Primary or without studies 85.97 80.15 82.33
High school 7.46 5.49 6.92
More than high school 3.26 2.70 3.10
a 1986 USD, average for 1960–1969.
b Gross Total Investment/GDP, average for 1964–1969.
c Persons per square kilometer, 1969.
d Percentages over total production, 1961–1969.
e Percentages over working-age population, 1964–1969.
Code
basque_synth$solution.v |>
  as.data.frame() |>
  pivot_longer(everything()) |>
  mutate(weight = value / sum(value)) |>
  ggplot(aes(fct_reorder(name, weight,.desc = T), weight)) +
    geom_bar(stat='identity',width=.75) +
    geom_text(aes(label=scales::label_number(0.1,scale=100)(weight)),
              vjust=-0.5, color='black', fontface='bold') +
    scale_y_continuous(labels = scales::label_number(scale = 100), expand = expansion(mult=0.025)) +
    scale_x_discrete(labels=\(x)variable_name_lookup[x], expand = expansion(mult=0.05)) +
    coord_cartesian(clip='off') +
    labs(x='',y='Relative importance (%)') +
    theme_minimal() +
    theme(panel.grid.major.x = element_blank(),
          axis.title.y.left = element_text(margin = margin(r=10)),
          axis.text.x.bottom = element_text(angle=45,hjust=1,vjust=1),
          plot.margin = margin(t=10),
          axis.ticks = element_line())
Figure 7: The importance of matching predictors based on pre-intervention data. This is essentially Table 2 (a) visualised.

As another diagnostic step we can also consider how well principal components match. The model doesn’t explicitly try to match principal components, but if the principal components approximately match then the predictors will as well. The principal components are also easier to visualise since only the first few are important.

The principal components of the regions and synthetic control are visualised in Figure 8. It appears that the first four principal components are reasonably similar for the Basque Country and its synthetic counterpart.

Code
pca_var <- as.numeric(pca_obj$sdev)^2
p_scree <- data.frame(
  component = 1:length(pca_var),
  variance_explained = pca_var
) |>
  ggplot(aes(component,variance_explained)) +
    geom_point(pch=1,size=2) +
    geom_line() +
    geom_hline(yintercept = 1,lty=2) +
    scale_x_continuous(breaks = 1:length(pca_var)) +
    labs(x='Principal Component',
         y='Eigenvalue') +
    theme_minimal() +
    theme(axis.title.x.bottom = element_text(margin=margin(t=10)),
          axis.title.y.left = element_text(margin=margin(r=10)),
          axis.ticks = element_line(),
          panel.grid.minor.x = element_blank(),
          panel.grid.major.x = element_blank())
Code
((p_pc1_pc2+p_pc3_pc4+p_pc5_pc6 & theme(legend.position='top')) + plot_layout(guides='collect'))/p_scree
Figure 8: The value of principal components for the autonomous regions of Spain along with the corresponding values implied by the synthetic Basque Country.

In this case study it is relatively easy to understand how the synthetic control is constructed since the weighted average only uses two regions. Over time the synthetic Basque Country is a line between the GDP per capita of Cataluna and Madrid, as shown in Figure 9.

Code
weights <- basque_synth$solution.w |> round(3)
d <- basque.y[,-1][,weights!=0,drop=F] |>
  as.data.frame() |>
  mutate(synthetic_basque=basque.y[,-1]%*%weights|>as.numeric()) |>
  rownames_to_column('year') |>
  mutate(year = as.numeric(year))
minimal_ts_theme <- theme_minimal() +
    theme(
      axis.text.x.bottom = element_text(),
      axis.ticks = element_line(),
      panel.grid = element_blank(),
      axis.text.y.left = element_text(),
      axis.title.x.bottom = element_blank(),
      plot.margin = margin(),
      axis.title.y.left = element_text(margin = margin(r = 10)),
      legend.position = c(0.05, 0.9),
      legend.title = element_text(size = 8),
      legend.text = element_text(size = 8),
      legend.key.size = unit(0.3, 'cm'),
      legend.direction = 'horizontal',
      legend.justification = c(0, 0)
    )
d |>
  pivot_longer(-year) |>
  ggplot() +
    geom_line(aes(x=year, y=value, color=name, lty=name), lwd=0.75) +
    scale_color_manual(
      values=color_pal,
      labels=legend_labeller
    ) +
    scale_linetype_manual(
      values=c('cataluna'=1,'madrid'=1,'synthetic_basque'=2),
      labels=legend_labeller
    ) +
    scale_y_continuous(labels = scales::label_dollar(scale = 1000), expand = expansion(mult=.01)) +
    scale_x_continuous(breaks = scales::pretty_breaks(n=10), expand = expansion(mult=0.01)) +
    labs(y='Real per capita GDP',color='',lty='') +
    minimal_ts_theme
Figure 9: The GDP per capita for Cataluna, Madrid, and their weighted average the synthetic Basque Country.

Finally, in Figure 10 we compare the Basque Country to the synthetic Basque Country. The gap between these, shown in Figure 11, is the estimated impact of terrorism in the Basque Country on its GDP per capita. The gap between the synthetic and actual Basque Country appears to coincide with the increase in terrorism activity in the late 1970s.

Code
weights <- basque_synth$solution.w |> round(3)
d2 <- basque.y[,c(T,weights!=0),drop=F] |>
  as.data.frame() |>
  mutate(synthetic_basque=basque.y[,-1]%*%weights|>as.numeric()) |>
  rownames_to_column('year') |>
  mutate(year = as.numeric(year))
d2 |>
  select(year,basque_country,synthetic_basque) |>
  pivot_longer(-year) |>
  ggplot() +
    geom_line(aes(x=year, y=value, color=name, lty=name), lwd=0.75) +
    scale_color_manual(
      values=color_pal,
      labels=legend_labeller
    ) +
    scale_linetype_manual(
      values=c('basque_country'=1,'synthetic_basque'=2),
      labels=legend_labeller
    ) +
    scale_y_continuous(labels = scales::label_dollar(scale = 1000), expand = expansion(mult=.01)) +
    scale_x_continuous(breaks = scales::pretty_breaks(n=10), expand = expansion(mult=0.01)) +
    labs(y='Real per capita GDP',color='',lty='') +
    minimal_ts_theme
Figure 10: The GDP per capita for Cataluna, Madrid, and their weighted average the synthetic Basque Country.
Code
year_range <- range(d2$year)
p_violence <- eta_violence |>
  mutate(Violence = Killings+Kidnappings) |>
  ggplot(aes(Year,Violence)) +
    geom_area(fill='darkred') +
    coord_cartesian(xlim = year_range) +
    minimal_ts_theme
p_effect <- d2 |>
  mutate(delta = basque_country - synthetic_basque) |>
  ggplot(aes(year, delta)) +
    geom_line() +
    coord_cartesian(xlim = year_range) +
    scale_y_continuous(labels = scales::label_dollar(scale = 1000), expand = expansion(mult=.01)) +
    geom_hline(yintercept=0,lty=2) +
    labs(y=expression( hat(delta)[t] )) +
    minimal_ts_theme
p_effect/p_violence
Figure 11: In the top panel is the estimated impact of terrorism on the GDP per capita for the Basque Country. The estimated treatment effect is given by \(\hat\delta_t=(Y_1)_t-(Y_0W^*)_t\), where \(Y_1\) is a T\(\times 1\) matrix of outcomes for the treated unit, \(Y_0\) is a T\(\times\)(number of controls) matrix of outcomes for the controls, and T is the number of periods. We include the pre-intervention period because it is a useful diagnostic check, since \(\delta_t=0\) prior to the intervention. In the lower panel is violence by ETA, where violence is defined as the sum of killings and kidnappings.

I hope that my version of the analysis have given some ideas for how the synthetic control method can be visualised and presented using a tidyverse approach.

References

Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93 (1): 113132. https://doi.org/10.1257/000282803321455188.
Wickham, Hadley, Mine C-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Second edition. Beijing ; Sebastopol, CA: O’Reilly.

Footnotes

  1. This research is supported by an Australian Government Research Training Program (RTP) Scholarship.↩︎