4 Putting a directory on the map

4.1 UK geography

The principal body of reference for understanding British local boundaries is the Office for National Statistics (ONS). A beginner’s guide to the types of boundaries can be found here. On its website, the ONS states:

There are many different geographic unit types (administrative, health, electoral, postcode etc) and their boundaries frequently do not align. A range of geographies are liable to frequent revisions. Geographies used to produce National Statistics include administrative, UK electoral, census, Eurostat and postal geographies.

When it comes to linking a geographic space to local news items and local news producers, it makes sense to start from the perspective of the news item itself, which is defined by the question: who does this article target? At scale, one can define the geographic audience of a local newspaper by aggregating the audiences of its individual articles (the line between who is and is not part of the audience is inevitably blurry, but choices have to be made and lines drawn). This approach transcends political boundaries and could identify an audience as a city, a village, a region, a trade group within one region, and so on. So far, however, research has linked not the news pieces but the news producers to a location, using the office location as a proxy for the targeted geographic audience. The geographic unit mostly used is the Local Authority District, or LAD (Ramsay & Moore); other studies use postcodes (Gulyas), and for other data I have coordinates (the hyperlocals listed by ICNN).

As an initial test for my news directory, this chapter maps the office location of each outlet to a LAD, allowing a comparison with previous efforts and an assessment of temporal changes in the landscape.

The file used comes from the Open Geography Portal of the Office for National Statistics. The four boundary types (BFE, BFC, BGC, BUC) are intended for different contexts (such as analysis or visualisation), as they offer different levels of precision. The latest guidance as of the time of writing (24 January) can be found here: https://geoportal.statistics.gov.uk/documents/boundary-dataset-guidance-2011-to-2021/explore

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
directory_final <- readRDS("directory_with_hyperlocals.RDS")
lads <- read_csv("./files/Local_Authority_Districts_(December_2022)_Boundaries_UK_BFE.csv")
## Rows: 374 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): LAD22CD, LAD22NM
## dbl (7): OBJECTID, BNG_E, BNG_N, LONG, LAT, SHAPE_Length, SHAPE_Area
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#devtools::install_github("ropensci/PostcodesioR")
library(PostcodesioR)
geo_data <-
  directory_final %>%
  select(Publication,
         id,
         Area,
         City,
         Postcode,
         Address1,
         Address2,
         Coordinates) %>%
  filter_at(vars(Area, City, Postcode, Address1, Address2),
            all_vars(is.na(.))) %>%
  filter(!is.na(Coordinates)) %>%
  separate(Coordinates, into = c("Longitude", "Latitude"), sep = " ")

# request one postcode result per coordinate pair
limit <- rep(1, nrow(geo_data))

# prepare the nested object expected by bulk_reverse_geocoding()
geolocations_list <- list(
  geolocations = data.frame(
    longitude = geo_data$Longitude,
    latitude = geo_data$Latitude,
    limit = limit,
    row.names = geo_data$Publication
  )
)

# retrieve nearest postcode to each coordinate
postcodes <- bulk_reverse_geocoding(geolocations_list)
names(postcodes) <- geo_data$Publication

# function to extract variables of interest from the nested results
bulk_results <- lapply(postcodes, `[[`, "result")
extract_bulk_geo_variable <- function(x) {
  sapply(unlist(bulk_results, recursive = FALSE), `[[`, x)
}

# define the variables you need
variables_of_interest <-
  c("postcode",
    "latitude",
    "longitude",
    "country",
    "admin_district")

# return a data frame with the variables
postcodes <-
  data.frame(sapply(variables_of_interest, extract_bulk_geo_variable)) %>%
  rownames_to_column("Publication")

# making an official geo directory
directory_geo <- 
  directory_final %>%
  select(Publication,
         id,
         Area,
         City,
         Postcode,
         Address1,
         Address2,
         Coordinates) %>%
  left_join(postcodes, by = "Publication") %>%
  mutate(name_to_lad = if_else(is.na(admin_district), Area, admin_district))

# how many observations completely lack a geographic reference?
missing_geo <- filter(directory_geo, is.na(name_to_lad))
#write_delim(missing_geo, "missing_geo.csv", delim = ";")
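The extraction pattern above (pulling named fields out of the nested list returned by `bulk_reverse_geocoding()`) can be illustrated with a mocked response. The shape and field names below are assumptions modelled on the postcodes.io bulk API, with made-up values:

```r
# Mocked object shaped like a postcodes.io bulk reverse-geocoding
# response (hypothetical values; structure assumed from the API).
mock_postcodes <- list(
  list(result = list(list(postcode = "SW1A 1AA", admin_district = "Westminster"))),
  list(result = list(list(postcode = "M1 1AE",  admin_district = "Manchester")))
)

# Same pattern as extract_bulk_geo_variable(): keep the "result"
# elements, drop one level of nesting, then pull a single named field.
bulk_results <- lapply(mock_postcodes, `[[`, "result")
flat_results <- unlist(bulk_results, recursive = FALSE)
districts <- sapply(flat_results, `[[`, "admin_district")
```

With the mock above, `districts` is the character vector `c("Westminster", "Manchester")`.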

While searching for missing postcodes manually, I found that some publications are no longer in existence and that my dataset contained duplicates. After a full verification of the dataset, the title count has gone down to 1,163 (from 1,176).

These are some of the verifications made:

  • Ilkeston Inquirer (inactive)

  • The New Blackmore Vale (duplicate x2)

  • Vale Journal (inactive)

  • Aberdeen Live (duplicate)

  • Barnet Post (duplicate)

  • Bath Telegraph (unfound)

  • Cheltenham Post (duplicate)

  • Felixstowe Extra (unfound)

  • NN Journal (duplicate)

  • Presslocal (SW London) (inactive)

  • Reading Today (duplicate)

  • Stroud Times (duplicate)

# reimport the file in which I have manually compiled the missing postcodes
missing_geo <- read_delim("missing_geo.csv")
## Rows: 114 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): Publication, id, Postcode, Notes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# some are just lost causes (in my manual research I found they were defunct as titles, or that I had duplicates)
pc <- missing_geo %>%
  filter(!Postcode %in% c("duplicate", "inactive", "unfound"))

# let's retrieve admin district for full postcodes
pc_full <- filter(pc, nchar(Postcode) > 4)
pc_list <- list(postcodes = pc_full$Postcode)
bulk_lookup_result <- bulk_postcode_lookup(pc_list)
bulk_list <- lapply(bulk_lookup_result, "[[", 2)
bulk_df <-
  map_dfr(bulk_list,
          `[`,
          c("postcode", "longitude", "latitude", "admin_district")) %>%
  mutate(postcode = str_remove(postcode, " "))

# and for partial ones (outward codes only)
pc_partial <- filter(pc, nchar(Postcode) < 5)
pc_partial_list <- lapply(pc_partial$Postcode, outward_code_lookup)

# extract a single variable from each outward-code result
extract_partial_geo_variable <- function(x) {
  sapply(pc_partial_list, `[[`, x)
}

variables_of_interest <-
  c("outcode", "latitude", "longitude", "admin_district")

partial_postcodes <-
  data.frame(sapply(variables_of_interest, extract_partial_geo_variable)) %>%
  unnest(admin_district) %>%
  mutate_if(is.list, as.character) %>%
  mutate_at(c("latitude", "longitude"), as.numeric)
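The split between full and partial postcodes above relies on a simple length rule: UK outward codes are 2–4 characters, while a full postcode is at least 5. A minimal sketch of that rule, with made-up `Postcode` values as they might appear in the manually compiled file:

```r
# Hypothetical Postcode entries: some full postcodes, some outward
# codes only. nchar() > 4 flags the full ones, as in the code above.
pcs <- c("SW1A 1AA", "GL5", "BD23 1DN", "EH1")
is_full <- nchar(pcs) > 4
```

Here `is_full` is `c(TRUE, FALSE, TRUE, FALSE)`: the full postcodes go to `bulk_postcode_lookup()`, the outward codes to `outward_code_lookup()`.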

# now put it all back together

missings <- 
  missing_geo %>%
  left_join(bulk_df, by = c("Postcode" = "postcode")) %>%
  left_join(partial_postcodes, by = c("Postcode" = "outcode")) %>%
  mutate(admin_district = coalesce(admin_district.x, admin_district.y),
         latitude = coalesce(latitude.x, latitude.y),
         longitude = coalesce(longitude.x, longitude.y)) %>%
  select(-contains(c(".x", ".y"))) %>%
  distinct()
## Warning in left_join(., bulk_df, by = c(Postcode = "postcode")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 10 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
## Warning in left_join(., partial_postcodes, by = c(Postcode = "outcode")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 5 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
directory_geo_final <- 
  directory_geo %>%
  mutate_at(c("latitude", "longitude"), as.numeric) %>%
  left_join(missings, by = c("Publication")) %>%
  mutate(
    admin_district = coalesce(admin_district.x, admin_district.y),
    latitude = coalesce(latitude.x, latitude.y),
    longitude = coalesce(longitude.x, longitude.y),
    Postcode = coalesce(Postcode.x, Postcode.y),
    postcode = coalesce(Postcode, postcode)
  ) %>%
  select(-contains(c(".x", ".y"))) %>%
  distinct() %>%
  mutate(LAD = coalesce(name_to_lad, admin_district)) %>%
  filter(!Postcode %in% c("duplicate", "inactive", "unfound")) %>%
  select(
    -c(
      Area,
      City,
      Address1,
      Address2,
      country,
      name_to_lad,
      admin_district,
      Notes,
      Postcode,
      Coordinates
    )
  )
## Warning in left_join(., missings, by = c("Publication")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 63 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
recount <-
  directory_geo_final %>% distinct(Publication) # 1163 titles

saveRDS(directory_geo_final, "directory_geo_final.rds")
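The repeated `coalesce()` calls above implement a simple fallback: take the first non-missing value across the `.x`/`.y` columns produced by the joins. In base R the same logic is a one-line helper, shown with toy vectors:

```r
# Base-R equivalent of dplyr::coalesce() for two vectors:
# keep x where present, fall back to y where x is NA.
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

a <- c("Dorset", NA, "Mid Ulster")
b <- c(NA, "Fife", "Derry City and Strabane")
merged <- coalesce2(a, b)
```

`merged` is `c("Dorset", "Fife", "Mid Ulster")`: each position takes the first available value.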

4.2 Analysis

# cleaning up names, some were old, some were laid out differently to the official LAD file from ONS
directory_geo_final <- 
  directory_geo_final %>%
  mutate(
    LAD = case_match(
      LAD,
      "Kingston" ~ "Kingston upon Thames",
      "Ards" ~ "Ards and North Down",
      "Blackburn" ~ "Blackburn with Darwen",
      "Armagh" ~ "Armagh City, Banbridge and Craigavon",
      "Bournemouth" ~ "Bournemouth, Christchurch and Poole",
      "Fermanagh" ~ "Fermanagh and Omagh",
      "Herefordshire" ~ "Herefordshire, County of",
      "Lisburn" ~ "Lisburn and Castlereagh",
      "Northampton" ~ "North Northamptonshire",
      "City of Glasgow" ~ "Glasgow City",
      "Ballymena" ~ "Antrim and Newtownabbey",
      "Antrim" ~ "Antrim and Newtownabbey",
      "Shepway" ~ "Folkestone and Hythe",
      "Ballymoney" ~ "Antrim and Newtownabbey", 
      "Coleraine" ~ "Derry City and Strabane",
      "Banbridge" ~ "Armagh City, Banbridge and Craigavon", 
      "Aylesbury Vale" ~ "Buckinghamshire", 
      "West Dorset" ~ "Dorset",
      "City of Bristol" ~ "Bristol, City of",
      "Wycombe" ~ "Buckinghamshire",
      "St Edmundsbury" ~ "West Suffolk",
      "Larne" ~ "Antrim and Newtownabbey", 
      "North Down" ~ "Ards and North Down", 
      "Daventry" ~ "West Northamptonshire",
      "Derry/Londonderry" ~ "Derry City and Strabane",
      "Down" ~ "Newry, Mourne and Down",
      "Dungannon" ~ "Mid Ulster", 
      "City of Kingston upon Hull" ~ "Kingston upon Hull, City of",
      "Limavady" ~ "Causeway Coast and Glens", 
      "Craigavon" ~ "Armagh City, Banbridge and Craigavon",
      "Cookstown" ~ "Mid Ulster", 
      "Forest Heath" ~ "West Suffolk",
      "Newry and Mourne" ~ "Newry, Mourne and Down", 
      "Kettering" ~ "North Northamptonshire",
      "Richmond-upon-Thames" ~ "Richmond upon Thames", 
      "Taunton Deane" ~ "Somerset West and Taunton",
      "St Helens" ~ "St. Helens", 
      "Western Isles" ~ "Na h-Eileanan Siar",
      "North Dorset" ~ "Dorset", 
      "Strabane" ~ "Derry City and Strabane",
      "Purbeck" ~ "Dorset",
      "PL19 0HE" ~ "West Devon", 
      "Omagh" ~ "Fermanagh and Omagh",
      "Suffolk Coastal" ~ "East Suffolk",
      "West Somerset" ~ "Somerset West and Taunton", 
      "Bristol" ~ "Bristol, City of",
      "Waveney" ~ "East Suffolk",
      .default = LAD
    )
  )

saveRDS(directory_geo_final, "directory_geo_final.rds")
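The `case_match()` block above is essentially a recode table. An equivalent way to express it is a named lookup vector, shown here with a toy subset of the mappings:

```r
# Named lookup vector: old LAD name -> current ONS LAD name
# (a subset of the recodes above, for illustration only).
recode_map <- c(
  "Kingston" = "Kingston upon Thames",
  "Shepway"  = "Folkestone and Hythe"
)
lad <- c("Kingston", "Fife", "Shepway")
# replace where a mapping exists, keep the original name otherwise
lad_clean <- unname(ifelse(lad %in% names(recode_map), recode_map[lad], lad))
```

`lad_clean` is `c("Kingston upon Thames", "Fife", "Folkestone and Hythe")`; names absent from the table pass through unchanged, like `.default = LAD`.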

no_titles_lads <- lads %>%
  anti_join(directory_geo_final, by = c("LAD22NM" = "LAD")) %>%
  distinct(LAD22NM) %>%
  drop_na()

titles_by_lad <- directory_geo_final %>% 
  count(LAD, sort = TRUE)

write_csv(titles_by_lad, "titles_by_lad.csv")
mean(titles_by_lad$n)
## [1] 4.037162
median(titles_by_lad$n)
## [1] 3
sd(titles_by_lad$n)
## [1] 3.728312
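The `anti_join()` above asks which boundary-file LADs have no titles in the directory. Working with just the name columns, base R's `setdiff()` expresses the same question, sketched here with hypothetical vectors:

```r
# Hypothetical LAD names from the boundary file and from the directory
boundary_lads  <- c("Fife", "Dorset", "Rutland", "Mid Ulster")
directory_lads <- c("Fife", "Dorset", "Mid Ulster")

# LADs in the boundary file with no matching titles in the directory
no_titles <- setdiff(boundary_lads, directory_lads)
```

Here `no_titles` is `"Rutland"`, i.e. the local news desert in this toy example.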

4.3 Visualisation

# library(geojsonio)
# spdf <- geojson_read("./files/Local_Authority_Districts_(December_2022)_Boundaries_UK_BUC.geojson",  what = "sp")
# 
# # 'fortify' the data to get a dataframe format required by ggplot2
# library(broom)
# spdf_fortified <- tidy(spdf)
# 
# # Plot it
# library(ggplot2)
# ggplot() +
#   geom_polygon(data = spdf_fortified, aes( x = long, y = lat, group = group, fill=group), color="white") +
#   theme_void() +
#   theme(legend.position = "none")+
#   coord_map()

# I will carry out the visualisation in QGIS instead