Developing A Basic Map Data Pipeline

This post describes my work setting up a mapping pipeline. I was initially inspired by Eric Theise’s presentation on creating a Geostack. However, the tutorial was a few years old, and I had trouble setting up some of the applications on my Ubuntu 18.04 machine. I took this as an opportunity to use Docker containers and, specifically, docker-compose to run and coordinate multiple containers.

This post covers everything from downloading free map data, through creating and styling map layers, to rendering the map dynamically in a browser. I chose to create a map of Pittsburgh, but the process works with any location. The open source ecosystem supporting map data processing and visualization is large. I only use a subset of tools in this pipeline, and there are likely alternatives at each step.

Pipeline Components

PostGIS

The core components of the pipeline are a database to store the map data, a program to process the map data into vector tiles, and a web server to serve the files.

For the database, we will be using the open source database PostgreSQL. PostgreSQL is then spatially enabled with PostGIS, a PostgreSQL extension that adheres to the OpenGIS Simple Features Specification for SQL. A typical database is optimized for processing numeric and character data types. The PostGIS extension enables additional functionality for storing and querying data that represents objects defined in geometric space.

For more information on configuring PostgreSQL with PostGIS, refer to the overview of kartoza’s docker-postgis container, which is used in this pipeline.

TileMill

For processing the mapping data, we will use MapBox’s TileMill. TileMill is a tool used to design maps for the web using custom data. TileMill can export files using the SQLite-based MBTiles file format. These maps are always projected to the “Web Mercator” projection, which can be displayed in the browser using tools like Leaflet.

This application has been deprecated in favor of MapBox Studio, but it can be run in a docker container and it has enough basic functionality for the purposes of this project.

Getting Data

Our primary source of data will be from the OpenStreetMap project. This is a long-running open map project built on user-contributed data.

Interline provides city- and region-sized extracts of the entire OpenStreetMap planet. These are updated daily. You need to sign up for a free account to get an API token in order to download data. In this example, I’m creating a map of Pittsburgh, so I’m going to download the extract for Pittsburgh.

If you are interested in other datasets, LearnOSM has a guide to even more ways to get OSM data.

Running the Pipeline

Now that we have the data downloaded, we want to use our pipeline to process it. To get PostgreSQL/PostGIS and TileMill up and running quickly, we will use Hans Meine’s tilemill_docker repository.

Clone the repository and navigate to the osm-bright directory. Inside this directory is the docker-compose file, which defines and configures the containers that will be run. We can see that the kartoza/postgis docker image will be loaded, as well as the hansmeine/osm-bright docker image.

Also take note of the PostgreSQL environment variables. We’ll use these later on to extract data from the database into TileMill.

PGHOST=postgis
PGDATABASE=gis
PGUSER=docker
PGPASSWORD=docker

From the directory, run docker-compose up to start the services.
cd osm-bright
docker-compose up

You can run docker container ls to verify that the containers are up and running.

The two containers can now talk to each other. You should be able to connect to http://localhost:20009 and see the TileMill GUI.

Loading the Data into PostGIS

OSM data comes in two formats, XML and PBF (Protocolbuffer Binary Format). The download from Interline will have the .osm.pbf file extension.
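Before copying a large extract around, it can be handy to confirm that the download really is a PBF file and not, say, an HTML error page saved under the wrong name. The following is a minimal heuristic sketch, not a full parser: it relies on the documented PBF layout, in which the file starts with a 4-byte big-endian length followed by a BlobHeader whose type is "OSMHeader". The function name looks_like_osm_pbf is hypothetical.

```python
import struct

# The PBF format requires a BlobHeader to be smaller than 64 KiB.
MAX_BLOB_HEADER_SIZE = 64 * 1024


def looks_like_osm_pbf(path):
    """Heuristically check whether a file starts like an OSM PBF extract."""
    with open(path, "rb") as f:
        prefix = f.read(4)
        if len(prefix) < 4:
            return False
        # First 4 bytes: length of the BlobHeader in network byte order.
        (header_len,) = struct.unpack(">I", prefix)
        if header_len == 0 or header_len >= MAX_BLOB_HEADER_SIZE:
            return False
        header = f.read(header_len)
    # The first block of a valid file is of type "OSMHeader"; the string is
    # stored verbatim inside the serialized BlobHeader message.
    return len(header) == header_len and b"OSMHeader" in header
```

Calling looks_like_osm_pbf("pittsburgh_pennsylvania.osm.pbf") should return True for a healthy download. A True result only means the first block looks right; it does not validate the rest of the file.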

First, we want to copy the data from our host machine into the tilemill docker container. We’ll use the docker cp command to copy the OSM extract (pittsburgh_pennsylvania.osm.pbf) to the root directory of the tilemill container.

docker cp ~/Documents/MapBox/pittsburgh_pennsylvania.osm.pbf tilemill:/pitt.osm.pbf

Next, we use imposm to import the OpenStreetMap data into the PostgreSQL/PostGIS database. We will use docker exec to run a command in the tilemill container that imports the pitt.osm.pbf data into the database named gis. The database name, host, username, and password all come from the environment variables stored in the docker-compose.yaml file discussed previously.

The full command is:
docker exec tilemill imposm -m /osm-bright/imposm-mapping.py --connection postgis://docker:docker@postgis/gis --read --write --optimize --deploy-production-tables /pitt.osm.pbf
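The same four settings show up in two different shapes in this pipeline: as a URL in the imposm --connection flag, and as a key=value string in the TileMill layer dialog later on. As a small sketch of how they relate, the helper below builds both from the PG* environment variables, falling back to the values from the docker-compose file. The function names are hypothetical and not part of imposm or TileMill.

```python
import os


def pg_settings(env=None):
    """Read the PostgreSQL settings, defaulting to the docker-compose values."""
    env = os.environ if env is None else env
    return {
        "host": env.get("PGHOST", "postgis"),
        "dbname": env.get("PGDATABASE", "gis"),
        "user": env.get("PGUSER", "docker"),
        "password": env.get("PGPASSWORD", "docker"),
    }


def imposm_connection_url(settings):
    """URL form, as passed to imposm's --connection flag."""
    return "postgis://{user}:{password}@{host}/{dbname}".format(**settings)


def tilemill_connection_string(settings):
    """key=value form, as typed into the TileMill connection field."""
    return "dbname={dbname} host={host} user={user} password={password}".format(**settings)


settings = pg_settings()
print(imposm_connection_url(settings))
print(tilemill_connection_string(settings))
```

This sketch does no escaping, so a password containing characters such as @ or spaces would need to be encoded before it is placed in either string.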

Imposm creates about a dozen tables for the most important features. Default tables are categorized as point (such as places), polygon (such as water areas), or linestring (such as motorways) tables.

We can log into the postgis container to examine the database. First, log into the container with an interactive session using the following command:

docker exec -tiu postgres postgis psql

Connect to the gis database from the command line by typing:
\c gis

The \d command can then be run to describe the database. This displays a list of the different table names and types.

It’s also possible to look at the schema of individual tables. To examine the osm_mainroads table, run:
\d osm_mainroads

The output appears below.

You can view the mapping from OSM data types to imposm tables here.

Styling the Map Data

Now that the data is loaded into the PostGIS database, we will use TileMill to load and style the data. Once we are happy with the map content and styling, we will export the data.

First, we can connect to the TileMill container and load the GUI on our host machine by navigating to http://localhost:20009 in the browser. Create a new project using the form.

Now we want to start loading some data. We will start with adding a layer for the motorway data.

For the connection field, we will populate it using the PostGIS database information.
dbname=gis host=postgis user=docker password=docker

For the table field, we will just load the entire table, osm_motorways. The unique key field will be osm_id. When adding additional data, we will keep the connection and unique key field the same and just update the table field.

Save the query and navigate to the style.mss file in the side panel. Here we will enter some basic styling parameters to view the content we just loaded. Add the following code block to the end of the file and save.

#osmmotorways {
line-width:1;
line-color:#69ab1a;
}

Now navigate and zoom into the Pittsburgh region and the roads should be visible.

We can repeat this process, creating layers for the osm_minorroads and osm_mainroads tables and using the following styling blocks:

#osmminorroads{
line-width: 1;
line-color: #168;
}

#osmmainroads {
line-width:1;
line-color:#bd5e52;
}

The resulting image is shown below.

In its current form, the styling is fairly generic. All of the different subclasses of data are styled the same way, and the styling doesn’t change with the zoom level.

We can create a more specific set of styling commands that distinguishes between the different subclasses and changes the style based on different zoom levels.

Delete the code blocks added to the style.mss file. Click on the + to create a new Carto stylesheet and name it roads.mss. Copy the content from this file roads.mss and save.

Looking at the stylesheet, we can see that the mainroads data is broken down into the primary, secondary, and tertiary types. Then, based on the zoom level, the line-width and line-color properties are set. This distinguishes the different types by color and also varies the thickness of the line with the zoom level to create a dynamic map.

The map with the new stylesheet should look similar to the image below:

We can follow a similar process to load and view water data. We create new layers for the osm_waterareas and osm_waterways tables and style them using the water.mss file.

The updated map with the water features is shown below:

Lastly, we can visualize the name field in each table to label the different features of the map. For this example, we will just label the road features. Since we want to style the labels differently from the features, we will create a new layer for the labels and style the labels specifically.

We’ll start by creating a new layer for each of the road data tables: motorways, mainroads, and minorroads. This time, we will use a SQL query to extract only a subset of columns from the table. In the table field, use the following query to create a label layer for the motorways table.

(SELECT name, osm_id, type, geometry
FROM osm_motorways)
as motorway_text

Use a similar query for the other road data tables. Next, update the roads.mss style sheet with styling for these specific layers using this file as a reference.

Save the style and the updated map should appear as below.

Exporting the Map Data

Once all the map layers are loaded and styled, we can export the content into MBTiles. MBTiles is an open specification for storing map tiles in a SQLite database.
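Because an MBTiles file is an ordinary SQLite database, it can be inspected with Python’s built-in sqlite3 module once it has been exported. The sketch below assumes the file follows the MBTiles specification, which defines a metadata table (name, value) and a tiles table or view (zoom_level, tile_column, tile_row, tile_data). The function name summarize_mbtiles is hypothetical.

```python
import sqlite3


def summarize_mbtiles(path):
    """Return (metadata dict, {zoom_level: tile_count}) for an MBTiles file."""
    conn = sqlite3.connect(path)
    try:
        # metadata holds key/value pairs such as name, bounds, minzoom, maxzoom.
        metadata = dict(conn.execute("SELECT name, value FROM metadata"))
        # tiles may be a table or a view; a SELECT works the same either way.
        counts = dict(conn.execute(
            "SELECT zoom_level, COUNT(*) FROM tiles "
            "GROUP BY zoom_level ORDER BY zoom_level"
        ))
    finally:
        conn.close()
    return metadata, counts
```

Running summarize_mbtiles("pitt_from_osm.mbtiles") on the exported file is a quick way to confirm that the zoom levels and bounding box match what was selected in the export dialog.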

Using the export GUI, set the bounding box dimensions and zoom levels you want to extract and export the files.

The exported file, for example pitt_from_osm.mbtiles, is now available in the tilemill docker container.

We can copy the file to our host machine using the docker cp command.

docker cp tilemill:/root/Documents/MapBox/export/pitt_from_osm.mbtiles .

Extracting and Serving the Data

Any web server can serve tiles as individual image files, organized into z/x/y subdirectories. MBUtil is a utility for importing and exporting the MBTiles format, typically created with Mapbox TileMill.

We’ll use mbutil to export our mbtiles file to a directory of files.

docker exec -tiu root tilemill bash
apt-get install python-setuptools
easy_install mbutil

Once installed, run the primary mbutil command:

mb-util pitt_from_osm.mbtiles ./tiles/

Once the export process is finished, you can view the tiles directory. It contains a folder for each of the exported zoom levels. Each zoom-level folder contains .png images of the map, organized by their x and y tile coordinates.
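To find the tile that covers a particular place, the standard Web Mercator tile formula converts a latitude and longitude into x and y indices at a given zoom level. The sketch below assumes the directory uses the z/x/y.png layout with XYZ numbering (y counted from the top), which I believe is mb-util’s default output scheme. The function names and the "tiles" root are illustrative.

```python
import math


def latlon_to_tile(lat, lon, zoom):
    """Convert WGS84 coordinates to XYZ tile indices at the given zoom."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the map edge still fall inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_path(lat, lon, zoom, root="tiles"):
    """Return the relative path of the tile image covering a point."""
    x, y = latlon_to_tile(lat, lon, zoom)
    return "{}/{}/{}/{}.png".format(root, zoom, x, y)


# Approximate coordinates for downtown Pittsburgh.
print(tile_path(40.4406, -79.9959, 12))
```

If the tiles were exported with TMS numbering instead, the y index is flipped: y_tms = 2**zoom - 1 - y.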

Python has a built-in HTTP server that provides standard GET and HEAD request handlers. This doesn’t require any additional installations or configuration and is useful when you want to turn a directory into a quick web server.

python -m SimpleHTTPServer 8887

We specify port 8887 since the tilemill container was configured with this port open.
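The SimpleHTTPServer module exists only in Python 2. On Python 3 the same functionality lives in http.server, so the one-liner becomes python3 -m http.server 8887. For anyone who prefers a script, the following is a minimal sketch, assuming Python 3.7 or newer is available wherever the tiles are served from. The helper name make_tile_server is hypothetical.

```python
import functools
import http.server


def make_tile_server(directory=".", port=8887, bind=""):
    """Create (but do not start) a static file server rooted at directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # An empty bind address listens on all interfaces, which is needed for
    # the server to be reachable through a published container port.
    return http.server.ThreadingHTTPServer((bind, port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   server = make_tile_server("tiles", 8887)
#   server.serve_forever()
```

Like the one-liner, this is meant for local experiments only; it has no caching or access control.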

Navigate to http://localhost:8887 and browse the directory until you get to a single tile. The tile is displayed in the web browser as shown below.

Python SimpleHTTPServer Serving Map Images

Viewing the Data

Now that we can serve the mapping tiles, we want to render them in the browser. Leaflet is a leading open-source JavaScript library for mobile-friendly interactive maps.

Eric Theise has some HTML files already pre-built. Clone his geostack-map-pages repository from GitHub.

We’ll make some minor modifications to the static.html file.

Update the setView command with the latitude and longitude values of the area you are mapping.
Update the tileUrl value.
Update the marker coordinates and popup text.
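To make the three changes concrete, here is a sketch that generates a minimal Leaflet page from Python. It is not Eric Theise’s static.html, only an illustration of the same three settings (view center, tile URL, and marker). The tile URL assumes the server was started in the directory that contains tiles/, and the Leaflet version pinned in the CDN links is my own choice.

```python
import json

PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <link rel="stylesheet" href="https://unpkg.com/leaflet@1.9.4/dist/leaflet.css">
  <script src="https://unpkg.com/leaflet@1.9.4/dist/leaflet.js"></script>
  <style>#map { height: 100vh; }</style>
</head>
<body>
  <div id="map"></div>
  <script>
    var center = %(center)s;
    var map = L.map('map').setView(center, %(zoom)d);
    L.tileLayer(%(tile_url)s, {minZoom: %(min_zoom)d, maxZoom: %(max_zoom)d}).addTo(map);
    L.marker(center).addTo(map).bindPopup(%(popup)s);
  </script>
</body>
</html>
"""


def render_page(lat, lon, zoom, tile_url, popup, min_zoom=0, max_zoom=18):
    """Fill in the view center, tile URL, and marker popup."""
    return PAGE_TEMPLATE % {
        # json.dumps produces valid JavaScript literals for these values.
        "center": json.dumps([lat, lon]),
        "zoom": zoom,
        "tile_url": json.dumps(tile_url),
        "min_zoom": min_zoom,
        "max_zoom": max_zoom,
        "popup": json.dumps(popup),
    }


page = render_page(
    40.4406, -79.9959, 12,
    "http://localhost:8887/tiles/{z}/{x}/{y}.png",  # assumed server layout
    "Pittsburgh",
)
```

Writing page to a file such as map.html and opening it in a browser should show the exported tiles, provided minZoom and maxZoom are set to the zoom range that was actually exported.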

Once the updates are made, you can drag the file into your browser and view the map.

Wil Selby
