Using Free Geospatial Tools and Data Part 9: Maintaining your Geospatial Database

Now that you have loaded a lot of data into your geospatial database, we should take a little bit of time to discuss how to manage it and keep it running smoothly.  This can make the difference between requests timing out and data coming back almost instantaneously.

Database Tuning

There are a large number of sources online that go over how to configure PostgreSQL for maximum performance.  The PostgreSQL team itself provides such documentation on its Tuning Your PostgreSQL Server wiki page and lists many more techniques on its Performance Optimization wiki page.

For the PostGIS side of things, Boundless has a workshop page titled Tuning Postgres for Spatial that provides some information on configuring for spatial operations.  The PostGIS team also has some tips that can be found at this link.

Another tool is the pgtune utility.  This is a command-line tool that lets you specify what you will be using your database for and what type of machine it is running on.  It then suggests several settings that you can put in your PostgreSQL configuration file.
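
For a sense of what that looks like, the settings pgtune suggests are ordinary parameters in postgresql.conf.  The values below are purely illustrative for a machine with a few gigabytes of RAM, not recommendations; generate numbers for your own hardware with pgtune or work from the wiki pages above:

# Illustrative postgresql.conf values only -- the right numbers depend on
# your RAM, workload, and PostgreSQL version
shared_buffers = 2GB                # memory PostgreSQL keeps for its own page cache
effective_cache_size = 6GB          # planner hint: how much the OS and database can cache
work_mem = 64MB                     # per-sort/per-hash memory used while running queries
maintenance_work_mem = 512MB        # used by VACUUM, CREATE INDEX, and CLUSTER
checkpoint_completion_target = 0.9  # spread checkpoint I/O out over time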

Table Maintenance

In general, the main things you should consider are creating indices, vacuuming your database to remove unused space, keeping your database statistics up to date, and clustering your tables so that data that are spatially near each other are also physically near each other on disk.

To perform the following commands, open a command prompt and run the command:

psql -d Census_2013

You should then see output similar to the following:

[bmaddox@girls ~]$ psql -d Census_2013 
psql (9.2.7)
Type "help" for help.

Census_2013=#

If you followed the commands in this series, your database tables should already have a spatial index on the geom column, which lets PostGIS quickly find only the data you have requested.  To see this, first run the command:

\dt

This will give you a list of all of the tables in your database.  On my system, I get the output below:

Census_2013=# \dt
 List of relations
 Schema | Name | Type | Owner 
--------+-----------------------------------------+-------+---------
 public | county_outlines | table | bmaddox
 public | spatial_ref_sys | table | bmaddox
 public | state_outlines | table | bmaddox
 public | us_113_congress_districts | table | bmaddox
 public | us_alaska_native_regional_corporations | table | bmaddox
 public | us_area_landmarks | table | bmaddox
 public | us_area_water | table | bmaddox
 public | us_census_block_groups | table | bmaddox
 public | us_coastlines | table | bmaddox
 public | us_combined_new_england_city_town_areas | table | bmaddox
 public | us_combined_statistical_areas | table | bmaddox
 public | us_elementary_school_districts | table | bmaddox
 public | us_indian_alaska_hawaii_native_areas | table | bmaddox
 public | us_indian_tribal_subdivisions | table | bmaddox
 public | us_linear_water | table | bmaddox
 public | us_metro_micropolitan_statistical_areas | table | bmaddox
 public | us_military_areas | table | bmaddox
 public | us_new_england_city_town_areas | table | bmaddox
 public | us_new_england_city_town_divisions | table | bmaddox
 public | us_primary_roads | table | bmaddox
 public | us_primary_secondary_roads | table | bmaddox
 public | us_rails | table | bmaddox
 public | us_roads | table | bmaddox
 public | us_secondard_school_districts | table | bmaddox
 public | us_state_legislative_lower | table | bmaddox
 public | us_state_legislative_upper | table | bmaddox
 public | us_tribal_block_groups | table | bmaddox
 public | us_unified_school_districts | table | bmaddox
 public | us_urban_areas_2010 | table | bmaddox
 public | us_zip_code_areas | table | bmaddox
(30 rows)

Census_2013=#

To get further details on a table, including the indices, run the following:

Census_2013=# \d us_zip_code_areas
 Table "public.us_zip_code_areas"
 Column | Type | Modifiers 
------------+-----------------------------+------------------------------------------------------------------
 gid | integer | not null default nextval('us_zip_code_areas_gid_seq'::regclass)
 zcta5ce10 | character varying(5) | 
 geoid10 | character varying(5) | 
 classfp10 | character varying(2) | 
 mtfcc10 | character varying(5) | 
 funcstat10 | character varying(1) | 
 aland10 | double precision | 
 awater10 | double precision | 
 intptlat10 | character varying(11) | 
 intptlon10 | character varying(12) | 
 geom | geometry(MultiPolygon,4269) | 
Indexes:
 "us_zip_code_areas_pkey" PRIMARY KEY, btree (gid)
 "us_zip_code_areas_geom_gist" gist (geom)

Census_2013=#

The geospatial index here is us_zip_code_areas_geom_gist, while the standard primary key index is us_zip_code_areas_pkey.
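
If you ever come across a spatial table that is missing its GiST index (for example, one loaded by a tool that did not create one), you can add it yourself.  This is just a sketch using us_zip_code_areas and its geom column as the example; substitute your own table and index names:

-- Only needed when a table has no spatial index yet (the loads earlier in this series already created one)
CREATE INDEX us_zip_code_areas_geom_gist ON us_zip_code_areas USING GIST (geom);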

When you create a new index, or perform a large number of inserts, updates, or deletes, you generally need to update the database’s statistics on that table as well as clear out any unused space.  Pick one of your tables and run the following command, noting that depending on the size of the table (us_roads, for example), it can take a while to complete:

Census_2013=# vacuum analyze us_zip_code_areas;
VACUUM
Census_2013=#

This command tells PostgreSQL to clean up any unused space and update statistics on the us_zip_code_areas table.  These statistics are used internally by the database when it runs user queries on the data.
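
If you want to see the planner put those statistics and the spatial index to work, prefix a query with EXPLAIN.  The bounding box below is just a made-up example; with the index in place and the statistics current, the plan should use us_zip_code_areas_geom_gist rather than sequentially scanning the whole table:

-- && tests bounding-box overlap and can use the GiST index; the envelope below is an arbitrary NAD83 box
EXPLAIN SELECT zcta5ce10
FROM us_zip_code_areas
WHERE geom && ST_MakeEnvelope(-77.5, 38.5, -77.0, 39.0, 4269);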

The next thing you should do is cluster the data.  Clustering is an operation performed against a database index; it physically rewrites the table so that rows that are close to each other in the index are stored in the same area on the hard drive.  For geospatial data, this can make a huge difference, as the database does not have to scan multiple areas of the disk to return the data your GIS requests.  To cluster the data on the geospatial index, run the following command next, again noting that this could take a while to finish depending on the size of your database and your hardware:

Census_2013=# cluster us_zip_code_areas using us_zip_code_areas_geom_gist ;
CLUSTER
Census_2013=#

Once finished, run the table describe command (\d) again:

Census_2013=# \d us_zip_code_areas
 Table "public.us_zip_code_areas"
 Column | Type | Modifiers 
------------+-----------------------------+------------------------------------------------------------------
 gid | integer | not null default nextval('us_zip_code_areas_gid_seq'::regclass)
 zcta5ce10 | character varying(5) | 
 geoid10 | character varying(5) | 
 classfp10 | character varying(2) | 
 mtfcc10 | character varying(5) | 
 funcstat10 | character varying(1) | 
 aland10 | double precision | 
 awater10 | double precision | 
 intptlat10 | character varying(11) | 
 intptlon10 | character varying(12) | 
 geom | geometry(MultiPolygon,4269) | 
Indexes:
 "us_zip_code_areas_pkey" PRIMARY KEY, btree (gid)
 "us_zip_code_areas_geom_gist" gist (geom) CLUSTER

Census_2013=#

As you can see, the geom index now has the word CLUSTER behind it, denoting that this is the index that you clustered on.  It will generally make more sense for geospatial databases to cluster on the geospatial index, as that ensures data that are near each other in the real world are near each other on disk.

However, now that you have moved data around on disk, you again need to update the database statistics.  You will not need to vacuum it again, as the cluster command does this itself as it rearranges the data.

Census_2013=# analyze us_zip_code_areas ;
ANALYZE
Census_2013=#
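
Once you have been through this table by table, future maintenance passes do not have to name each table individually.  Run without arguments, CLUSTER re-clusters every table that already has a clustering index set, and VACUUM ANALYZE processes every table you have access to in the current database, so a routine pass can be as simple as the sketch below (which, again, can take a while on a large database):

-- Re-cluster every table that already has a clustering index defined (the CLUSTER marker above)
CLUSTER;
-- Then reclaim unused space and refresh the statistics for every table in the database
VACUUM ANALYZE;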

So there you are.  Running these commands on your data tables, in addition to tuning your database itself, can make a huge performance difference over simply accepting the default options.

Next time we will go over backing up your database and then move on to loading more data, such as the USGS Geographic Names Information System (GNIS), the National Geospatial-Intelligence Agency’s Geonames, and OpenStreetMap.