At my day job I am working on some natural language processing and need to generate a list of place names to further train the excellent spaCy library. I had previously imported the full Planet OSM dataset, so I went there to pull a list of places. However, place names in OSM are typically recorded in the language of the person who collected them, so they can be anything from English to Arabic. I stored the OSM data using imposm3 and included a PostgreSQL hstore column to hold all of the user tags so we would not lose any data. I searched for all tags whose keys contained name and en, then exported those keys and values to several CSV files based on the points, lines, and polygons tables. I thought I would write a quick post to show how easy it can be to manipulate data outside of traditional spreadsheet software.
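The export itself is out of scope for this post, but it looked roughly like the sketch below. Treat it as illustrative only: the connection string, the osm_lines table name, and the handful of hstore keys are stand-ins, since my real queries pulled twenty name variants across the points, lines, and polygons tables.

import pandas as pd
import psycopg2

# placeholder connection details for the imposm3 PostgreSQL database
conn = psycopg2.connect('dbname=osm user=postgres')

# pull the name plus any hstore tag that looks like an English name
# variant; the hstore '?|' operator tests for any of the listed keys
query = """
    SELECT name, tags -> 'name:en' AS name_en
    FROM osm_lines
    WHERE name IS NOT NULL
       OR tags ?| ARRAY['name:en', 'alt_name:en', 'official_name:en'];
"""

df = pd.read_sql(query, conn)
df.to_csv('osm_place_lines.csv', index=False)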
The next thing I needed to do was some data reduction, so I turned to my go-to library, Pandas. If you have been living under a rock and have not heard of it, Pandas is an exceptional data processing library that lets you easily manipulate data from Python. I knew some of my data rows were empty and that I would have duplicates due to how things get named in OSM, and Pandas makes cleaning up both problems incredibly easy.
First I needed to load the files into Pandas to begin cleaning things up. My personal preference for a Python interpreter is ipython/jupyter in a console window. To do this I ran ipython and then imported Pandas by doing the following:
In [1]: import pandas as pd
Next I needed to load up the CSV into Pandas to start manipulating the data.
In [2]: df = pd.read_csv('osm_place_lines.csv', low_memory=False)
At this point, I could examine how many columns and rows I have by running:
In [3]: df.shape
Out[3]: (611092, 20)
Here we can see that I have 611,092 rows and 20 columns. My original query pulled a lot of columns because I wanted to try to capture as many pre-translated English names as I could. To see what all of the column names are, I just had to run:
In [10]: df.columns
Out[10]:
Index(['name', 'alt_name_1_en', 'alt_name_en', 'alt_name_en_2',
       'alt_name_en_3', 'alt_name_en_translation', 'en_name',
       'gns_n_eng_full_name', 'name_en', 'name_ena', 'name_en1', 'name_en2',
       'name_en3', 'name_en4', 'name_en5', 'name_en6', 'nam_en',
       'nat_name_en', 'official_name_en', 'place_name_en'],
      dtype='object')
The first task was to drop any rows that had no values in them. Pandas reads empty cells in as NaN values, so to drop all of the completely empty rows, I just had to run:
In [4]: df = df.dropna(how='all')
To see how many rows fell out, I again checked the shape of the data.
In [5]: df.shape
Out[5]: (259564, 20)
Here we can see that the CSV had 351,528 completely empty rows, where a line had neither a name nor any English name translation.
Next, I assumed I had some duplicates in the data. Many things in OSM get generic names, so those can be filtered down to a single occurrence each, since I only want the first row from each set of duplicates. Called with no options, drop_duplicates() in Pandas keeps only the first of each group of identical rows.
In [6]: df = df.drop_duplicates()
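As an aside, drop_duplicates() can also compare just a subset of columns. If I had only cared about unique place names rather than fully identical rows, a variant like this (illustrative, not something I ran here) would do it:

# keep the first row for each distinct value in the 'name' column
df = df.drop_duplicates(subset=['name'], keep='first')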
Checking the shape again, I can see that I had 68,131 rows of duplicated data.
In [7]: df.shape
Out[7]: (191433, 20)
At this point I was interested in how many cells in each column still contained no data. The CSV was already sparse since I had converted each hstore key into a separate column in my output. To count the nulls per column, I ran:
In [8]: df.isna().sum()
Out[8]:
name                          188
alt_name_1_en              191432
alt_name_en                190310
alt_name_en_2              191432
alt_name_en_3              191432
alt_name_en_translation    191432
en_name                    191430
gns_n_eng_full_name        191432
name_en                    191430
name_ena                   172805
name_en1                   191409
name_en2                   191423
name_en3                   191429
name_en4                   191430
name_en5                   191432
name_en6                   191432
nam_en                     191432
nat_name_en                191431
official_name_en           191427
place_name_en              191429
dtype: int64
Here we can see the sparseness of the data. Considering I am now down to 191,433 rows, some of the columns have only a single entry in them. We can also see that I am probably not going to have a lot of English translations to work with.
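A quick way to make that sparseness concrete (an illustration, not part of my original session) is to look at the share of rows that actually have a value in each column:

# fraction of non-null cells per column, most complete columns first
(df.notna().sum() / len(df)).sort_values(ascending=False)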
At this point I wanted to save the modified dataset so I would not lose it. This was a simple:
In [8]: df.to_csv('osm_place_lines_nonull.csv', index=False)
The index=False option tells Pandas to not output its internal index field to the CSV.
Now I was curious what things looked like, so I decided to check out the name column. First I increased some of Pandas's display defaults because I did not want it to truncate rows or columns.
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 25)
To view every row where the value in a specific column is null, I did the following (I will abbreviate the output to keep the blog shorter 🙂):
df[df['name'].isnull()]
...
        name_en                      name_ena name_en1 name_en2  \
166         NaN             Orlovskogo Island      NaN      NaN
129815      NaN                 Puukii Island      NaN      NaN
159327      NaN                Ometepe Island      NaN      NaN
162420      NaN                       Tortuga      NaN      NaN
164834      NaN              Jack Adan Island      NaN      NaN
191664      NaN                 Hay Felistine      NaN      NaN
193854      NaN  Alborán Island Military Base      NaN      NaN
197893      NaN              Carabelos Island      NaN      NaN
219472      NaN                Little Fastnet      NaN      NaN
219473      NaN                  Fastnet Rock      NaN      NaN
220004      NaN                Doonmanus Rock      NaN      NaN
220945      NaN                  Tootoge Rock      NaN      NaN
229446      NaN                    Achallader      NaN      NaN
238355      NaN                 Ulwile Island      NaN      NaN
238368      NaN                  Mvuna Island      NaN      NaN
238369      NaN                 Lupita Island      NaN      NaN
238370      NaN                   Mvuna Rocks      NaN      NaN
259080      NaN                       Kafouri      NaN      NaN
259235      NaN                   Al Thawra 8      NaN      NaN
259256      NaN                   Beit al-Mal      NaN      NaN
261584      NaN                        Al Fao      NaN      NaN
262200      NaN                       May 1st      NaN      NaN
...
Now that I have an idea how things look, I can do things like fill in the missing names using the English names found in the various other columns.
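As a rough sketch of that next step, I could coalesce the English variants into a single column and use it to backfill name. The preference order of the columns and the best_en column name are my own assumptions here:

# scan the English name columns left to right and take the first
# non-null value in each row; bfill(axis=1) pulls values leftward
en_cols = ['name_en', 'en_name', 'alt_name_en', 'official_name_en',
           'place_name_en', 'gns_n_eng_full_name']
df['best_en'] = df[en_cols].bfill(axis=1).iloc[:, 0]

# fill in missing OSM names from the coalesced English column
df['name'] = df['name'].fillna(df['best_en'])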