At my day job I am working on some natural language processing and need to generate a list of place names to further train the excellent spaCy library. I had previously imported the full Planet OSM dataset, so I went there to pull a list of places. However, place names in OSM are typically recorded in the language of the person who collected them, so they can be anything from English to Arabic. I stored the OSM data using imposm3 and included a PostgreSQL hstore column to hold all of the user tags so we would not lose any data. I searched for all tags whose keys contained name and en, then exported those keys and values to several CSV files based on the points, lines, and polygons tables. I thought I would write a quick post to show how easy it can be to manipulate data outside of traditional spreadsheet software.
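The export itself is out of scope for this post, but it looked roughly like the sketch below. Treat it as illustrative only: the connection string, the osm_lines table name, and the handful of hstore keys are stand-ins, since my real queries pulled twenty name variants across the points, lines, and polygons tables.

import pandas as pd
import psycopg2

# placeholder connection details for the imposm3 PostgreSQL database
conn = psycopg2.connect('dbname=osm user=postgres')

# pull the name plus any hstore tag that looks like an English name
# variant; the hstore '?|' operator tests for any of the listed keys
query = """
    SELECT name, tags -> 'name:en' AS name_en
    FROM osm_lines
    WHERE name IS NOT NULL
       OR tags ?| ARRAY['name:en', 'alt_name:en', 'official_name:en'];
"""

df = pd.read_sql(query, conn)
df.to_csv('osm_place_lines.csv', index=False)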
The next thing I needed to do was some data reduction, so I turned to my go-to library, Pandas. If you have been living under a rock and have not heard of it, Pandas is an exceptional data processing library that lets you easily manipulate data from Python. I knew some of my data rows were empty and that I would have duplicates due to how things get named in OSM, and Pandas makes cleaning up both problems incredibly easy.
First I needed to load the files into Pandas to begin cleaning things up. My personal preference for a Python interpreter is ipython/jupyter in a console window. To do this I ran ipython and then imported Pandas by doing the following:
In [1]: import pandas as pd
Next I needed to load up the CSV into Pandas to start manipulating the data.
In [2]: df = pd.read_csv('osm_place_lines.csv', low_memory=False)
At this point, I could examine how many columns and rows I have by running:
In [3]: df.shape
Out[3]: (611092, 20)
Here we can see that I have 611,092 rows and 20 columns. My original query pulled a lot of columns because I wanted to try to capture as many pre-translated English names as I could. To see what all of the column names are, I just had to run:
In [10]: df.columns
Out[10]:
Index(['name', 'alt_name_1_en', 'alt_name_en', 'alt_name_en_2',
       'alt_name_en_3', 'alt_name_en_translation', 'en_name',
       'gns_n_eng_full_name', 'name_en', 'name_ena', 'name_en1', 'name_en2',
       'name_en3', 'name_en4', 'name_en5', 'name_en6', 'nam_en',
       'nat_name_en', 'official_name_en', 'place_name_en'],
      dtype='object')
The first task was to drop any rows that had no values in them. Pandas reads empty cells in as NaN values, so to drop all of the completely empty rows, I just had to run:
In [4]: df = df.dropna(how='all')
To see how many rows fell out, I again checked the shape of the data.
In [5]: df.shape
Out[5]: (259564, 20)
Here we can see that the CSV had 351,528 completely empty rows, where a line had neither a name nor any English name translation.
Next, I assumed I had some duplicates in the data. Many things in OSM get generic names, so those can be filtered down to a single occurrence each, since I only want the first row from each set of duplicates. Called with no options, drop_duplicates() in Pandas keeps only the first of each group of identical rows.
In [6]: df = df.drop_duplicates()
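As an aside, drop_duplicates() can also compare just a subset of columns. If I had only cared about unique place names rather than fully identical rows, a variant like this (illustrative, not something I ran here) would do it:

# keep the first row for each distinct value in the 'name' column
df = df.drop_duplicates(subset=['name'], keep='first')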
Checking the shape again, I can see that I had 68,131 rows of duplicated data.
In [7]: df.shape
Out[7]: (191433, 20)
At this point I was interested in how many cells in each column still contained no data. The CSV was already sparse since I had converted each hstore key into a separate column in my output. To count the nulls per column, I ran:
In [8]: df.isna().sum()
Out[8]:
name                          188
alt_name_1_en              191432
alt_name_en                190310
alt_name_en_2              191432
alt_name_en_3              191432
alt_name_en_translation    191432
en_name                    191430
gns_n_eng_full_name        191432
name_en                    191430
name_ena                   172805
name_en1                   191409
name_en2                   191423
name_en3                   191429
name_en4                   191430
name_en5                   191432
name_en6                   191432
nam_en                     191432
nat_name_en                191431
official_name_en           191427
place_name_en              191429
dtype: int64
Here we can see the sparseness of the data. Considering I am now down to 191,433 rows, some of the columns have only a single entry in them. We can also see that I am probably not going to have a lot of English translations to work with.
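A quick way to make that sparseness concrete (an illustration, not part of my original session) is to look at the share of rows that actually have a value in each column:

# fraction of non-null cells per column, most complete columns first
(df.notna().sum() / len(df)).sort_values(ascending=False)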
At this point I wanted to save the modified dataset so I would not lose it. This was a simple:
In [8]: df.to_csv('osm_place_lines_nonull.csv', index=False)
The index=False option tells Pandas to not output its internal index field to the CSV.
Now I was curious what things looked like, so I decided to check out the name column. First I increased some of Pandas's display defaults because I did not want it to truncate rows or columns.
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 25)
To view every row where the value in a specific column is null, I did the following (I will abbreviate the output to keep the blog shorter 🙂):
df[df['name'].isnull()]
...
        name_en                      name_ena name_en1 name_en2  \
166         NaN             Orlovskogo Island      NaN      NaN
129815      NaN                 Puukii Island      NaN      NaN
159327      NaN                Ometepe Island      NaN      NaN
162420      NaN                       Tortuga      NaN      NaN
164834      NaN              Jack Adan Island      NaN      NaN
191664      NaN                 Hay Felistine      NaN      NaN
193854      NaN  Alborán Island Military Base      NaN      NaN
197893      NaN              Carabelos Island      NaN      NaN
219472      NaN                Little Fastnet      NaN      NaN
219473      NaN                  Fastnet Rock      NaN      NaN
220004      NaN                Doonmanus Rock      NaN      NaN
220945      NaN                  Tootoge Rock      NaN      NaN
229446      NaN                    Achallader      NaN      NaN
238355      NaN                 Ulwile Island      NaN      NaN
238368      NaN                  Mvuna Island      NaN      NaN
238369      NaN                 Lupita Island      NaN      NaN
238370      NaN                   Mvuna Rocks      NaN      NaN
259080      NaN                       Kafouri      NaN      NaN
259235      NaN                   Al Thawra 8      NaN      NaN
259256      NaN                   Beit al-Mal      NaN      NaN
261584      NaN                        Al Fao      NaN      NaN
262200      NaN                       May 1st      NaN      NaN
...
Now that I have an idea how things look, I can do things like fill in the missing names using the English names found in the various other columns.
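As a rough sketch of that next step, I could coalesce the English variants into a single column and use it to backfill name. The preference order of the columns and the best_en column name are my own assumptions here:

# scan the English name columns left to right and take the first
# non-null value in each row; bfill(axis=1) pulls values leftward
en_cols = ['name_en', 'en_name', 'alt_name_en', 'official_name_en',
           'place_name_en', 'gns_n_eng_full_name']
df['best_en'] = df[en_cols].bfill(axis=1).iloc[:, 0]

# fill in missing OSM names from the coalesced English column
df['name'] = df['name'].fillna(df['best_en'])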