Visualizing data from PDFs

I recently had the opportunity to work on a fun project. It was all about extracting data from PDFs and generating visualizations from it. I also needed to create a custom database, with all of the proper tables and indexes, for the data to be stored in. This was so the end product could allow easier criteria selection for displaying different graphs, e.g. comparing different years of data sets. I have always had a lot of fun working with raw data, so I looked forward to this project.

The client for this project needed me to help them extract hunting results from all of the Utah big game hunts, then present this data in clean graphs, allowing easy traversal of the different years and species that the state of Utah provides. The catch is that the state provides this data in PDFs, not even easy HTML tables to scrape the data from. Follow along as I talk about the process I used to go from a PDF to a website that displays the data in nice charts.

I started my career as a software engineer, even though a project like this falls into the data engineering/data science world. Because of my background, I target a certain set of workflows. So even though CSV files turned out to be a good output format for the data extracted from the PDFs, I wanted to get that data into a database. More on this will come up later.

The Data

The data is all provided at Utah Harvest Report.

A screenshot of the data set that I'd be working with.

Fortunately for me the PDFs they provide are clean tabular data sets that should be easy enough to extract once I have a good process for doing so. I don't know about you, but one of the last things that I want to do is copy and paste all of these columns by hand into a spreadsheet.

Extracting the data

Before I found a good, reliable solution, I tried a couple of things that just didn't work. But that is all part of the process: trial and error until you get it figured out. There are so many ways to accomplish a task in the computing world, so it makes sense to pick one that works for your specific project.

Copy and paste

Like most people, I didn't want to do more work than I needed to, to start with. In hopes that the solution would be simple, I just tried a good old copy and paste. And, as you can see, that didn't work out at all.

DB1500  BeaverArchery2261355416329.53.8DB1501  Box ElderArchery18642553011.83.8DB1502  CacheArchery2125097915716.03.8DB1503  Central Mtns, Manti/San RafaelArchery12/16B1574135426719.73.7DB1504  Central Mtns, NeboArchery16A77761516226.43.6DB1505  Chalk Creek/East Canyon/Morgan-South Rich  Archery4/5/61757130416612.83.7DB1506  FillmoreArchery2145739510225.93.9

Instead of having a nice comma (or even tab) separated file, I ended up without any line breaks, and no real spaces between values. So the simple solution, in this case, turned out to be a bust.

But in general I always try the simple solution first, because on occasion I get a solution that works and doesn't take any extra time. When that happens it is a huge win. When it doesn't, I feel I get a deeper immersion in the problem set and make good progress toward finding a solution.

Going too far

Like a programmer, my next thought was to reinvent the wheel. I started digging into the different raw PDF libraries available for Java, Python, and Go, to name a few. I've worked with some PDF libraries in the past to generate PDFs, but hadn't really had any experience pulling data from them.

I figured that I could find a library that would give raw access to the data contained within the document, then work my way toward identifying the lines that contained the data of interest. After enough trial and error, I figured that I could write a specific PDF parser that would work for the different species and years.

But the problem turned out to be even more complex than that. There is a lot one needs to know about fonts, font sizing, and drawing those fonts on screen. I needed to make progress on the problem, which is visualizing this data set, not studying the PDF spec and trying to reinvent the wheel.

Excalibur and Camelot

Two options had failed, so it was time to roll up the research sleeves. After more digging around I found a couple of tools: Excalibur and Camelot, generic solutions for extracting tables from PDFs. Camelot is the underlying tool that extracts the data, while Excalibur provides a nice web-based user interface for managing the documents and data to be extracted.

As this isn't a tutorial on using Excalibur or Camelot, but rather a discussion about the process of this project, check the blog for additional information.

After uploading the document and selecting the pages that I want to extract the data from, I'm taken to a form where I can select the tables I want to work with.

Extracting data from a PDF

After specifying the data that I want I can easily start the extraction process. After the extraction is complete I can preview the data and select the output type. In this case I chose CSV.

"DB1500","Beaver","Archery","22","613","554","163","29.5","3.8"
"DB1501","Box Elder","Archery","1","864","255","30","11.8","3.8"
"DB1502","Cache","Archery","2","1250","979","157","16.0","3.8"
"DB1503","Central Mtns, Manti/San Rafael","Archery","12/16B","1574","1354","267","19.7","3.7"
"DB1504","Central Mtns, Nebo","Archery","16A","777","615","162","26.4","3.6"
"DB1505","Chalk Creek/East Canyon/Morgan-South Rich","Archery","4/5/6","1757","1304","166","12.8","3.7"
"DB1506","Fillmore","Archery","21","457","395","102","25.9","3.9"
"DB1508","Kamas","Archery","7","750","611","72","11.7","3.6"
"DB1509","La Sal, La Sal Mtns","Archery","13A","308","254","97","38.3","3.9"

Choosing the database

If all you need to do is visualize a few numbers from your PDF, building a chart in Excel probably isn't a bad thing, and could be done. However this project, as stated above, is bigger than that. There are multiple data sets that need to be shared with a lot of people, so a database backing a web application is the go-to for this project. It is important to remember that you need to understand the goals of your project, as they inform the decisions that you make along the way.

There are a lot of options out there for databases. At the highest level of distinction there are relational and NoSQL databases to choose from. This project requires a bunch of different queries: different years, different hunt types, and the differences between limited entry and general entry (it is OK if you don't understand that, it is hunting based). Because of these different data sets, it is best to use a relational data store. This allows indexes to be created for the important columns within the data and queries to be run against a varied set of those columns.

Setting up the database indexes

To get started with indexes, you need to understand your data access patterns (or what types of questions you want to ask your data). Once you know what questions you want to ask, you can instruct your database to keep track of the values in specific columns. This will allow your queries to run faster, at the cost of higher data storage.

One other useful aspect of indexes in a database is enforcing uniqueness within a dataset. I used this aspect of indexes so that I could only import one data set for a given hunt number (specified by the state), the year the hunt took place, the hunt type (General Entry, Limited Entry), and the species (elk, deer, etc.). I wanted to be able to reprocess the dataset if needed without duplicating the data that was already imported.
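To make that a bit more concrete, here is a rough sketch of what those indexes could look like if you created them by hand with plain SQL from Go. The table and column names are assumptions based on the struct shown later; in this project the indexes are actually declared through gorm struct tags instead.

import "database/sql"

// createIndexes is a sketch, assuming a "hunts" table with these columns.
func createIndexes(db *sql.DB) error {
    // A plain index to speed up year-based queries, at the cost of extra storage.
    if _, err := db.Exec(`CREATE INDEX IF NOT EXISTS idx_hunt_year ON hunts (hunt_year)`); err != nil {
        return err
    }

    // A composite unique index so the same hunt can only be imported once.
    // The real project also keys on species, which the simplified struct below omits.
    _, err := db.Exec(`CREATE UNIQUE INDEX IF NOT EXISTS uq_number_year_type
        ON hunts (hunt_number, hunt_year, hunt_type)`)
    return err
}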

Using PostgreSQL

In the end I decided on using PostgreSQL, which ended up being less of a technical requirement for this project than an issue with using SQLite. I wanted to deploy the database along with the binary deployment, but the SQLite library built against a modern glibc on Arch didn't play nice with the older glibc on the Ubuntu production machine. I may address this a bit later using a Docker container to build the application, but that will be something for down the road.

Importing the CSV data into the database

There is a plethora of options for importing data; this is probably one of the areas with the largest number of choices. How, then, do you decide on the process to follow for managing your data? This is where a deeper understanding of the different database platforms, data access patterns, and, most importantly, what you want out of the data is imperative for making these decisions.

Raw CSV import

At first glance, why not just use a database-specific tool for importing a CSV? There are tools out there like phpMyAdmin, DBeaver, MySQL Workbench, and DataGrip. You could just select the file that you want to insert, specify that it is a CSV, and hit go. After the data was imported you would be able to start querying it. Again, sometimes this isn't a bad solution, but other times you want more fine-grained control over the data that is going into your database. Maybe you want to clean up some values, e.g. normalize abbreviations. Or perhaps you want to reject certain records if they don't meet the requirements of your project. Because of these two considerations, I decided to use a custom importer.

Writing a custom importer

OK, so how does someone get started writing a custom importer? Maybe you're curious what that even means. The high-level overview is this: there are a lot of programming languages out there, and many of them have libraries that allow them to talk to databases. You pick a language that you like, that you're proficient in, or that you'd like to learn better. Then you make sure that it has a library available to interface with the database management system that you've selected. And you get started writing some code.

Which language did I pick for my import? I chose Go. It is one of my favorites actually. It is a very simple language to wrap your head around, it has excellent libraries for working with CSV files and databases, and its small, static binaries make it much easier to deploy. I also like to write my web services in Go, so I'd be creating my data models with that in mind. Why not write them to work with the importer and the web service at the same time?

My import process

I'm going to go through a simplified version of my import process here, as it is one that I have used on a number of projects and find it really helps me with validating and cleaning up data.

Because this is Go, I start off with a struct that I want my imported data to go into.

type Hunt struct {
   gorm.Model
   HuntNumber string `gorm:"unique_index:number_year_type_species"`
   HuntYear int `gorm:"index:hunt_year;unique_index:number_year_type_species"`
   HuntName string
   HuntType string `gorm:"index:hunt_type;unique_index:number_year_type_species"`
   Permits int
   HuntersAfield int
   Harvest int
   SuccessRate float64
   HunterSatisfaction float64
}

This struct closely maps the columns of the CSV. It isn't the final struct that I used; we'll talk about that in a bit.

The next step is to open the CSV file and read in the records. I have a main program that will open up the file and create a CSV reader:

in, err := os.Open("hunts.csv")
if err != nil {
    log.Fatal(err) // Nothing to import if the file can't be opened.
}
cr := csv.NewReader(in)

Now that I have the CSV reader set up and ready to go, I can start reading in the records.

for {
    record, err := cr.Read() // This reads one record from the file and returns a string slice to be used.
    if err == io.EOF {
        break // We're done here, so let's leave the for loop.
    }
    if err != nil {
        log.Fatal(err) // Any other read error means the file is malformed, so stop the import.
    }

    if hunt, err := LoadGeneralEntry(record); err == nil {
        _ = hunt // Append this hunt to the results set, or add it to the database.
    }
}

The next step is to understand LoadGeneralEntry. This method is simplified from what exists in the wild, but we can cover the importance of using a statically typed language.

func LoadGeneralEntry(data []string) (Hunt, error) {
    h := Hunt{} // Initialize the return value.
    h.HuntNumber = data[0]
    h.HuntName = data[1]
    h.HuntType = data[2]
    // Skipping some properties for brevity.

    if permits, err := strconv.Atoi(data[4]); err == nil { // Using Go's Atoi to convert a string to an int. This is good for enforcing data types and verifying the data.
        h.Permits = permits
    }

    // Again, not handling all properties from the record.

    return h, nil
}

Now that we have our hunt parsed out of the CSV record, we can save it to the database that was specified above. For this, I'm using the gorm library for Go. As you can see on the struct above, gorm struct tags were added so our indices are built when auto migrating the struct.
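A minimal sketch of that last step, assuming the gorm v1 style API that matches the struct tags above (the connection string is a placeholder and error handling is simplified):

import (
    "log"

    "github.com/jinzhu/gorm"
    _ "github.com/jinzhu/gorm/dialects/postgres"
)

db, err := gorm.Open("postgres", "host=localhost dbname=hunts sslmode=disable")
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// AutoMigrate creates the hunts table and the indexes declared in the struct tags.
db.AutoMigrate(&Hunt{})

// hunt is the Hunt value returned by LoadGeneralEntry above.
// The unique index rejects duplicates when a file is re-imported.
if err := db.Create(&hunt).Error; err != nil {
    log.Println("skipping duplicate or invalid hunt:", err)
}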

Visualizing

Full result set for 2018.

I said before that importing the data had a lot of options, but visualizing the data is the other area with so many options. In fact there are so many options for visualizing the data that I'm only going to discuss some general approaches before getting to the solution that was finally settled on.

Render on Desktop

Maybe all that you need to do with the data that you extracted from the PDF is look at it yourself or generate one graph and share it via email. If that is the case then no worries: create your graph in Excel and you're done. Excel isn't the only option though; you can use a number of charting libraries if you're a bit more technically inclined.

Many languages allow you to read in the data and output a graph to a file, which is the first step that I took. I used a plotting library for the Go programming language to display the data that I collected.
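As an illustration, a small bar chart written with something like gonum's plot package looks roughly like this. This is just a sketch, not the exact code from the project; the API differs between versions of the package, and the values are simply the harvest numbers from the sample rows above.

import (
    "gonum.org/v1/plot"
    "gonum.org/v1/plot/plotter"
    "gonum.org/v1/plot/vg"
)

// harvestBarChart writes a simple bar chart of harvest counts to a PNG file.
func harvestBarChart() error {
    p := plot.New() // Older versions of the package also return an error here.
    p.Title.Text = "Archery Harvest by Unit"
    p.Y.Label.Text = "Harvest"

    bars, err := plotter.NewBarChart(plotter.Values{163, 30, 157}, vg.Points(20))
    if err != nil {
        return err
    }
    p.Add(bars)
    p.NominalX("Beaver", "Box Elder", "Cache")

    return p.Save(6*vg.Inch, 4*vg.Inch, "harvest.png")
}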

But, again, this case is a bit more complex, and the client wanted to be able to share the graphs from the data on the web.

Server side rendering

This is kind of like using the programming language's charting library to generate the images and then simply displaying them on the web page that you have. This also isn't a bad solution, and it is actually part of a possible caching strategy. The challenge with this, depending on the complexity of your data set, is that it could be hard to pre-generate all of the different charts and graphs that you'd like to display.

Frontend graph libraries

This is the general solution that I gravitated to. The client wanted users to be able to easily look at multiple years, hunt types, and species worth of data. So the solution that was chosen was to create a RESTful API that returns the chosen data set to the frontend and lets it be rendered via a JavaScript charting library. Above is one of the charts that was generated using Vue.js and vue-chartjs.
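On the backend, the handler feeding that chart can be quite small. Here is a rough sketch of that kind of endpoint; the route, query parameter, and query itself are assumptions rather than the production code.

import (
    "encoding/json"
    "net/http"
    "strconv"

    "github.com/jinzhu/gorm"
)

// huntsHandler returns the hunts for the requested year as JSON,
// ready for a frontend charting library to consume.
func huntsHandler(db *gorm.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        year, err := strconv.Atoi(r.URL.Query().Get("year")) // e.g. /api/hunts?year=2018
        if err != nil {
            http.Error(w, "invalid year", http.StatusBadRequest)
            return
        }

        var hunts []Hunt
        if err := db.Where("hunt_year = ?", year).Find(&hunts).Error; err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(hunts)
    }
}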

Revisiting the data import

As you can see in the chart above, there are a lot of data points on the x axis, more than I wanted to display at one time. It just makes it challenging for the user to understand all that they're looking at. I knew then that I needed to be able to split the data up upon import, and I didn't want to go through each CSV file and put extra data into it by hand. So I needed a solution that would scale. Each hunt has a hunt name associated with it, and I realized that I could associate the hunt name with a region based upon its location in the state of Utah. So I came up with this Go map.

var RegionMap = map[string]string{
    "Beaver":                         Southern,
    "Box Elder":                      Northern,
    "Cache":                          Northern,
    "Central Mtns, Manti/San Rafael": Central,
    "Central Mtns, Nebo":             Central,
    "Chalk Creek/East Canyon/Morgan-South Rich": Northern,
    "Fillmore":                     Central,
    "Kamas":                        Northern,
    "La Sal, La Sal Mtns":          Central,
    "Monroe":                       Southern,
    "Mt Dutton":                    Southern,
    "Nine Mile":                    Central,
    "North Slope":                  Northern,
    "Ogden":                        Northern,
    "Oquirrh-Stansbury":            Northern,
    "Panguitch Lake":               Southern,
    "Pine Valley":                  Southern,
    "Plateau, Boulder/Kaiparowits": Southern,
    "Plateau, Fishlake":            Southern,
    "Plateau, Thousand Lakes":      Southern,
    "San Juan, Abajo":              Southern,
    "South Slope, Bonanza/Vernal":  Central,
    "South Slope, Yellowstone":     Central,
    "Southwest Desert":             Central,
    "Wasatch Mtns, East":           Central,
    "Wasatch Mtns, West":           Northern,
    "West Desert, Tintic":          Central,
    "West Desert, West":            Central,
    "Zion":                         Southern,
}

During the import process, where the hunt itself is being parsed from the string array, I could easily check the hunt name against this map. (If you don't know what a map is, it is a simple way to associate the key on the left with the value on the right.) I added a region name property to my hunt struct and populated it with the value at import time. Now I can make queries to the database, using the region as a way to narrow down the values in the response.
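A minimal sketch of that lookup inside the importer, assuming a Region string field was added to the Hunt struct (the field name and the fallback value are illustrative):

// Inside LoadGeneralEntry, after the hunt name has been parsed:
if region, ok := RegionMap[h.HuntName]; ok {
    h.Region = region // Tag the hunt with its region so queries can filter on it.
} else {
    h.Region = "Unknown" // Unexpected hunt names still import, just without a region.
}

Below is what I am now able to display.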

Northern Utah big game hunts for 2018.

Grateful for the custom importer

Adding the region that I mentioned above is why I'm so glad that I went with a custom-coded importer. Rather than having a data frame that is just a generic holder for the data and dumping it in one place or another, I could easily inspect the data that was being parsed and augment it with more information.

Using programming to visualize important data sets

Data is what drives the world we live in today, and each of us has a need to understand something better. It isn't always easy to get our data into the shape that makes the best sense to us. Often we reach for an off-the-shelf solution to this problem. That off-the-shelf solution might be helpful for a piece of the problem, but often it doesn't get us all the way there. That is where knowing a general-purpose programming language can take you the rest of the way.