Data Science

Wrangling Patient Data with OpenPyXL and ElasticSearch

A not-so-simple process, explained.

Luis Bauza

--

For most data science applications, Excel spreadsheets are a useful tool to help visualize and interpret data. When dealing with patient data in a medical setting however, things start to get a little more complicated. “How can I get my patient data (that is stored in a device vendor API) formatted and displayed nicely in an Excel format?” you might be asking yourself. As simple as it may seem to just grab the data and pop it into a .xlsx file using Python, it’s slightly more complicated than it may seem at the surface. Let’s dive in.

If we have our data stored in an ElasticSearch instance (which I recommend for convenient data searching and visualization thanks to Kibana), it is perfectly understandable to simply for loop through your data and push it into an Excel workbook instance using OpenPyXL. The issue with this approach is that the data is:

  • Unformatted
  • Difficult to decipher
  • Cut off at each cell
  • Incompatible for most data visualization requirements

In order to remedy these issues and many more that are a result of no formatting being applied to the spreadsheet, automatic formatting using OpenPyXL is a must. But just how can we begin adding the data to the spreadsheet in the first place? It’s not as simple as say:

Don’t you wish you could do something as simple as this?

If we’re looking at an implementation using ElasticSearch, the first thing that we need to do is connect to the database and pull the necessary data.

To do this, all we need is the URL of the ElasticSearch server along with the username, password, and location where the documents are stored. Along with this, we also have headers that can be added to the POST request used to add data to the server as a documet. Something else that ElasticSearch requires is a date range for getting data.

A simple query (formatted as a dictionary) used to get patient data.

It’s helpful for narrowing requests down to the specific dates that data should be pulled from.

Once we have all of this, we can then create a query object that contains specifiers for the data that we want to pull. Using this query, the reply is received as a JSON response object from the server and within this reply is where we can filter out unnecessary or extraneous data from the returned data.

Query object instantiated, data filtering by timestamp and key name.

If desired, it’s possible to create a dynamic header for the data to be sorted by when using Excel. One possibility is having a header with titles for the categories of data below combined with dates (if the data is time sensitive). These headers are usually simple to generate unless the data is time sensitive. When dealing with this type of data, we must add all possible dates across all of our data to the header and generate it dynamically by pulling this date information from ElasticSearch. Although the efficiency here isn’t as great as it would be by assuming dates, it ensures data integrity and accuracy.

The act of adding data itself to the OpenPyXL spreadsheet is pretty simple:

Guided example of adding data to an OpenPyXL Worksheet (within a Workbook).

Now, once we have the data in our spreadsheet we can do the fun part: formatting! One way of formatting patient data is to sort the data within the spreadsheet. We can use a combination of pandas and numpy to achieve this:

Formatting the data may seem daunting, but these libraries make it simple.

Another, more simple way of giving the data some more customization is to add a background to each row of cells. The background in the following example alternates between gray and white to differentiate data between different rows.

OpenPyXL Worksheets include functions that iterate through columns and allow formatting.

Throughout all of these examples, the Worksheet has been modified within memory. It isn’t until a filename is specified and wb.save(sheet_name) is called that the Workbook is saved to a file within the current directory.

That’s pretty much it! As you might have been able to tell, OpenPyXL is a great tool for adding data and formatting your own Excel spreadsheet. It helps out with data science applications and with the visualization of data. We can take data visualization a step further by creating PDFs using FPDF, matplotlib, and our data to create charts and graphs for a more historical representation.

--

--

Luis Bauza

Software Engineer at Perlogix; FIU Computer Science 2020 Graduate