Analysis: Pivot Tables

I have been playing around with the pivot_table function in the Python package, pandas.  The reason for this is because I was using groupby and wanted to see the difference in functionality and how it could impact my work.

Using the pivot_table function is a lot like PivotTables that you may have used or seen in Excel where you can reshape data and quickly apply indexes and columns to perform analysis.  One thing to note before starting is that pivot_table is one  Python function that is constructed of arguments and lists. pivot_table is just one way to do analysis on data which I find to be the fun part in learning.

In my example, I explored pivot_table through a fictitious data set on gift officer data.  This sample only scrapes the surface of this function but I hope you enjoy!

Questions that directed my experimentation with pivot_table.

  1. How many gift officers are in this sample?
  2. How can I see who is assigned to each manager?
  3. How do I aggregate information for each manager?
  4. How do I filter on specific information, like accepted proposals>?

 

As I started up my project, I loaded in my DataFrame and reviewed all the data.

manager_head

After looking at my data, I wanted to see how many managers were actually in the file. To start looking at manager information, I identified my index to be Manager data. Note that in this section, Ask columns look strange and do not total the correct amounts.  Also note that the table is created referencing the DataFrame (df), and the function is under the pandas namespace and so begins with pd., rather than being a method of a DataFrame (like df.groupby).  Due to muscle memory, I make the mistake of beginning to type df.pivot_table.

manager_reference

My next step was to then look at the ID information to see what IDs were linked to specific managers.  This helps me see that record 4413 has worked with three of the four managers in the sample.  The actual values in the data set are incorrect because the default of pivot_table is to report the mean. The returned value can be specified to return sums and lengths.

manager_id

My next step was to add in Columns, or the section that data is compared.  I specified the column to be Status while keeping the index the same. As you can see, this yielded an error because Status does not contain numeric data.  The important thing here to remember is that errors are normal, especially when you are practicing new skills or constructing your argument differently.  The great thing about errors is that you can learn why they occur and how to modify your code.errorerror2

 

Still wanting to check out the Status and learning that only numeric data can be applied to the columns, I rearranged my variables and added Ask_Amount, Actual_Ask and Accepted_Ask.  Note here that in the output, the columns are arranged alphabetically.

manager_status_accept

After looking at the last table, I determined that I did not care to see the IDs anymore in the data.  That was because I felt that I did not need to know information on a record by record level, but rather in aggregate.  To show the sum for each manager I applied the argument aggfunc and set the value to return the sum.  What a world a a difference that makes, huh?

aggfun

My final thought in working through this data set was about filtering.  What if I wanted to filter on a specific value such as accepted ask amounts only?  My first step was define a new set of data.  I did that by making a variable called Officer_table.  You will see that all the code after the equals sign is the pivot_table function previously used before. Then I used .query and specified that I only want look ask information if the Status was Accepted.  To do that I used the operator ==, make sure you check out other Boolean operations and comparisons if you want to get into specifying data in your work.

filtering

 

Hope you enjoyed this quick overview of pivot_table!  You can learn more about them  in the pandas documentation.

 

 

Continue Reading

Formatting: Leading Zeros

How many times have you exported data and realized a little too late that the leading zeros in constituent IDs were no longer showing up?

I know I have, and much to my chagrin I have had to re-send the file or spend time working through a process of making Excel or CSV format the leading zero appropriately.  I found a way using the Python package, pandas that makes leading zero formatting fairly easy and you don’t have to  add zeros or concat two cells together like you may consider doing in using Excel.

Below is an example data and the the steps that I followed to when I found that zeros where disappearing in my Excel document.

Here is my “data” with the IDs formatted as text so the leading zeros are visible.  When I saved this workbook, I saved it as an Excel file.excel_sample

I then imported pandas in my Jupyter Notebook, as I would normally when loading a dataframe for work and before preparing a file to be used in SPSS.  I wanted to see the first few lines of the file using the DataFrame.head() method and saw that the leading zeros had been dropped.

pd_read_excel

After seeing that the leading zeros were not showing I checked the Excel file to confirm that the IDs were saved as text.  Unfortunately, the ID column was saved as text, so there was nothing to alter there.

My next attempt at resolving the issue involved an reading an Excel file, and attempting to change the ID format to string.  Regrettably that lead to an error that ended with 'The 'dtype' option is not supported with the 'python' engine.

 

excel_error

excel_error_code

I then read documentation about the method of reading CSVs in pandas.  From there, I saw that my earlier code, the object defining the datatype (dtype), was meant to be used with with the function read_csv().  Excited, I saved my example data as a CSV, and reloaded the dataframe and applied the datatype object.  In the example below, you can see that once the dtype is applied, the leading zeros are visible in the dataset. read_csv

After getting past the initial hurdle of finding out how I can format specific documents I found this object pretty quick and efficient. How have you found the best way to work with leading zeros, especially if you can only export into Excel?

Continue Reading