As Data Scientists, we will at times encounter poor-quality data. To be successful, we need to be able to manage data quality issues effectively before any analysis. Thankfully, there are several powerful open-source libraries, such as Pandas, that we can use to process data efficiently. Today we are going to look at the different ways that we can loop over a DataFrame and access its values. Iterating over a DataFrame can be incorporated into the steps that follow initial exploratory data analysis to begin cleansing raw data.
For those of you that are new to data science or unfamiliar…
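As a rough illustration of looping over a DataFrame and accessing its values, the sketch below uses iterrows() and itertuples(). The column names and values are invented for the example and do not come from the article.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({"name": ["Ann", "Bob"], "hours": [7.5, 8.0]})

# iterrows() yields (index, Series) pairs, one pair per row.
row_labels = [f"{idx}: {row['name']}" for idx, row in df.iterrows()]

# itertuples() yields one namedtuple per row; fields are read as attributes.
total_hours = sum(row.hours for row in df.itertuples(index=False))

print(row_labels)
print(total_hours)
```

For larger DataFrames, itertuples() is generally the quicker of the two, and a vectorised column operation is usually quicker again.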
As a Data Analyst or Data Scientist, you will frequently have to combine and analyse data from various data sources. A file type I am commonly asked to analyse is the CSV file. CSV files are popular within the corporate world as they can handle powerful calculations, are easy to use and are often the output type for corporate systems. Today we will demonstrate how to use Python and Pandas to open and read a CSV file on your local machine.
You can install Pandas via pip from PyPI. If this is your first time installing Python packages, please refer to…
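Once Pandas is installed, reading a CSV might look like the minimal sketch below. To keep the example self-contained it reads from an in-memory string rather than a file on disk; the column names and values are made up, and in practice you would pass the path of your own file to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on your local machine.
csv_text = "employee,hours\nAnn,7.5\nBob,8.0\n"

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
```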
In our last article, we shared an embarrassing moment which encouraged us to learn and use Pandas to pivot a DataFrame. Today we are going to look at Pandas' built-in .melt() function to reverse our pivoted data. The .melt() function comes in handy when you need to reshape or unpivot a DataFrame.
Before ripping in, if you’re yet to read Pivoting a Pandas DataFrame or haven’t been exposed to Python Pandas previously, we recommend first beginning with Pandas Series & DataFrame Explained or Python Pandas Iterating a DataFrame. …
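To give a feel for what unpivoting looks like, here is a small sketch of .melt() on a wide table. The employees, years and hours are invented for illustration, and the var_name and value_name labels are arbitrary choices.

```python
import pandas as pd

# Hypothetical pivoted (wide) data: one column per calendar year.
wide = pd.DataFrame(
    {"employee": ["Ann", "Bob"], "2019": [120, 80], "2020": [100, 95]}
)

# melt() keeps id_vars as-is and stacks the remaining columns into rows.
long = wide.melt(id_vars="employee", var_name="year", value_name="hours")

print(long)
```

Each employee and year combination becomes its own row, so the two year columns collapse into a year column and an hours column.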
One of the funniest moments you will have as a Data Analyst or Developer is coming across code you wrote as a junior. This moment recently happened when a stakeholder requested an update of an extract I provided when I first came on board. The request was to analyse casual hours worked during each calendar year for current staff members. Embarrassingly, my naive approach was to create columns using subqueries within the main select statement to segregate the years. Whilst this approach worked, the overall performance of the query was terrible.
My initial approach meant aggregating casual hours worked between…
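For comparison, the same kind of year-by-year breakdown can be sketched in Pandas with pivot_table(). This is only an illustration under assumed column names and made-up figures, not the query or data from the original extract.

```python
import pandas as pd

# Hypothetical long-format data: one row per employee per year.
hours_long = pd.DataFrame(
    {
        "employee": ["Ann", "Ann", "Bob", "Bob"],
        "year": [2019, 2020, 2019, 2020],
        "hours": [120, 100, 80, 95],
    }
)

# One row per employee, one column per year, summing hours in each cell.
hours_wide = hours_long.pivot_table(
    index="employee", columns="year", values="hours", aggfunc="sum"
)

print(hours_wide)
```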
Being able to extract, transform and output data is a crucial skill to develop to be a successful Data Analyst. Today we are going to look at how to use Pandas, Python and XlsxWriter to output multiple DataFrames to the one Excel file. We are going to explore outputting two DataFrames vertically and horizontally to the same tab, then splitting the two DataFrames across two tabs.
Today’s story builds on what we covered in How to Combine Pandas, Python & XlsxWriter, where we looked at outputting your first DataFrame to Excel. …
Dropping duplicates from your data sets is a task you will regularly have to do as a Data Analyst. Whilst in some cases duplicates may be valid, frequently they have been created through lax data integrity or incorrect joining methods during data extraction. To be successful as a Data Analyst, you need to be able to effectively identify invalid duplicates and remove them from your data sets. Not removing these duplicates will affect the quality of your data analysis. Today's story will form a guide that you can refer back to when needing to identify and remove duplicates.
This story…
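As a quick sketch of identifying and then removing duplicates, the example below uses duplicated() and drop_duplicates(). The data is invented, and it treats a row as a duplicate only when every column matches, which may not be the right rule for your data.

```python
import pandas as pd

# Hypothetical data containing exact duplicate rows.
df = pd.DataFrame(
    {"id": [1, 1, 2, 3, 3], "name": ["Ann", "Ann", "Bob", "Cy", "Cy"]}
)

# keep=False flags every member of a duplicate group, useful for inspection.
dupes = df[df.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row by default.
clean = df.drop_duplicates().reset_index(drop=True)

print(dupes)
print(clean)
```

Passing subset=["id"] to either method would compare rows on the id column alone.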
Being able to extract data from a database and output it to Excel is a crucial skill to have as a Data Analyst. Today we are going to look at using a Python package called XlsxWriter to output a small DataFrame to Excel. XlsxWriter is a powerful package that you can use to auto-format Excel worksheets, change the styling and insert objects such as tables.
There are several ways that you can install XlsxWriter. The easiest method would be using the Python package installer called pip. Open up a terminal and run the command pip install xlsxwriter.
To test that…
Today I am going to share with you several tools, packages and code snippets that I have used and developed during my time as a Data Analyst. In our roles as Data Analysts, there are going to be times when we are required to rerun the same report, run a similar report with different parameters or apply the same statistical analysis over differing datasets. Below I will give you a brief overview of some of the tools that you will be able to incorporate into your workflow as a Data Analyst to increase your productivity.
As Data Scientists, we will often find that we are required to analyse data from multiple data sources at the one time. To be successful at achieving this, we need to be able to efficiently merge different data sources using a variety of methods. Today we are going to look at using Pandas' built-in .merge() function to join two data sources using several different join methods.
For those of you that are new to data science or haven’t been exposed to Python Pandas yet, we recommend first beginning with Pandas Series & DataFrame Explained or Python Pandas Iterating a DataFrame…
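As a small taste of the join methods, the sketch below merges two made-up DataFrames on a shared id column using inner, left and outer joins. The table and column names are assumptions for the example only.

```python
import pandas as pd

# Hypothetical data sources sharing an "id" key.
staff = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
hours = pd.DataFrame({"id": [1, 2, 4], "hours": [7.5, 8.0, 6.0]})

# inner: only ids present in both sources.
inner = staff.merge(hours, on="id", how="inner")

# left: every staff row, with NaN hours where there is no match.
left = staff.merge(hours, on="id", how="left")

# outer: every id from either source.
outer = staff.merge(hours, on="id", how="outer")

print(len(inner), len(left), len(outer))
```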
To be successful as Data Scientists, we need to be continuously learning and improving our skills across a wide range of tools. A tool synonymous with Data Science these days is Pandas. Pandas is an incredibly powerful open-source library written in Python. It offers a diverse set of tools that we as Data Scientists can use to clean, manipulate and analyse data. Today we are beginning with the fundamentals and learning two of the most common data structures in Pandas: the Series and the DataFrame.
Pandas and NumPy are open-source libraries written in Python that you can utilise in your…
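To make the two structures concrete, here is a minimal sketch with invented values: a Series is a single labelled column of data, and a DataFrame is a table whose columns are each a Series sharing one index.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
hours = pd.Series([7.5, 8.0], index=["Ann", "Bob"], name="hours")

# A DataFrame: two-dimensional, with named columns over a shared index.
df = pd.DataFrame(
    {"hours": [7.5, 8.0], "team": ["A", "B"]}, index=["Ann", "Bob"]
)

# Selecting a single column from a DataFrame gives back a Series.
hours_column = df["hours"]

print(type(hours_column).__name__)
```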