As a data analyst or data scientist, it is a must for you to know how different dataframe can be joined together to make our data manipulation easier. this article will walk you through the process with different types of joins we have in pandas. for a reader to understand how it works you should have the basics understanding of data science, if you are just a beginner in the field, you can start learning here
what is pandas?
Pandas is defined as an open-source library that provides high-performance data manipulation in Python. The name of Pandas is derived from the word Panel Data, which means an Econometrics from Multidimensional data. It is used for data analysis in Python and was developed by Wes McKinney in 2008. Data analysis requires lots of processing, such as restructuring, cleaning or merging, etc. There are different tools are available for fast data processing, such as Numpy, Scipy, Cython, and Panda. But we prefer Pandas because working with Pandas is fast, simple and more expressive than other tools. read more here. to make use of pandas in your project, it must be imported first, common way to do this is to import it as an alias named pd. It’s always written as import pandas as pd.
The pandas package is a powerful tool for manipulating and transforming data in Python. However, when working on an analysis, the data needed could be in multiple tables. for clarifications, I believed as an intermediate learner of data science, you should know how to import or read tabular data into dataframe. if you still have issues with this, click here. the process of combining different tables is called merging or joining tables. I will be referring to any of the terms with the same meaning throughout the articles.
we will be making use of datasets names pulls_one and pulls_two respectively. whereas both contain the name of Data scientist and the number of contributions they made on a particular scalar language on Github, both datasets are imported already, as you can see below:
Here, pull_one and pulls_two are imported and the first five rows are printed so you know the arrangement of the datasets. you can have the datasets here
The pandas package has an excellent DataFrame method for performing this type of merge called .merge(). The merge method takes the first DataFrame, pulls_one, and merges it with the second DataFrame, pulls_two. We use the on argument to tell the method that we want to merge the two DataFrames on the user column. Since we listed the pulls_ones table first, its columns will appear first in the output, followed by the columns from the pulls_two table. This is called an inner join. An inner join will only return rows that have matching values in both tables.
The two tables are related by their user column. We can join them together, matching the user name from each row of the pulls_one table to the user’s name from the pulls_two table. you can see below:
You may have noticed that the merged table has columns with suffixes of underscore x or y. This is because both the pulls_one and pulls_two tables contained pid and date columns. To avoid multiple columns with the same name, they are automatically given a suffix by the .merge method.
We can use the suffix argument of the merge method to control this behavior. We provide a tuple where all of the overlapping columns in the left table are given the suffix ‘_p_one’, and those in the right table will be given the suffix ‘_p_two’. This makes it easier for us to tell the difference between the columns. check below to see how it works:
Thanks for reading. I believe after reading the article, you should be able to merge any dataframe with inner join as well as the arguments it takes. there are other types of joins: left, right, and outer. you can learn more about them here