Introduction to Python Pandas for Beginners
Technical Content Writer at AlmaBetter
The most popular libraries for data science in Python are NumPy, pandas, matplotlib, and seaborn. These libraries are used for a variety of tasks, including data manipulation, data analysis, machine learning, and visualization.
In this blog from AlmaBetter’s team, we are going to learn the basics of the pandas library in the Python programming language, so let’s get started.
Pandas is a Python library for data analysis. It provides a high-performance data structure called a DataFrame, which is similar to a table in a relational database. Pandas also provides a set of tools for working with data, including a powerful data manipulation tool called groupby.
It is often used in conjunction with other Python libraries such as matplotlib and seaborn for data visualization.
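To give a feel for the groupby tool mentioned above, here is a minimal sketch. The tiny table and its values are made up for illustration and are not taken from any real dataset; it assumes pandas is already installed, which is covered in the next section.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.25, 8.0, 8.75],
})

# Split the rows into groups by the "Sex" column,
# then compute the mean of "Fare" within each group
mean_fare = df.groupby("Sex")["Fare"].mean()
print(mean_fare)
```

The result is indexed by the group labels, so mean_fare["female"] gives the average fare for that group.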
There are two ways to install pandas on your PC: through Anaconda, using the Conda package manager, or with pip.
Once Anaconda is installed, you can install pandas using the Conda package manager.
Open a terminal and type:
conda install pandas
Pandas should now be installed.
Alternatively, if you are not using Anaconda, you can install pandas with pip:
pip install pandas
Now that pandas is installed, let’s learn the basics of using the package to handle real-world data. In this blog we will use the Titanic dataset, specifically the train.csv file.
This file includes, for each passenger, the name, passenger ID, gender, age, ticket ID, fare, and cabin number, along with a column indicating whether that passenger survived.
As mentioned previously, pandas allows us to convert data from different formats, such as a CSV file, into a DataFrame object. A DataFrame is a data structure we use quite often in pandas that serves as a tabular representation of data. Think of a DataFrame as an Excel spreadsheet or database table.
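To make the spreadsheet analogy concrete, the sketch below builds a small DataFrame by hand from a Python dictionary. The rows are hypothetical placeholders; only the column names are borrowed from the ones used later in this blog.

```python
import pandas as pd

# Hypothetical rows, invented for illustration;
# each dictionary key becomes a column
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column labels
print(df.dtypes)         # the data type stored in each column
```

Like a spreadsheet, every column has a label and a single data type, and every row has an index (0, 1, 2 by default).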
We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv() function, which takes in the path of the file you want to import. To view the DataFrame in a Colab notebook, we can either type the name of the variable on the last line of a cell or pass it to display().
# import the pandas package
import pandas as pd
# read the Titanic training CSV file
df = pd.read_csv("train.csv")
# display the pandas DataFrame
display(df)
Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head() and .tail() methods.
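A minimal sketch of .head() and .tail() follows. It uses a hypothetical 100-row stand-in rather than the Titanic file so that it runs without train.csv.

```python
import pandas as pd

# Hypothetical stand-in for a long DataFrame: 100 rows
df = pd.DataFrame({"PassengerId": range(1, 101)})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # pass a number to choose how many rows

print(first_rows)
print(last_rows)
```

Both methods return a new, smaller DataFrame and leave the original untouched.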
Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.
# results for a single column
df['Name']
The result is a Series object with its own set of attributes and methods. These objects are like arrays and are the building blocks of DataFrames; each DataFrame is made up of a set of Series.
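As a rough illustration of what a Series offers, the sketch below pulls two columns out of a small hand-made DataFrame and calls a few Series methods on them. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 27.0],
})

names = df["Name"]   # single brackets return a Series
ages = df["Age"]

print(type(names))        # the pandas Series class
print(names.name)         # a Series remembers its column label
print(ages.mean())        # numeric Series support aggregations
print(names.str.upper())  # string Series expose text methods via .str
```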
To select multiple columns at once, we use double brackets and commas between column names as shown below.
# results for multiple columns
passengers = df[['PassengerId', 'Name']]
passengers.head()
The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving on to the analysis, especially if the data is large with many unnecessary variables.
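The difference between single and double brackets can be checked directly. The sketch below uses a hypothetical three-row DataFrame in place of the Titanic data.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
})

subset = df[["PassengerId", "Name"]]  # list of names -> DataFrame
one_col_frame = df[["Name"]]          # one name in a list -> still a DataFrame
one_col_series = df["Name"]           # bare name -> Series

print(type(subset).__name__)
print(type(one_col_frame).__name__)
print(type(one_col_series).__name__)
print(list(df.columns))  # the original DataFrame keeps all its columns
```

The inner brackets are just an ordinary Python list of column names, which is why even a single name wrapped in a list still produces a DataFrame.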
The pandas library is a powerful tool for data analysis and manipulation. It is easy to use and has a wide variety of features that make it a valuable tool for any data scientist. In our upcoming blog, we will talk about “pgSQL - The right way to get started”.