Data Science

Introduction to Python Pandas for Beginners


Narender Ravulakollu

Technical Content Writer at almaBetter


Published on 16 May, 2023

The most popular libraries for data science in Python are NumPy, pandas, matplotlib, and seaborn. These libraries are used for a variety of tasks, including data manipulation, data analysis, machine learning, and visualization.

In this blog from AlmaBetter’s team, we are going to learn the basics of the pandas library in the Python programming language, so let’s get started.

What is Python pandas?

Pandas is a Python library for data analysis. It provides a high-performance data structure called a DataFrame, which is similar to a table in a relational database. Pandas also provides a set of tools for working with data, including groupby, a method that splits rows into groups so that each group can be summarized separately.

It is often used in conjunction with other Python libraries such as matplotlib and seaborn for data visualization.
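As a minimal sketch of the two ideas above, the snippet below builds a small DataFrame and summarizes it with groupby. The column names and values are made up for illustration and are not part of the dataset used later in this blog.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [10, 20, 30, 50],
})

# Split the rows by team, then compute the mean score within each group
mean_by_team = toy.groupby("team")["score"].mean()
print(mean_by_team)
```

The result is indexed by the grouping column, so mean_by_team["a"] looks up the mean score for team "a".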

How to Install pandas

There are two ways to install pandas on your PC:

  • Anaconda Distribution: First, you need to download and install the Anaconda distribution of Python. Anaconda is a free and easy-to-use environment for scientific Python.

Once Anaconda is installed, you can install pandas using the Conda package manager.

Open a terminal and type:

conda install pandas

Pandas should now be installed.

  • Pip Install: Pip is a package management system that simplifies the process of installing and managing software packages written in Python. To install pandas using pip, just run the following command in your terminal:

pip install pandas
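Whichever route you choose, one quick way to check that the installation worked is to import the package and print its version. This is only a sanity check; the version printed depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is available in the current environment
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import raises ModuleNotFoundError, pandas was probably installed into a different Python environment than the one running the script.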

Getting Started with pandas

Now that pandas is installed, let’s learn the basics of using the package to handle real-world data. In this blog we will use the Titanic dataset. Specifically, we will use the train.csv file.

For each passenger, this file includes the name, passenger ID, gender, age, ticket number, fare, cabin number, and whether the passenger survived.

Importing and Viewing Data

Pandas allows us to convert data from different formats, such as a CSV file, into a DataFrame object. A DataFrame is a data structure we use quite often in pandas that serves as a tabular representation of data. Think of a DataFrame as an Excel spreadsheet or database table.

We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv() function, which takes the path of the file you want to import. To view the DataFrame in a Colab notebook, we can pass it to display() or simply type the name of the variable on the last line of a cell.

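# Import the pandas package
import pandas as pd

# Read the Titanic training CSV file into a DataFrame
df = pd.read_csv("train.csv")

# Display the DataFrame (display() is available in notebooks such as Colab;
# in a plain Python script, use print(df) instead)
display(df)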


Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head() and .tail() methods, which return five rows by default.

df.head()


df.tail()
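Both methods also accept the number of rows to return. The sketch below shows this on a small in-memory CSV, so it runs without train.csv. The rows are placeholders invented for illustration, and the variable names (csv_text, sample) are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: made-up rows with two columns
csv_text = """PassengerId,Name
1,Passenger A
2,Passenger B
3,Passenger C
4,Passenger D
5,Passenger E
6,Passenger F
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, handy for checking how much data was loaded
print(sample.shape)

first_two = sample.head(2)   # first two rows
last_three = sample.tail(3)  # last three rows
print(first_two)
print(last_three)
```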


Selecting Columns

Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.

# Select a single column
df['Name']


The result is a Series object with its own set of attributes and methods. These objects are like arrays and are the building blocks of DataFrames; each DataFrame is made up of a set of Series.
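To make the idea of a Series concrete, here is a short sketch using a tiny hand-made DataFrame. The names and ages are placeholders, not rows from the Titanic file.

```python
import pandas as pd

# Hypothetical two-row DataFrame, invented for this illustration
people = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Age": [22, 38],
})

# Selecting one column with single brackets returns a Series
names = people["Name"]
ages = people["Age"]

print(type(names))      # the pandas Series class
print(names.shape)      # a one-element tuple: the number of rows
print(ages.mean())      # Series have their own methods, such as mean()
print(names.str.upper())  # string methods are available through .str
```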

To select multiple columns at once, we use double brackets: the outer pair indexes the DataFrame, and the inner pair is a Python list of column names separated by commas, as shown below.

# Select multiple columns
passengers = df[['PassengerId', 'Name']]
passengers.head()


The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving on to the analysis, especially if the data is large with many unnecessary variables.
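When the file is large, one option is to load only the columns you need in the first place by using the usecols parameter of pd.read_csv(). The sketch below assumes a small in-memory CSV with made-up rows in place of train.csv, and also confirms that double-bracket selection returns a DataFrame.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv, with made-up rows
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,38
3,Passenger C,26
"""

# Load only two of the three columns while reading the file
subset = pd.read_csv(io.StringIO(csv_text), usecols=["PassengerId", "Name"])
print(subset.columns.tolist())

# Alternatively, load everything and select columns afterwards
full = pd.read_csv(io.StringIO(csv_text))
selected = full[["PassengerId", "Name"]]
print(type(selected))  # the pandas DataFrame class
```

Both approaches end with the same two columns. The usecols route simply avoids reading the unused column into memory.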

Conclusion

The pandas library is a powerful tool for data analysis and manipulation. It is easy to use and has a wide variety of features that make it a valuable tool for any data scientist. In our upcoming blog, we will talk about “pgSQL - The right way to get started”.
