Introduction
Pandas is a powerful open-source data manipulation and analysis library for Python. It provides easy-to-use data structures and analysis functions, which has made it a staple for data scientists and analysts. In this tutorial, we will cover the basics of Pandas, including how to install it, create DataFrames, perform common operations, and calculate summary statistics.
Prerequisites
Before we get started, make sure you have Python installed on your system. You can download Python from python.org if you haven’t already. Additionally, you’ll need to install Pandas, which can be done using pip, Python’s package installer.
You can install Pandas by running the following command in your terminal or command prompt:
pip install pandas
Once Pandas is installed, you’re ready to start using it!
Importing Pandas
To use Pandas in a Python script or notebook, you first need to import it. By convention, Pandas is imported with the alias pd:
import pandas as pd
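A quick way to confirm that the installation and import worked is to print the library version. This is a minimal sketch; the variable name version is only illustrative, and the value printed depends on what is installed on your system.

```python
import pandas as pd

# Read the version string of the installed Pandas package
version = pd.__version__
print(version)
```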
Creating a DataFrame
The primary data structure in Pandas is the DataFrame, which is a two-dimensional table with rows and columns. You can create a DataFrame from various data sources such as lists, dictionaries, or external files like CSVs or Excel sheets.
Creating from Lists
You can create a DataFrame from a list of lists. Each inner list represents a row in the DataFrame.
import pandas as pd
data = [
    ['Alice', 25, 'Female'],
    ['Bob', 30, 'Male'],
    ['Charlie', 35, 'Male']
]
# Define column names
columns = ['Name', 'Age', 'Gender']
# Create the DataFrame
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
      Name  Age  Gender
0    Alice   25  Female
1      Bob   30    Male
2  Charlie   35    Male
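After creating a DataFrame, it can help to check that it has the shape and column types you expect. The sketch below rebuilds the same DataFrame and inspects it with the shape, columns, and dtypes attributes.

```python
import pandas as pd

data = [
    ['Alice', 25, 'Female'],
    ['Bob', 30, 'Male'],
    ['Charlie', 35, 'Male'],
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Gender'])

print(df.shape)          # tuple of (number of rows, number of columns)
print(list(df.columns))  # column labels as a plain list
print(df.dtypes)         # data type Pandas inferred for each column
```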
Creating from a Dictionary
You can also create a DataFrame from a dictionary where keys are column names and values are lists representing column data.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
# Create the DataFrame
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age  Gender
0    Alice   25  Female
1      Bob   30    Male
2  Charlie   35    Male
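A related form you may run into is a list of dictionaries, where each dictionary is one row instead of one column. The sketch below uses an illustrative variable named records and should produce the same table as the example above.

```python
import pandas as pd

# Each dictionary describes one row; the keys become column names
records = [
    {'Name': 'Alice', 'Age': 25, 'Gender': 'Female'},
    {'Name': 'Bob', 'Age': 30, 'Gender': 'Male'},
    {'Name': 'Charlie', 'Age': 35, 'Gender': 'Male'},
]

df = pd.DataFrame(records)
print(df)
```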
Reading from a CSV File
Pandas makes it easy to read data from external files. For example, to read from a CSV file, you can use pd.read_csv():
import pandas as pd
# Read CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df)
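If you do not have a CSV file at hand, you can still try read_csv() by passing it a file-like object. The sketch below assumes some made-up CSV text (the csv_text variable is hypothetical and stands in for the contents of a file such as data.csv) and wraps it in io.StringIO.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """Name,Age,Gender
Alice,25,Female
Bob,30,Male
Charlie,35,Male
"""

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line of the text is treated as the header row, so it supplies the column names.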
Basic DataFrame Operations
Once you have a DataFrame, you can perform various operations on it.
Viewing Data
To view the first few rows of a DataFrame, you can use head(), which returns five rows by default unless you pass a different number:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Display the first 2 rows
print(df.head(2))
Output:
    Name  Age  Gender
0  Alice   25  Female
1    Bob   30    Male
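The counterpart of head() is tail(), which returns rows from the end of the DataFrame. A minimal sketch using the same data:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# Display the last row
last_row = df.tail(1)
print(last_row)

# head() with no argument asks for five rows; this DataFrame only has three
print(len(df.head()))
```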
Accessing Columns
You can access columns of a DataFrame using their names:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Access the 'Name' column
print(df['Name'])
Output:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
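Selecting a single column returns a Series, while passing a list of column names returns a smaller DataFrame. The sketch below shows both forms; the variable names names and subset are illustrative.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# A single label gives a one-dimensional Series
names = df['Name']
print(type(names).__name__)

# A list of labels gives a DataFrame with only those columns
subset = df[['Name', 'Age']]
print(type(subset).__name__)
print(subset)
```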
Filtering Data
You can filter rows based on certain conditions:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Filter rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
Output:
      Name  Age  Gender
1      Bob   30    Male
2  Charlie   35    Male
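Conditions can also be combined. As a sketch, the example below uses & (and) to keep rows that satisfy two conditions at once. Each condition needs its own parentheses, because & binds more tightly than the comparison operators.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# Keep rows where Gender is 'Male' AND Age is below 35
mask = (df['Gender'] == 'Male') & (df['Age'] < 35)
result = df[mask]
print(result)
```

Use | in the same way when either condition should be enough.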
Adding a New Column
You can add a new column to the DataFrame:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Add a new column 'City'
df['City'] = ['New York', 'Los Angeles', 'Chicago']
print(df)
Output:
      Name  Age  Gender         City
0    Alice   25  Female     New York
1      Bob   30    Male  Los Angeles
2  Charlie   35    Male      Chicago
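A new column can also be computed from an existing one instead of being typed in by hand. In the sketch below, the column name Age_in_5_years is made up for illustration.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# Arithmetic on a column is applied to every row at once
df['Age_in_5_years'] = df['Age'] + 5
print(df)
```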
Deleting a Column
You can delete a column using the drop() method. Passing axis=1 tells Pandas to drop a column rather than a row:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Drop the 'Gender' column
df = df.drop('Gender', axis=1)
print(df)
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
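Note that drop() returns a new DataFrame and leaves the original untouched, which is why the example above assigns the result back to df. The sketch below shows this, using the columns keyword as an alternative to axis=1; the name trimmed is illustrative.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

# columns=... names the columns to remove without needing axis=1
trimmed = df.drop(columns=['Gender'])

print(list(trimmed.columns))  # the new DataFrame has no 'Gender' column
print(list(df.columns))       # the original DataFrame still has it
```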
Summary Statistics
Pandas provides easy ways to calculate summary statistics on your data.
Descriptive Statistics
You can use the describe() method to get descriptive statistics. By default it summarizes only the numeric columns, which is why only Age appears in the output:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male']
}
df = pd.DataFrame(data)
# Get descriptive statistics
print(df.describe())
Output:
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
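The individual statistics are also available as separate methods on a column. As a sketch, the example below computes the mean, median, and standard deviation of Age directly. Pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Male'],
}
df = pd.DataFrame(data)

ages = df['Age']
print(ages.mean())    # 30.0
print(ages.median())  # 30.0
print(ages.std())     # 5.0, since sqrt((25 + 0 + 25) / 2) = 5
```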
GroupBy Operations
You can use groupby() to group data and then perform operations on the groups:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Engineering'],
    'Salary': [60000, 80000, 70000, 75000, 90000]
}
df = pd.DataFrame(data)
# Group by 'Department' and calculate average salary
avg_salary = df.groupby('Department')['Salary'].mean()
print(avg_salary)
Output:
Department
Engineering    81666.666667
HR             65000.000000
Name: Salary, dtype: float64
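If you need more than one statistic per group, agg() accepts a list of function names. The sketch below reuses the same data and computes several values per department; the variable name summary is illustrative.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Engineering'],
    'Salary': [60000, 80000, 70000, 75000, 90000],
}
df = pd.DataFrame(data)

# One row per department, one column per requested statistic
summary = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by department, so a single value can be looked up with, for example, summary.loc['HR', 'mean'].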
Conclusion
Pandas is a powerful library that provides easy-to-use data structures and tools for data analysis in Python. In this tutorial, we covered the basics of Pandas, including creating DataFrames, performing common operations, and calculating summary statistics. This should give you a good foundation to start exploring and analyzing your own datasets using Pandas!