150+ Python/PySpark Pandas DataFrame Practice Exercises with Solutions: Beginner to Expert Level


As a data analyst, working with tabular data is a fundamental part of your role. Pandas, a popular data manipulation library in Python, offers a powerful tool called the DataFrame to handle and analyze structured data. In this comprehensive guide, we will cover a wide range of exercises that will help you master DataFrame operations using Pandas, including some examples in PySpark.

1. Creating a Simple DataFrame

Let's start by creating a simple DataFrame from scratch. We'll use Pandas to create a DataFrame from a dictionary of data.


import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output:


      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   28        Chicago

Explanation:

In the example above, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. Each column contains data for different individuals.

2. Viewing Data

You can use various methods to view and inspect your DataFrame.


# Display the first few rows of the DataFrame
print(df.head())

# Display basic statistics about the DataFrame
print(df.describe())

# Display column names
print(df.columns)

Output:


      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   28        Chicago

             Age
count   4.000000
mean   26.250000
std     3.500000
min    22.000000
25%    24.250000
50%    26.500000
75%    28.500000
max    30.000000

Index(['Name', 'Age', 'City'], dtype='object')

Explanation:

You can use the head() method to view the first few rows of the DataFrame, the describe() method to display basic statistics, and the columns attribute to display column names.

3. Selecting Columns

You can select specific columns from a DataFrame using the column names.


# Select the 'Name' and 'Age' columns
selected_columns = df[['Name', 'Age']]
print(selected_columns)

Output:


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
3    David   28

Explanation:

In the example above, we selected only the 'Name' and 'Age' columns from the DataFrame.

4. Filtering Data

You can filter rows based on conditions.


# Filter individuals above the age of 25
filtered_data = df[df['Age'] > 25]
print(filtered_data)

Output:


    Name  Age           City
1    Bob   30  San Francisco
3  David   28        Chicago

Explanation:

The example above filters the DataFrame to include only individuals above the age of 25.

5. Sorting Data

You can sort the DataFrame based on a specific column.


# Sort the DataFrame by 'Age' in ascending order
sorted_data = df.sort_values(by='Age')
print(sorted_data)

Output:


      Name  Age           City
2  Charlie   22    Los Angeles
0    Alice   25       New York
3    David   28        Chicago
1      Bob   30  San Francisco

Explanation:

The example above sorts the DataFrame based on the 'Age' column in ascending order.

6. Aggregating Data

You can perform aggregation functions like sum, mean, max, and min on DataFrame columns.


# Calculate the mean age
mean_age = df['Age'].mean()
print("Mean Age:", mean_age)

# Calculate the maximum age
max_age = df['Age'].max()
print("Max Age:", max_age)

Output:


Mean Age: 26.25
Max Age: 30

Explanation:

In the example above, we calculated the mean and maximum age from the 'Age' column.

7. Data Transformation: Adding a New Column

You can add a new column to the DataFrame.


# Add a new column 'Salary' with random salary values
import random
df['Salary'] = [random.randint(40000, 90000) for _ in range(len(df))]
print(df)

Output:


      Name  Age           City  Salary
0    Alice   25       New York   78500
1      Bob   30  San Francisco   62000
2  Charlie   22    Los Angeles   42000
3    David   28        Chicago   74600

Explanation:

In this example, a new column 'Salary' is added to the DataFrame, and random salary values between 40000 and 90000 are assigned to each row. Because the values are random, the salaries you see will differ from the ones shown; the following examples reuse the values above.

8. Data Transformation: Removing a Column

You can remove a column from the DataFrame using the drop method.


# Remove the 'City' column
df_without_city = df.drop('City', axis=1)
print(df_without_city)

Output:


      Name  Age  Salary
0    Alice   25   78500
1      Bob   30   62000
2  Charlie   22   42000
3    David   28   74600

Explanation:

The 'City' column is removed from the DataFrame using the drop method with axis=1.

9. Filtering Data: Select Rows Based on Condition

You can filter the DataFrame to select rows that meet certain conditions.


# Select rows where Age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)

Output:


    Name  Age           City  Salary
1    Bob   30  San Francisco   62000
3  David   28        Chicago   74600

Explanation:

The filtered_df contains rows where the 'Age' column value is greater than 25.

10. Aggregation: Calculating Mean Salary

You can calculate the mean salary of the employees.


# Calculate the mean salary
mean_salary = df['Salary'].mean()
print("Mean Salary:", mean_salary)

Output:


Mean Salary: 64275.0

Explanation:

The mean salary of the employees is calculated using the mean function on the 'Salary' column.

11. Grouping and Aggregation: Calculate Maximum Age per City

You can group the data by a specific column and calculate aggregation functions within each group.


# Group by 'City' and calculate maximum age
max_age_per_city = df.groupby('City')['Age'].max()
print(max_age_per_city)

Output:


City
Chicago          28
Los Angeles      22
New York         25
San Francisco    30
Name: Age, dtype: int64

Explanation:

The groupby function is used to group the data by the 'City' column, and then the max function is applied to calculate the maximum age within each group.

12. Joining DataFrames: Merge Employee and Department Data

You can merge two DataFrames based on a common column.


# Sample Department DataFrame
department_data = {'City': ['New York', 'San Francisco'],
                   'Department': ['HR', 'Finance']}
department_df = pd.DataFrame(department_data)

# Merge Employee and Department DataFrames
merged_df = pd.merge(df, department_df, on='City')
print(merged_df)

Output:


    Name  Age           City  Salary Department
0  Alice   25       New York   78500         HR
1    Bob   30  San Francisco   62000    Finance

Explanation:

The merge function is used to combine the Employee DataFrame with the Department DataFrame based on the common 'City' column. Rows whose City has no match in the Department DataFrame (Los Angeles and Chicago) are dropped, because merge performs an inner join by default.

13. Filtering Data: Select Employees with Salary Greater Than 70000

You can filter rows based on certain conditions.


# Select employees with salary greater than 70000
high_salary_employees = df[df['Salary'] > 70000]
print(high_salary_employees)

Output:


    Name  Age      City  Salary
0  Alice   25  New York   78500
3  David   28   Chicago   74600

Explanation:

The DataFrame is filtered using a condition df['Salary'] > 70000 to select only those employees whose salary is greater than 70000.

14. Sorting Data: Sort Employees by Age in Descending Order

You can sort the DataFrame based on one or more columns.


# Sort employees by age in descending order
sorted_by_age_desc = df.sort_values(by='Age', ascending=False)
print(sorted_by_age_desc)

Output:


      Name  Age           City  Salary
1      Bob   30  San Francisco   62000
3    David   28        Chicago   74600
0    Alice   25       New York   78500
2  Charlie   22    Los Angeles   42000

Explanation:

The sort_values function is used to sort the DataFrame based on the 'Age' column in descending order.

15. Grouping and Aggregating Data: Calculate Average Salary by City

You can group data based on a column and then perform aggregation functions.


# Group by city and calculate average salary
avg_salary_by_city = df.groupby('City')['Salary'].mean()
print(avg_salary_by_city)

Output:


City
Chicago          74600.0
Los Angeles      42000.0
New York         78500.0
San Francisco    62000.0
Name: Salary, dtype: float64

Explanation:

The groupby function is used to group the data by the 'City' column, and then the mean function is applied to the 'Salary' column to calculate the average salary for each city.

16. Merging DataFrames: Merge Employee and Department DataFrames

You can merge two DataFrames based on a common column. The examples that follow switch to a new sample of employees that includes a DepartmentID column, so we build that DataFrame first.


# Create an Employee DataFrame with a DepartmentID column
df = pd.DataFrame({'Name': ['John', 'Robert', 'Mary', 'Bob'],
                   'Age': [25, 22, 24, 30],
                   'City': ['New York', 'New York', 'San Francisco', 'San Francisco'],
                   'Salary': [78500, 65000, 73000, 62000],
                   'DepartmentID': [1, 1, 2, 3]})

# Create a Department DataFrame
departments = pd.DataFrame({'DepartmentID': [1, 2, 3],
                            'DepartmentName': ['HR', 'Finance', 'IT']})

# Merge Employee and Department DataFrames
merged_df = pd.merge(df, departments, on='DepartmentID')
print(merged_df)

Output:


     Name  Age           City  Salary  DepartmentID DepartmentName
0    John   25       New York   78500            1             HR
1  Robert   22       New York   65000            1             HR
2    Mary   24  San Francisco   73000            2        Finance
3     Bob   30  San Francisco   62000            3             IT

Explanation:

The merge function is used to combine the Employee DataFrame and the Department DataFrame based on the 'DepartmentID' column.

17. Sorting Data: Sort Employees by Salary in Descending Order

You can sort the DataFrame based on one or more columns.


# Sort employees by salary in descending order
sorted_df = df.sort_values(by='Salary', ascending=False)
print(sorted_df)

Output:


     Name  Age           City  Salary  DepartmentID
0    John   25       New York   78500            1
2    Mary   24  San Francisco   73000            2
1  Robert   22       New York   65000            1
3     Bob   30  San Francisco   62000            3

Explanation:

The sort_values function is used to sort the DataFrame by the 'Salary' column in descending order.

18. Dropping Columns: Remove the DepartmentID Column

You can drop unnecessary columns from the DataFrame.


# Drop the DepartmentID column
df_without_dept = df.drop(columns='DepartmentID')
print(df_without_dept)

Output:


     Name  Age           City  Salary
0    John   25       New York   78500
1  Robert   22       New York   65000
2    Mary   24  San Francisco   73000
3     Bob   30  San Francisco   62000

Explanation:

The drop function is used to remove the 'DepartmentID' column from the DataFrame.

19. Filtering Data: Get Employees with Salary Above 70000

You can filter rows based on a condition.


# Filter employees with salary above 70000
high_salary_df = df[df['Salary'] > 70000]
print(high_salary_df)

Output:


   Name  Age           City  Salary  DepartmentID
0  John   25       New York   78500            1
2  Mary   24  San Francisco   73000            2

Explanation:

We use boolean indexing to filter rows where the 'Salary' column is greater than 70000.

20. Grouping Data: Calculate Average Salary by City

You can group data based on one or more columns and perform aggregate functions.


# Group by city and calculate average salary
average_salary_by_city = df.groupby('City')['Salary'].mean()
print(average_salary_by_city)

Output:


City
New York         71750.0
San Francisco    67500.0
Name: Salary, dtype: float64

Explanation:

We use the groupby function to group the data by the 'City' column and then calculate the mean of the 'Salary' column for each group.

21. Renaming Columns: Rename DepartmentID to DeptID

You can rename columns in a DataFrame using the rename method.


# Rename DepartmentID column to DeptID
df.rename(columns={'DepartmentID': 'DeptID'}, inplace=True)
print(df)

Output:


     Name  Age           City  Salary  DeptID
0    John   25       New York   78500       1
1  Robert   22       New York   65000       1
2    Mary   24  San Francisco   73000       2
3     Bob   30  San Francisco   62000       3

Explanation:

We use the rename method and provide a dictionary to specify the old column name as the key and the new column name as the value. The inplace=True argument makes the changes in-place.

22. Merging DataFrames: Merge Employee and Department Data

You can merge two DataFrames using the merge function.


# Create department DataFrame
department_data = {'DeptID': [1, 2], 'DepartmentName': ['HR', 'Finance']}
department_df = pd.DataFrame(department_data)

# Merge employee and department DataFrames
merged_df = pd.merge(df, department_df, on='DeptID')
print(merged_df)

Output:


     Name  Age           City  Salary  DeptID DepartmentName
0    John   25       New York   78500       1             HR
1  Robert   22       New York   65000       1             HR
2    Mary   24  San Francisco   73000       2        Finance

Explanation:

We create a new DataFrame department_df to represent the department information. Then, we use the merge function to merge the df DataFrame with the department_df DataFrame based on the 'DeptID' column. Bob's DeptID (3) has no match in department_df, so his row is dropped by the default inner join.

23. Grouping and Aggregation: Calculate Average Salary by Department

You can use the groupby method to group the DataFrame by a specific column and then apply aggregation functions.


# Group by DepartmentName and calculate average salary
average_salary_by_department = merged_df.groupby('DepartmentName')['Salary'].mean()
print(average_salary_by_department)

Output:


DepartmentName
Finance    73000.0
HR         71750.0
Name: Salary, dtype: float64

Explanation:

We use the groupby method to group the merged DataFrame by the 'DepartmentName' column. Then, we calculate the average salary for each department using the mean function on the 'Salary' column within each group.

24. Pivot Table: Create a Pivot Table of Average Salary by Department and Age

Pivot tables allow you to create multi-dimensional summaries of data.


# Create a pivot table of average salary by DepartmentName and Age
pivot_table = merged_df.pivot_table(values='Salary', index='DepartmentName', columns='Age', aggfunc='mean')
print(pivot_table)

Output:


Age                  22       24       25
DepartmentName                           
Finance             NaN  73000.0      NaN
HR              65000.0      NaN  78500.0

Explanation:

We use the pivot_table method to create a pivot table that displays the average salary for each combination of 'DepartmentName' and 'Age'. The aggfunc='mean' argument specifies that the aggregation function should be the mean.

25. Selecting Rows Based on Conditions

You can filter rows from a DataFrame based on certain conditions using boolean indexing.
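
The examples from here on switch to a new, hypothetical employee DataFrame. Here is a minimal setup that reproduces the outputs shown below (all names, ages, and salary values are illustrative):


import pandas as pd

merged_df = pd.DataFrame({'EmployeeID': [1, 2, 3],
                          'Name': ['John', 'Jane', 'Alice'],
                          'Age': [25, 28, 24],
                          'DepartmentName': ['Finance', 'HR', 'Finance'],
                          'Salary': [80000, 78500, 72000]})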


# Select rows where 'Age' is greater than 25
selected_rows = merged_df[merged_df['Age'] > 25]
print(selected_rows)

Output:


   EmployeeID  Name  Age DepartmentName  Salary
1           2  Jane   28             HR   78500

Explanation:

We use boolean indexing to filter rows where the 'Age' column is greater than 25.

26. Sorting DataFrame by Columns

You can sort a DataFrame based on one or more columns using the sort_values function.


# Sort DataFrame by 'Salary' column in descending order
sorted_df = merged_df.sort_values(by='Salary', ascending=False)
print(sorted_df)

Output:


   EmployeeID   Name  Age DepartmentName  Salary
0           1   John   25        Finance   80000
1           2   Jane   28             HR   78500
2           3  Alice   24        Finance   72000

Explanation:

We use the sort_values function to sort the DataFrame based on the 'Salary' column in descending order.

27. Grouping Data

You can group data based on one or more columns using the groupby function.


# Group data by 'DepartmentName' and calculate the average salary
grouped_data = merged_df.groupby('DepartmentName')['Salary'].mean()
print(grouped_data)

Output:


DepartmentName
Finance    76000.0
HR         78500.0
Name: Salary, dtype: float64

Explanation:

We use the groupby function to group the data by the 'DepartmentName' column and then calculate the average salary for each group.


28. Merging DataFrames

You can merge two DataFrames using the merge function.


# Create a new DataFrame with department-wise average salary
department_avg_salary = merged_df.groupby('DepartmentName')['Salary'].mean().reset_index()

# Merge the original DataFrame with the department-wise average salary DataFrame
merged_with_avg_salary = pd.merge(merged_df, department_avg_salary, on='DepartmentName', suffixes=('', '_avg'))
print(merged_with_avg_salary)

Output:


   EmployeeID   Name  Age DepartmentName  Salary  Salary_avg
0           1   John   25        Finance   80000     76000.0
1           3  Alice   24        Finance   72000     76000.0
2           2   Jane   28             HR   78500     78500.0

Explanation:

We first calculate the average salary for each department using the groupby function and create a new DataFrame. Then, we use the merge function to combine the original DataFrame with the department-wise average salary DataFrame based on the 'DepartmentName' column.

29. Pivoting Data

You can pivot data using the pivot_table function.


# Create a pivot table to display average salary for each department and age
pivot_table = merged_df.pivot_table(index='DepartmentName', columns='Age', values='Salary', aggfunc='mean')
print(pivot_table)

Output:


Age                  24      25      28
DepartmentName                        
Finance          72000.0  80000.0     NaN
HR                   NaN     NaN  78500.0

Explanation:

We use the pivot_table function to create a pivot table that displays the average salary for each department and age combination.

30. Working with Missing Data

Missing data can be handled using various functions.


# Check for missing values in the DataFrame
missing_values = merged_df.isnull().sum()
print(missing_values)

Output:


EmployeeID         0
Name               0
Age                0
DepartmentName     0
Salary             0
dtype: int64

Explanation:

The isnull() function checks for missing values in the DataFrame and returns a boolean DataFrame. The sum() function then calculates the total number of missing values for each column.

31. Handling Missing Data

Missing data can be filled using the fillna function.


# Fill missing values in the 'Age' column with the mean age
merged_df['Age'] = merged_df['Age'].fillna(merged_df['Age'].mean())
print(merged_df)

Output:


   EmployeeID   Name  Age DepartmentName  Salary
0           1   John   25        Finance   80000
1           2   Jane   28             HR   78500
2           3  Alice   24        Finance   72000

Explanation:

The fillna() function is used to fill missing values in the 'Age' column with the mean age of the dataset. Assigning the result back to the column is preferred over inplace=True, which pandas discourages for column-level operations. Since this sample has no missing ages, the DataFrame is unchanged.

32. Exporting Data to CSV

DataFrames can be exported to CSV files using the to_csv function.


# Export the DataFrame to a CSV file
merged_df.to_csv('employee_data.csv', index=False)

Output:

A CSV file named 'employee_data.csv' will be created in the working directory.

Explanation:

The to_csv() function is used to export the DataFrame to a CSV file. The index=False parameter prevents the index column from being exported.

33. Exporting Data to Excel

DataFrames can be exported to Excel files using the to_excel function.


# Export the DataFrame to an Excel file
merged_df.to_excel('employee_data.xlsx', index=False)

Output:

An Excel file named 'employee_data.xlsx' will be created in the working directory.

Explanation:

The to_excel() function is used to export the DataFrame to an Excel file. The index=False parameter prevents the index column from being exported. (Writing .xlsx files requires an Excel engine such as openpyxl to be installed.)

34. Merging DataFrames

DataFrames can be merged using the merge function.
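
This snippet assumes separate employee_df and department_df inputs. A hypothetical pair consistent with the output below:


import pandas as pd

employee_df = pd.DataFrame({'EmployeeID': [1, 2, 3],
                            'Name': ['John', 'Jane', 'Alice'],
                            'Age': [25, 28, 24],
                            'DepartmentID': [1, 2, 1],
                            'Salary': [80000, 78500, 72000]})
department_df = pd.DataFrame({'DepartmentID': [1, 2],
                              'DepartmentName': ['Finance', 'HR']})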


# Merge two DataFrames based on a common column
merged_data = pd.merge(employee_df, department_df, on='DepartmentID')
print(merged_data)

Output:


   EmployeeID   Name  Age  DepartmentID  Salary DepartmentName
0           1   John   25             1   80000        Finance
1           3  Alice   24             1   72000        Finance
2           2   Jane   28             2   78500             HR

Explanation:

The merge() function is used to merge two DataFrames based on a common column, in this case, 'DepartmentID'. The resulting DataFrame contains columns from both original DataFrames.

35. Grouping and Aggregating Data

Data can be grouped and aggregated using the groupby function.


# Group data by department and calculate average salary
average_salary = merged_data.groupby('DepartmentName')['Salary'].mean()
print(average_salary)

Output:


DepartmentName
Finance    76000.0
HR         78500.0
Name: Salary, dtype: float64

Explanation:

The groupby() function is used to group the data by the 'DepartmentName' column. The mean() function calculates the average salary for each department.

36. Pivot Tables

Pivot tables can be created using the pivot_table function.


# Create a pivot table to display average salary by department and age
pivot_table = merged_data.pivot_table(values='Salary', index='DepartmentName', columns='Age', aggfunc='mean')
print(pivot_table)

Output:


Age                24      25      28
DepartmentName                       
Finance        72000.0  80000.0     NaN
HR                  NaN     NaN  78500.0

Explanation:

The pivot_table() function creates a pivot table that displays the average salary by department and age. The values parameter specifies the column to aggregate, the index parameter specifies the rows (DepartmentName), the columns parameter specifies the columns (Age), and the aggfunc parameter specifies the aggregation function to use.

37. Handling Missing Data

Missing data can be handled using functions like fillna and dropna.


# Fill missing values with a specific value
df_filled = df.fillna(value=0)

# Drop rows with missing values
df_dropped = df.dropna()

Explanation:

The fillna() function is used to fill missing values in the DataFrame with a specified value, such as 0 in this case. The dropna() function is used to remove rows with missing values from the DataFrame.

38. Sorting Data

DataFrames can be sorted using the sort_values function.


# Sort DataFrame by 'Salary' in ascending order
sorted_df = df.sort_values(by='Salary')
print(sorted_df)

Output:


   EmployeeID   Name  Age  DepartmentID  Salary
2           3  Alice   24             1   72000
1           2   Jane   28             2   78500
0           1   John   25             1   80000

Explanation:

The sort_values() function is used to sort the DataFrame by a specified column, in this case, 'Salary'. The resulting DataFrame is sorted in ascending order by default.

39. Merging DataFrames with Different Columns

DataFrames with different columns can be merged using the merge function with the how parameter.


# Merge two DataFrames with different columns using an outer join
merged_data = pd.merge(df1, df2, how='outer', left_on='ID', right_on='EmployeeID')
print(merged_data)

Output:


   ID   Name  Age  EmployeeID Department
0   1   John   25         1.0    Finance
1   2   Jane   28         2.0         HR
2   3  Alice   24         3.0    Finance
3   4    Bob   30         NaN        NaN

Explanation:

The merge() function can handle merging DataFrames with different columns using different types of joins. In this example, an outer join keeps every row from both DataFrames. Bob (ID 4) has no matching EmployeeID in df2, so his 'EmployeeID' and 'Department' values are NaN, and the presence of NaN upcasts the 'EmployeeID' column to float.

40. Applying Functions to Columns

You can apply custom functions to columns using the apply function.


# Define a custom function
def double_salary(salary):
    return salary * 2

# Apply the custom function to the 'Salary' column
df['Doubled Salary'] = df['Salary'].apply(double_salary)
print(df)

Output:


   EmployeeID   Name  Age  DepartmentID  Salary  Doubled Salary
0           1   John   25             1   80000          160000
1           2   Jane   28             2   78500          157000
2           3  Alice   24             1   72000          144000

Explanation:

The apply() function is used to apply a custom function to each element of a column. In this example, a custom function double_salary is defined to double the salary of each employee, and the function is applied to the 'Salary' column using df['Salary'].apply(double_salary). The result is a new column 'Doubled Salary' containing the doubled salary values.

41. Creating Pivot Tables

Pivot tables can be created using the pivot_table function.


# Create a pivot table with 'Department' as columns and 'Age' as values
pivot_table = df.pivot_table(values='Age', columns='Department', aggfunc='mean')
print(pivot_table)

Output:


Department  Finance    HR
Age            24.5  28.0

Explanation:

The pivot_table() function is used to create a pivot table from a DataFrame. In this example, the pivot table has 'Department' as columns and 'Age' as values, with the aggregation function 'mean' to calculate the average age for each department.

42. Grouping and Aggregating Data

Data can be grouped and aggregated using the groupby function.


# Group data by 'Department' and calculate the average age and salary
grouped_data = df.groupby('Department').agg({'Age': 'mean', 'Salary': 'mean'})
print(grouped_data)

Output:


              Age   Salary
Department                
Finance      24.5  76000.0
HR           28.0  78500.0

Explanation:

The groupby() function is used to group the data based on a specified column, in this case, 'Department'. The agg() function is then used to apply aggregation functions to the grouped data. In this example, the average age and salary for each department are calculated.

43. Merging DataFrames

DataFrames can be merged using the merge function.


# Create two DataFrames
df1 = pd.DataFrame({'EmployeeID': [1, 2, 3],
                    'Name': ['John', 'Jane', 'Alice'],
                    'DepartmentID': [1, 2, 1]})
df2 = pd.DataFrame({'DepartmentID': [1, 2],
                    'DepartmentName': ['Finance', 'HR']})

# Merge the DataFrames based on 'DepartmentID'
merged_df = pd.merge(df1, df2, on='DepartmentID')
print(merged_df)

Output:


   EmployeeID   Name  DepartmentID DepartmentName
0           1   John             1        Finance
1           3  Alice             1        Finance
2           2   Jane             2             HR

Explanation:

The merge() function is used to merge two DataFrames based on a common column, in this case, 'DepartmentID'. The result is a new DataFrame containing the combined data from both DataFrames.

44. Handling Missing Values

Missing values can be handled using functions like dropna and fillna.


# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)

# Fill missing values with a specific value
filled_df = df.fillna(value=0)
print(filled_df)

Output:


   EmployeeID   Name  Age  DepartmentID  Salary
0           1   John   25             1   80000
1           2   Jane   28             2   78500
2           3  Alice   24             1   72000

   EmployeeID   Name  Age  DepartmentID  Salary
0           1   John   25             1   80000
1           2   Jane   28             2   78500
2           3  Alice   24             1   72000

Explanation:

The dropna() function is used to remove rows with any missing values, while the fillna() function is used to fill missing values with a specified value, in this case, 0.

45. Changing Column Data Types

Column data types can be changed using the astype function.


# Change the data type of 'Salary' column to float
df['Salary'] = df['Salary'].astype(float)
print(df.dtypes)

Output:


EmployeeID        int64
Name             object
Age               int64
DepartmentID      int64
Salary          float64
dtype: object

Explanation:

The astype() function is used to change the data type of a column. In this example, the data type of the 'Salary' column is changed from integer to float.

46. Grouping and Aggregating Data

Data can be grouped and aggregated using the groupby function.


# Group data by 'DepartmentID' and calculate the average salary
grouped_df = df.groupby('DepartmentID')['Salary'].mean()
print(grouped_df)

Output:


DepartmentID
1    76000.0
2    78500.0
Name: Salary, dtype: float64

Explanation:

The groupby() function is used to group data based on a specified column, in this case, 'DepartmentID'. The mean() function is then applied to calculate the average salary for each department.

47. Pivoting Data

Data can be pivoted using the pivot_table function.


# Pivot the data to show average salary by department and age
pivot_df = df.pivot_table(values='Salary', index='DepartmentID', columns='Age', aggfunc='mean')
print(pivot_df)

Output:


Age                 24       25       28
DepartmentID                            
1             72000.0  80000.0      NaN
2                 NaN      NaN  78500.0

Explanation:

The pivot_table() function is used to create a pivot table that displays the average salary by department and age. The values parameter specifies the column to aggregate, the index parameter specifies the rows, the columns parameter specifies the columns, and the aggfunc parameter specifies the aggregation function to use.

48. Exporting Data to CSV

Data can be exported to a CSV file using the to_csv function.


# Export DataFrame to a CSV file
df.to_csv('employee_data.csv', index=False)

Explanation:

The to_csv() function is used to export a DataFrame to a CSV file. The index parameter is set to False to exclude the index column from the exported CSV file.

49. Exporting Data to Excel

Data can be exported to an Excel file using the to_excel function.


# Export DataFrame to an Excel file
df.to_excel('employee_data.xlsx', index=False)

Explanation:

The to_excel() function is used to export a DataFrame to an Excel file. The index parameter is set to False to exclude the index column from the exported Excel file.

50. Joining DataFrames

DataFrames can be joined using the merge function.


# Join two DataFrames based on a common column
result_df = pd.merge(df1, df2, on='EmployeeID')
print(result_df)

Output:


   EmployeeID Name_x  Salary_x Name_y  Salary_y
0           1   John     60000  Alice     55000
1           2   Mary     75000    Bob     60000

Explanation:

The merge() function is used to combine two DataFrames based on a common column, in this case, 'EmployeeID'. The result is a new DataFrame containing the combined data from both DataFrames.

51. Merging DataFrames

DataFrames can be merged using the merge function with different types of joins.


# Merge DataFrames with different types of joins
inner_join_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
left_join_df = pd.merge(df1, df2, on='EmployeeID', how='left')
right_join_df = pd.merge(df1, df2, on='EmployeeID', how='right')
outer_join_df = pd.merge(df1, df2, on='EmployeeID', how='outer')

Explanation:

The merge() function can perform different types of joins based on the how parameter. The available options are 'inner' (intersection of keys), 'left' (keys from the left DataFrame), 'right' (keys from the right DataFrame), and 'outer' (union of keys).
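
To make the difference concrete, here is a small sketch with two hypothetical DataFrames that only partially overlap on 'EmployeeID':


import pandas as pd

df1 = pd.DataFrame({'EmployeeID': [1, 2], 'Name': ['John', 'Jane']})
df2 = pd.DataFrame({'EmployeeID': [2, 3], 'Bonus': [500, 700]})

print(pd.merge(df1, df2, on='EmployeeID', how='inner'))  # only EmployeeID 2
print(pd.merge(df1, df2, on='EmployeeID', how='outer'))  # EmployeeIDs 1, 2 and 3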

52. Handling Missing Data

Missing data can be handled using the fillna function.


# Fill missing values with a specific value
df['Salary'] = df['Salary'].fillna(0)

Explanation:

The fillna() function is used to fill missing values in a specific column with a specified value. Assigning the result back to the column is preferred over inplace=True, which can trigger chained-assignment warnings in recent pandas versions.

53. Grouping and Aggregating Data

Data can be grouped and aggregated using the groupby function.


# Grouping and aggregating data
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)

Output:


Department
HR         65000.0
IT         70000.0
Sales      60000.0
Name: Salary, dtype: float64

Explanation:

The groupby() function is used to group the data by a specific column (in this case, 'Department'). The mean() function is then applied to the 'Salary' column to calculate the average salary for each department.

54. Pivot Tables

Pivot tables can be created using the pivot_table function.


# Creating a pivot table
pivot_table = df.pivot_table(index='Department', values='Salary', aggfunc='mean')
print(pivot_table)

Output:


             Salary
Department         
HR          65000.0
IT          70000.0
Sales       60000.0

Explanation:

The pivot_table() function is used to create a pivot table that summarizes and aggregates data based on specified columns. In this example, a pivot table is created with the 'Department' column as the index and the 'Salary' column values are aggregated using the mean() function.

55. Creating a Bar Plot

Bar plots can be created using the plot function.


import matplotlib.pyplot as plt

# Creating a bar plot
df['Salary'].plot(kind='bar')
plt.xlabel('Employee')
plt.ylabel('Salary')
plt.title('Employee Salaries')
plt.show()

Output:

An interactive bar plot will be displayed.

Explanation:

The plot() function can be used to create various types of plots, including bar plots. The kind parameter is set to 'bar' to indicate that a bar plot should be created. Additional labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions. Finally, the show() function is used to display the plot.

56. Creating a Histogram

Histograms can be created using the hist function.


# Creating a histogram
df['Age'].hist(bins=10)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

Output:

An interactive histogram will be displayed.

Explanation:

The hist() function is used to create a histogram plot. The bins parameter determines the number of bins or intervals in the histogram. Additional labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions. Finally, the show() function is used to display the plot.

57. Creating a Box Plot

Box plots can be created using the boxplot function.


# Creating a box plot
df.boxplot(column='Salary', by='Department')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.title('Salary Distribution by Department')
plt.suptitle('')
plt.show()

Output:

An interactive box plot will be displayed.

Explanation:

The boxplot() function is used to create a box plot that visualizes the distribution of a numerical variable ('Salary') based on different categories ('Department'). The column parameter specifies the column to plot, and the by parameter specifies the grouping variable. Labels and titles are added to the plot using the xlabel(), ylabel(), and title() functions. The suptitle() function is used to remove the default title that comes with the plot. Finally, the show() function is used to display the plot.

58. Creating a Scatter Plot

Scatter plots can be created using the scatter function.


# Creating a scatter plot
df.plot.scatter(x='Age', y='Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Age vs Salary')
plt.show()

Output:

An interactive scatter plot will be displayed.

Explanation:

The plot.scatter() function is used to create a scatter plot that visualizes the relationship between two numerical variables ('Age' and 'Salary'). The x parameter specifies the x-axis variable, and the y parameter specifies the y-axis variable. Labels and a title are added to the plot using the xlabel(), ylabel(), and title() functions. Finally, the show() function is used to display the plot.

59. Filtering Rows with Multiple Conditions

You can filter rows in a DataFrame based on multiple conditions using the & (AND) and | (OR) operators.


# Filtering rows with multiple conditions
filtered_df = df[(df['Age'] >= 30) & (df['Salary'] >= 50000)]

Explanation:

The example demonstrates how to filter rows in a DataFrame based on multiple conditions. In this case, we are filtering for rows where the 'Age' is greater than or equal to 30 and the 'Salary' is greater than or equal to 50000. The & operator is used to perform an element-wise AND operation on the conditions.

60. Grouping Data and Calculating Aggregates

You can use the groupby function to group data by one or more columns and then apply aggregate functions to the grouped data.


# Grouping data and calculating aggregates
grouped_df = df.groupby('Department')['Salary'].mean()

Explanation:

The groupby() function is used to group the data by a specific column ('Department' in this case). Then, the ['Salary'].mean() expression calculates the mean salary for each department. The result is a Series with department names as the index and the corresponding mean salaries as values.

61. Merging DataFrames

You can merge two DataFrames based on a common column using the merge function.


# Merging DataFrames
merged_df = pd.merge(df1, df2, on='EmployeeID')

Explanation:

The merge() function is used to merge two DataFrames ('df1' and 'df2') based on a common column ('EmployeeID' in this case). The result is a new DataFrame containing the combined data from both original DataFrames.

62. Handling Missing Data

Missing data can be handled using the fillna function or by dropping rows with missing values using the dropna function.


# Handling missing data
df.fillna(value=0, inplace=True)

Explanation:

The fillna() function is used to fill missing values in the DataFrame with a specified value (in this case, 0). The inplace=True parameter updates the DataFrame in place with the filled values.

63. Pivoting DataFrames

You can pivot a DataFrame using the pivot function to reshape the data based on column values.


# Pivoting a DataFrame
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')

Explanation:

The pivot() function is used to reshape the DataFrame. In this example, the DataFrame is pivoted based on the 'Date' and 'Product' columns, and the 'Sales' column values are used as the values for the pivoted DataFrame.
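
As a quick illustration with hypothetical sales data (note that pivot requires each index/column pair to be unique; otherwise use pivot_table):


import pandas as pd

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
                   'Product': ['A', 'B', 'A', 'B'],
                   'Sales': [10, 20, 15, 25]})

# Each Date/Product pair appears once, so pivot works
print(df.pivot(index='Date', columns='Product', values='Sales'))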

64. Melting DataFrames

Melting a DataFrame can help convert it from a wide format to a long format using the melt function.


# Melting a DataFrame
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Product_A', 'Product_B'])

Explanation:

The melt() function is used to transform the DataFrame from wide format to long format. The 'Date' column is kept as the identifier variable, and the 'Product_A' and 'Product_B' columns are melted into a single column called 'variable', and their corresponding values are in the 'value' column.
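
A minimal sketch with hypothetical product columns:


import pandas as pd

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02'],
                   'Product_A': [10, 15],
                   'Product_B': [20, 25]})

# Four rows result: one per Date/product combination
print(pd.melt(df, id_vars=['Date'], value_vars=['Product_A', 'Product_B']))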

65. Reshaping DataFrames with Stack and Unstack

You can use the stack and unstack functions to reshape DataFrames by stacking and unstacking levels of the index or columns.


# Stacking and unstacking DataFrames
stacked_df = df.stack()
unstacked_df = df.unstack()

Explanation:

The stack() function is used to stack the specified level(s) of columns to produce a Series with a MultiIndex. The unstack() function is used to unstack the specified level(s) of the index to produce a DataFrame with reshaped columns.
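
A small sketch showing the round trip on a hypothetical 2x2 frame:


import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

stacked = df.stack()      # Series with a (row label, column label) MultiIndex
print(stacked)
print(stacked.unstack())  # back to the original 2x2 DataFrame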

66. Creating Pivot Tables

Pivot tables can be created using the pivot_table function to summarize and analyze data.


# Creating a pivot table
pivot_table_df = df.pivot_table(index='Department', values='Salary', aggfunc='mean')

Explanation:

The pivot_table() function is used to create a pivot table. In this example, the pivot table is based on the 'Department' column, and the 'Salary' column values are aggregated using the mean function.

67. Grouping Data in a DataFrame

You can group data in a DataFrame using the groupby function to perform aggregate operations on grouped data.


# Grouping data and calculating mean
grouped_df = df.groupby('Category')['Price'].mean()

Explanation:

The groupby() function is used to group data based on a column ('Category' in this case). The mean() function is then applied to the 'Price' column within each group to calculate the average price for each category.

68. Merging DataFrames

DataFrames can be merged using the merge function to combine data from different sources based on common columns.


# Merging DataFrames
merged_df = pd.merge(df1, df2, on='common_column')

Explanation:

The merge() function is used to combine data from two DataFrames based on a common column ('common_column' in this case). The resulting DataFrame contains columns from both original DataFrames, aligned based on the matching values in the common column.

69. Concatenating DataFrames

DataFrames can be concatenated using the concat function to combine them vertically or horizontally.


# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2])

# Concatenating DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis=1)

Explanation:

The concat() function is used to concatenate DataFrames either vertically (default) or horizontally (if axis=1 is specified). This is useful when you want to combine data from different sources into a single DataFrame.
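
For example, with two hypothetical single-column frames:


import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# ignore_index=True renumbers the rows 0..3 instead of repeating 0, 1
print(pd.concat([df1, df2], ignore_index=True))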

70. Handling Missing Data

Missing data can be handled using functions like dropna, fillna, and interpolate.


# Dropping rows with missing values
cleaned_df = df.dropna()

# Filling missing values with a specific value
filled_df = df.fillna(value)

# Interpolating missing values
interpolated_df = df.interpolate()

Explanation:

Missing data can be handled using various methods. The dropna() function removes rows with missing values, the fillna() function fills missing values with a specified value, and the interpolate() function fills missing values using interpolation methods.
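
A minimal sketch of linear interpolation on a hypothetical series:


import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.interpolate())  # the gap is filled linearly: 1.0, 2.0, 3.0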

71. Reshaping DataFrames

DataFrames can be reshaped using functions like pivot, melt, and stack/unstack.


# Pivot table
pivot_table = df.pivot_table(index='Index', columns='Column', values='Value')

# Melt DataFrame
melted_df = pd.melt(df, id_vars=['ID'], value_vars=['Var1', 'Var2'])

# Stack and unstack
stacked_df = df.stack()
unstacked_df = df.unstack()

Explanation:

DataFrames can be reshaped to change the layout of the data. The pivot_table() function creates a pivot table based on the provided columns, melt() function is used to transform wide data into long format, and stack() and unstack() functions change between multi-level indexed and unindexed representations.

72. Aggregating Data in Groups

DataFrames can be grouped and aggregated using functions like groupby and agg.


# Grouping and aggregating data
grouped = df.groupby('Category')['Value'].agg(['mean', 'sum'])

Explanation:

The groupby() function is used to group data based on a column ('Category' in this case), and the agg() function is then used to perform aggregate operations (e.g., mean, sum) on the grouped data.

73. Applying Functions to Columns

You can apply functions to DataFrame columns using apply or applymap.


# Applying a function to a column
df['NewColumn'] = df['Column'].apply(function)

# Applying a function element-wise to all columns
transformed_df = df.applymap(function)

Explanation:

The apply() function can be used to apply a function to a specific column. The applymap() function is used to apply a function element-wise to all columns in the DataFrame. Note that in pandas 2.1 and later, DataFrame.applymap has been renamed to DataFrame.map; applymap still works but raises a deprecation warning.

74. Using Lambda Functions

Lambda functions can be used for concise operations within DataFrames.


# Applying a lambda function
df['NewColumn'] = df['Column'].apply(lambda x: x * 2)

Explanation:

Lambda functions provide a concise way to define small operations directly within a function call. In this case, the lambda function is applied to each element of the 'Column' and the result is assigned to the 'NewColumn'.

75. Handling Missing Data

Dealing with missing data is a common task in data analysis. Pandas provides various functions to handle missing values.


# Check for missing values
missing_values = df.isnull().sum()

# Drop rows with missing values
cleaned_df = df.dropna()

# Fill missing values with a specific value
df_filled = df.fillna(value)

Explanation:

The isnull() function is used to identify missing values in the DataFrame. The dropna() function is used to remove rows containing missing values, and the fillna() function is used to fill missing values with a specified value.

76. Removing Duplicates

Removing duplicate rows is essential to ensure data accuracy and consistency.


# Removing duplicates based on all columns
deduplicated_df = df.drop_duplicates()

# Removing duplicates based on specific columns
deduplicated_specific_df = df.drop_duplicates(subset=['Column1', 'Column2'])

Explanation:

The drop_duplicates() function removes duplicate rows from the DataFrame. You can specify columns using the subset parameter to consider only certain columns for duplicate removal.

77. Sorting DataFrames

DataFrames can be sorted using the sort_values function.


# Sorting by a single column
sorted_df = df.sort_values(by='Column')

# Sorting by multiple columns
sorted_multi_df = df.sort_values(by=['Column1', 'Column2'], ascending=[True, False])

Explanation:

The sort_values() function is used to sort the DataFrame based on one or more columns. The by parameter specifies the columns to sort by, and the ascending parameter determines whether the sorting is in ascending or descending order.

78. Exporting DataFrames

DataFrames can be exported to various file formats using functions like to_csv, to_excel, and to_sql.


# Export to CSV
df.to_csv('output.csv', index=False)

# Export to Excel
df.to_excel('output.xlsx', index=False)

# Export to SQL database
df.to_sql('table_name', connection_object, if_exists='replace')

Explanation:

DataFrames can be exported to various file formats using functions like to_csv() for CSV files, to_excel() for Excel files, and to_sql() to store data in a SQL database. The index parameter specifies whether to include the index in the exported file.

79. Grouping and Aggregating Data

Grouping data allows you to perform aggregate operations on specific subsets of data.


# Grouping by a single column and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()

# Grouping by multiple columns and calculating sum
grouped_sum = df.groupby(['Category1', 'Category2'])['Value'].sum()

Explanation:

The groupby() function is used to group the DataFrame based on one or more columns. Aggregate functions like mean(), sum(), count(), etc., can then be applied to the grouped data to calculate summary statistics.

80. Reshaping Data

DataFrames can be reshaped using functions like melt and pivot.


# Melting the DataFrame
melted_df = pd.melt(df, id_vars=['ID'], value_vars=['Value1', 'Value2'])

# Creating a pivot table
pivot_table = df.pivot_table(index='Category', columns='Date', values='Value', aggfunc='sum')

Explanation:

The melt() function is used to transform the DataFrame from wide format to long format. The pivot_table() function is used to create a pivot table, aggregating data based on specified rows, columns, and values.

81. Combining DataFrames

DataFrames can be combined using functions like concat, merge, and join.


# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2], axis=0)

# Merging DataFrames based on a common column
merged_df = pd.merge(df1, df2, on='ID', how='inner')

# Joining DataFrames based on index
joined_df = df1.join(df2, how='outer')

Explanation:

The concat() function is used to concatenate DataFrames vertically or horizontally. The merge() function is used to merge DataFrames based on common columns, and the join() function is used to join DataFrames based on index.

82. Time Series Analysis

Pandas provides functionality for working with time series data.


# Converting to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Resampling time series data (resample requires a datetime index)
resampled_df = df.set_index('Date').resample('W').sum()

# Rolling mean calculation
rolling_mean = df['Value'].rolling(window=7).mean()

Explanation:

Pandas allows you to work with time series data by converting date columns to datetime format, resampling data at different frequencies, and calculating rolling statistics like moving averages.

83. Visualizing Data

Data visualization is crucial for understanding patterns and trends in data.


import matplotlib.pyplot as plt
import seaborn as sns

# Line plot
plt.plot(df['Date'], df['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Value over Time')
plt.show()

# Scatter plot
sns.scatterplot(x='X', y='Y', data=df)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

Explanation:

Matplotlib and Seaborn libraries are commonly used for data visualization in Python. You can create various types of plots, including line plots and scatter plots, to visualize relationships and trends in your data.

84. Handling Missing Data

Dealing with missing data is essential for accurate analysis.


# Checking for missing values
missing_values = df.isnull().sum()

# Dropping rows with missing values
df_cleaned = df.dropna()

# Filling missing values with a specific value
df_filled = df.fillna(0)

Explanation:

The isnull() function is used to identify missing values in a DataFrame. You can then use dropna() to remove rows or columns with missing values, and fillna() to replace missing values with a specific value.

85. Data Transformation

You can perform various data transformation operations to prepare data for analysis.


# Applying a function to a column
df['Transformed_Column'] = df['Value'].apply(lambda x: x * 2)

# Applying a function element-wise
df_transformed = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)

# Binning data into categories
df['Category'] = pd.cut(df['Value'], bins=[0, 10, 20, 30], labels=['Low', 'Medium', 'High'])

Explanation:

Data transformation involves modifying, adding, or removing columns in a DataFrame to create new features or prepare data for analysis. You can use functions like apply() and applymap() to transform data based on custom functions.

86. Working with Categorical Data

Categorical data requires special handling to encode and analyze properly.


# Encoding categorical variables
encoded_df = pd.get_dummies(df, columns=['Category'], prefix=['Cat'], drop_first=True)

# Mapping categories to numerical values
category_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Category'] = df['Category'].map(category_mapping)

Explanation:

Categorical data needs to be transformed into numerical format for analysis. You can use one-hot encoding with get_dummies() to create binary columns for each category, or use map() to map categories to specific numerical values.
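
A small sketch of one-hot encoding on a hypothetical column (recent pandas versions produce boolean dummy columns):


import pandas as pd

df = pd.DataFrame({'Category': ['Low', 'High', 'Medium']})

# drop_first=True drops the first category ('High', alphabetically)
print(pd.get_dummies(df, columns=['Category'], prefix='Cat', drop_first=True))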

87. Data Aggregation and Pivot Tables

Aggregating data and creating pivot tables helps summarize information.


# Creating a pivot table
pivot_table = df.pivot_table(index='Category', columns='Month', values='Value', aggfunc='sum')

# Grouping and aggregating data
grouped = df.groupby('Category')['Value'].agg(['sum', 'mean', 'max'])

Explanation:

Pivot tables allow you to create multidimensional summaries of data. You can also use the groupby() function to group data based on specific columns and then apply aggregate functions to calculate summary statistics.

88. Exporting Data

After analysis, you might need to export your DataFrame to different formats.


# Exporting to CSV
df.to_csv('output.csv', index=False)

# Exporting to Excel
df.to_excel('output.xlsx', index=False)

# Exporting to JSON
df.to_json('output.json', orient='records')

Explanation:

Pandas provides methods to export DataFrames to various file formats, including CSV, Excel, and JSON. You can use the to_csv(), to_excel(), and to_json() functions to save your data.

89. Merging DataFrames

Combining data from multiple DataFrames can be useful for analysis.


# Inner join
merged_inner = pd.merge(df1, df2, on='ID', how='inner')

# Left join
merged_left = pd.merge(df1, df2, on='ID', how='left')

# Concatenating DataFrames
concatenated = pd.concat([df1, df2], axis=0)

Explanation:

You can merge DataFrames using different types of joins (inner, outer, left, right) with the merge() function. Use concat() to concatenate DataFrames along a specified axis.

90. Time Series Analysis

Pandas supports time series analysis and manipulation.


# Converting a column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Resampling time series data
df_resampled = df.resample('D', on='Date').sum()

# Shifting time series data
df_shifted = df['Value'].shift(1)

Explanation:

For time series analysis, it's crucial to convert time-related columns to datetime format using pd.to_datetime(). You can resample time series data to a different frequency and apply aggregation functions using resample(). Shifting data can help in calculating differences between consecutive time periods.

91. Plotting Data

Pandas provides built-in methods for data visualization.


# Line plot
df.plot(x='Date', y='Value', kind='line', title='Line Plot')

# Bar plot
df.plot(x='Category', y='Value', kind='bar', title='Bar Plot')

# Histogram
df['Value'].plot(kind='hist', title='Histogram')

Explanation:

Pandas provides easy-to-use methods for creating various types of plots directly from DataFrames. You can create line plots, bar plots, histograms, and more using the plot() function.

92. Advanced Indexing and Selection

Pandas offers advanced indexing and selection capabilities.


# Indexing using boolean conditions
filtered_data = df[df['Value'] > 10]

# Indexing using loc and iloc
selected_data = df.loc[df['Category'] == 'High', 'Value']

# Multi-level indexing
multi_indexed = df.set_index(['Category', 'Date'])

Explanation:

You can use boolean conditions to filter rows that meet specific criteria. The loc and iloc indexers allow you to select data by label or integer-based location, respectively. Multi-level indexing lets you create hierarchical index structures.
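
A quick sketch of the two indexers on a hypothetical frame:


import pandas as pd

df = pd.DataFrame({'Category': ['High', 'Low'], 'Value': [15, 5]})

print(df.loc[df['Category'] == 'High', 'Value'])  # label/boolean based selection
print(df.iloc[0, 1])                              # position based: row 0, column 1 -> 15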

93. Handling Duplicate Data

Duplicate data can affect analysis accuracy, so it's important to handle it.


# Checking for duplicates
duplicate_rows = df.duplicated()

# Dropping duplicates
df_deduplicated = df.drop_duplicates()

# Keeping the first occurrence of duplicates
df_first_occurrence = df.drop_duplicates(keep='first')

Explanation:

Use the duplicated() function to identify duplicate rows in a DataFrame. You can then use drop_duplicates() to remove duplicate rows; keep='first' (the default) keeps the first occurrence of each duplicate, while keep=False drops every duplicated row.

94. Handling Missing Data

Missing data can be problematic for analysis, so it's important to handle it properly.


# Checking for missing values
missing_values = df.isnull()

# Dropping rows with missing values
df_no_missing = df.dropna()

# Filling missing values with a specific value
df_filled = df.fillna(0)

Explanation:

The isnull() function helps you identify missing values in your DataFrame. You can use dropna() to remove rows with missing values or fillna() to replace missing values with a specific value.

95. Aggregating Data

You can perform aggregation operations to summarize data in various ways.


# Grouping data and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()

# Grouping data and calculating sum
grouped_sum = df.groupby('Category')['Value'].sum()

# Pivot tables
pivot_table = df.pivot_table(index='Category', columns='Date', values='Value', aggfunc='mean')

Explanation:

Use the groupby() function to group data based on specific columns and perform aggregation functions such as mean, sum, count, etc. Pivot tables allow you to create a table summarizing data based on multiple dimensions.

96. Reshaping Data

You can reshape data to fit different formats using Pandas.


# Melting data from wide to long format
melted = pd.melt(df, id_vars=['Category'], value_vars=['Jan', 'Feb', 'Mar'])

# Pivoting data from long to wide format
pivoted = melted.pivot_table(index='Category', columns='variable', values='value')

Explanation:

The melt() function helps you reshape data from wide to long format, where each row represents a unique combination of variables. The pivot_table() function can then be used to reshape the long format data back to wide format.
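
A compact sketch of the round trip on hypothetical monthly data:


import pandas as pd

df = pd.DataFrame({'Category': ['X', 'Y'],
                   'Jan': [1, 2], 'Feb': [3, 4], 'Mar': [5, 6]})

melted = pd.melt(df, id_vars=['Category'], value_vars=['Jan', 'Feb', 'Mar'])
pivoted = melted.pivot_table(index='Category', columns='variable', values='value')
print(pivoted)  # one row per Category again, with Feb/Jan/Mar columns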

97. Working with Text Data

Pandas supports text data manipulation and analysis.


# Extracting substrings
df['First Name'] = df['Full Name'].str.split().str[0]

# Counting the number of words in each row
word_count = df['Text Column'].str.split().apply(len)

# Finding and replacing text
df['Text Column'] = df['Text Column'].str.replace('old', 'new')

Explanation:

Pandas provides methods for working with text data within columns. You can use str.split() to split text into substrings, apply() to perform operations on each element, and str.replace() to find and replace specific text within columns.

98. Exporting Data

Exporting data is essential for sharing analysis results.


# Export to CSV
df.to_csv('data.csv', index=False)

# Export to Excel
df.to_excel('data.xlsx', index=False)

# Export to JSON
df.to_json('data.json', orient='records')

Explanation:

Pandas allows you to export DataFrames to various file formats, including CSV, Excel, and JSON. Use the respective to_* functions and specify the file name. Set index=False to exclude the index column from the export.

99. Working with DateTime Data

Pandas provides tools to work with datetime data efficiently.


# Converting strings to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extracting year, month, day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Calculating time differences
df['TimeDiff'] = df['End Time'] - df['Start Time']

Explanation:

Pandas provides the to_datetime() function to convert strings to datetime objects. You can use dt.year, dt.month, and dt.day to extract date components. Calculating time differences becomes straightforward by subtracting datetime columns.
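
When some strings may fail to parse, to_datetime() can coerce them to NaT instead of raising an error; the format string in this sketch is illustrative:


# Unparseable strings become NaT rather than raising
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce')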

102. Merging DataFrames

Combining data from multiple DataFrames can provide valuable insights.


# Merging based on common column
merged_df = pd.merge(df1, df2, on='ID')

# Merging with different column names
merged_df = pd.merge(df1, df2, left_on='ID1', right_on='ID2')

# Merging on multiple columns
merged_df = pd.merge(df1, df2, on=['ID', 'Date'])

Explanation:

Pandas offers the merge() function to combine DataFrames based on shared columns. You can specify the column to merge on using the on parameter or different columns using left_on and right_on. Merging on multiple columns is also possible by passing a list of column names.

103. Combining DataFrames

Concatenating DataFrames is useful for combining data vertically or horizontally.


# Concatenating vertically
concatenated_df = pd.concat([df1, df2])

# Concatenating horizontally
concatenated_df = pd.concat([df1, df2], axis=1)

Explanation:

The concat() function allows you to concatenate DataFrames vertically (along rows) or horizontally (along columns). Use axis=0 for vertical concatenation and axis=1 for horizontal concatenation.
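
Note that vertical concatenation keeps each DataFrame's original row labels, so the result often contains duplicate index values; a common fix is:


# Rebuild a clean 0..n-1 index on the combined result
concatenated_df = pd.concat([df1, df2], ignore_index=True)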

104. Applying Functions to Columns

Applying functions to DataFrame columns can transform or manipulate data.


# Applying a function element-wise
df['New Column'] = df['Column'].apply(lambda x: x * 2)

# Applying a function to multiple columns
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].applymap(lambda x: x.strip())

Explanation:

You can use the apply() function to apply a function element-wise to a column. To apply a function to multiple columns, use applymap(). The example demonstrates how to double the values in a column and strip whitespace from multiple columns.
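
Note that applymap() is deprecated as of pandas 2.1, where the element-wise DataFrame method was renamed to map(); on a recent version the second call would read:


# Element-wise equivalent on pandas >= 2.1
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].map(lambda x: x.strip())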

105. Categorical Data

Converting data to categorical format can save memory and improve performance.


# Converting to categorical
df['Category'] = df['Category'].astype('category')

# Displaying categories
categories = df['Category'].cat.categories

# Mapping categories to numerical values
df['Category Code'] = df['Category'].cat.codes

Explanation:

By converting categorical data to the 'category' type, you can save memory and improve performance. Use the cat.categories property to display the unique categories and cat.codes to map them to numerical values.

106. Handling Missing Data

Dealing with missing data is essential for data analysis and modeling.


# Checking for missing values
missing_values = df.isnull().sum()

# Dropping rows with missing values
df_cleaned = df.dropna()

# Filling missing values with a specific value
df_filled = df.fillna(value=0)

Explanation:

Use isnull() to identify missing values in a DataFrame. The sum() function calculates the number of missing values per column. You can drop rows with missing values using dropna() or fill missing values with a specific value using fillna().

107. Aggregating Data

Aggregating data provides insights into summary statistics.


# Calculating mean, median, and sum
mean_value = df['Column'].mean()
median_value = df['Column'].median()
sum_value = df['Column'].sum()

# Grouping and aggregating
grouped_data = df.groupby('Category')['Value'].sum()

Explanation:

Aggregating data helps analyze summary statistics. Use mean(), median(), and sum() to calculate these statistics. Grouping data using groupby() allows for aggregation based on specific columns.

108. Reshaping Data

Reshaping data allows for different representations of the same information.


# Pivoting data
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')

# Melting data
melted_df = pd.melt(df, id_vars='Date', value_vars=['Col1', 'Col2'], var_name='Category', value_name='Value')

Explanation:

Pivoting reshapes data by creating a new table whose columns come from the unique values of another column; the pivot_table() function also lets you choose the aggregation function. Melting converts wide-format data to long format, where each row holds a single observation, which makes grouping and plotting easier.

109. Working with Text Data

Manipulating text data is common in data analysis.


# Extracting substring
df['Substr'] = df['Text'].str[0:5]

# Splitting text into two columns (n=1 splits only on the first space)
df[['First Name', 'Last Name']] = df['Name'].str.split(n=1, expand=True)

# Counting occurrences of a substring
df['Count'] = df['Text'].str.count('pattern')

Explanation:

Text manipulation is possible using string methods like str[0:5] to slice out a substring. str.split() with expand=True splits text into separate columns; capping the split with n=1 guarantees exactly two columns, so the assignment cannot fail on names with more than two words. The str.count() function counts occurrences of a substring in a column.

110. Exporting Data

Exporting data is essential for sharing analysis results.


# Exporting to CSV
df.to_csv('data.csv', index=False)

# Exporting to Excel
df.to_excel('data.xlsx', index=False)

# Exporting to JSON
df.to_json('data.json', orient='records')

Explanation:

Use to_csv() to export data to a CSV file. The to_excel() function exports to an Excel file, and to_json() exports to a JSON file with various orientations, such as 'records'.

111. Merging DataFrames

Merging data from multiple DataFrames can provide a comprehensive view of the data.


# Inner join
merged_inner = pd.merge(df1, df2, on='common_column', how='inner')

# Left join
merged_left = pd.merge(df1, df2, on='common_column', how='left')

# Concatenating DataFrames
concatenated_df = pd.concat([df1, df2], axis=0)

Explanation:

Merging DataFrames is useful for combining related data. pd.merge() joins on the specified columns, with the how parameter selecting inner, left, right, or outer behavior. The pd.concat() function concatenates DataFrames along the specified axis.

112. Time Series Analysis

Working with time series data requires specialized techniques.


# Converting to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Resampling data
daily_data = df.resample('D', on='Date').sum()

# Shifting data
df['Shifted'] = df['Value'].shift(1)

Explanation:

Time series analysis involves converting date columns to datetime format using pd.to_datetime(). Resampling data using resample() aggregates data over specified time intervals. Shifting data using shift() offsets data by a specified number of periods.

113. Working with Categorical Data

Categorical data can be encoded and analyzed effectively.


# Encoding categorical data
df['Category'] = df['Category'].astype('category')
df['Category_encoded'] = df['Category'].cat.codes

# One-hot encoding
one_hot_encoded = pd.get_dummies(df['Category'], prefix='Category')

Explanation:

Encode categorical data using astype('category') and cat.codes to assign unique codes to categories. Use pd.get_dummies() for one-hot encoding, creating separate columns for each category.

114. Data Visualization with Pandas

Data visualization helps in understanding patterns and trends.


import matplotlib.pyplot as plt

# Line plot
df.plot(x='Date', y='Value', kind='line', title='Line Plot')

# Histogram
df['Value'].plot(kind='hist', bins=10, title='Histogram')

plt.show()

Explanation:

Data visualization libraries like Matplotlib can be used to create various plots. df.plot() generates line plots, and plot(kind='hist') creates histograms for numeric data.

115. Pivot Tables

Pivot tables help summarize and analyze data from multiple dimensions.


# Creating a pivot table
pivot_table = df.pivot_table(values='Value', index='Category', columns='Date', aggfunc='sum')

# Handling missing values
pivot_table_filled = pivot_table.fillna(0)

Explanation:

Pivot tables are used to summarize data across multiple dimensions. pivot_table() creates a pivot table using specified values, index, columns, and aggregation function. fillna() is used to handle missing values by filling them with a specific value.

116. Groupby and Aggregation

Grouping data and applying aggregation functions helps in obtaining insights.


# Grouping data
grouped_data = df.groupby('Category')['Value'].sum()

# Multiple aggregations
aggregated_data = df.groupby('Category').agg({'Value': ['sum', 'mean']})

Explanation:

groupby() is used to group data based on specified columns. Aggregation functions like sum() and mean() can be applied to the grouped data. agg() allows performing multiple aggregations on different columns.

117. Working with Datetime Index

Datetime index provides flexibility in time-based analysis.


# Setting datetime index (convert to datetime first so resampling works)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Resampling with datetime index
resampled_data = df.resample('M').sum()

Explanation:

Setting a datetime index using set_index() enables time-based analysis. resample() with a datetime index can be used to aggregate data over different time periods ('M' is the month-end alias; pandas 2.2 and later prefer 'ME').
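
A concrete benefit of a datetime index is partial-string indexing; as a sketch (the date labels are illustrative):


# Select every row from a given month or year by label
january_rows = df.loc['2023-01']
rows_2023 = df.loc['2023']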

118. Handling Outliers

Detecting and handling outliers is crucial for accurate analysis.


# Detecting outliers using z-score
import numpy as np
from scipy.stats import zscore

outliers = df[np.abs(zscore(df['Value'])) > 3]

# Handling outliers
df_no_outliers = df[(np.abs(zscore(df['Value'])) < 3)]

Explanation:

Outliers can be detected using the z-score method from the scipy.stats library. Values with z-scores greater than a threshold (e.g., 3) can be considered outliers. Removing outliers helps in obtaining more reliable analysis results.
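
A common alternative, sketched here, is the interquartile-range (IQR) rule, which does not assume normally distributed data:


# Keep rows within 1.5 * IQR of the middle 50% (a common rule of thumb)
q1 = df['Value'].quantile(0.25)
q3 = df['Value'].quantile(0.75)
iqr = q3 - q1
df_iqr_filtered = df[df['Value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]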

119. Exporting Data

Exporting DataFrame data to various formats is essential for sharing and collaboration.


# Exporting to CSV
df.to_csv('data.csv', index=False)

# Exporting to Excel
df.to_excel('data.xlsx', index=False)

# Exporting to JSON
df.to_json('data.json', orient='records')

Explanation:

DataFrames can be exported to various formats like CSV, Excel, and JSON using to_csv(), to_excel(), and to_json() methods. Specifying index=False excludes the index column from the exported data.

120. Merging DataFrames

Merging DataFrames helps in combining data from different sources.


# Inner merge
merged_df_inner = pd.merge(df1, df2, on='Key', how='inner')

# Left merge
merged_df_left = pd.merge(df1, df2, on='Key', how='left')

# Right merge
merged_df_right = pd.merge(df1, df2, on='Key', how='right')

# Outer merge
merged_df_outer = pd.merge(df1, df2, on='Key', how='outer')

Explanation:

Merging DataFrames using the pd.merge() function combines data based on a common key. Different types of merges such as inner, left, right, and outer can be performed based on the requirement.

121. Handling Duplicates

Identifying and removing duplicate rows from DataFrames.


# Identifying duplicates
duplicate_rows = df[df.duplicated()]

# Removing duplicates
df_no_duplicates = df.drop_duplicates()

Explanation:

Duplicate rows can be identified using the duplicated() method. Removing duplicates can be done using the drop_duplicates() method, which by default retains the first occurrence of each duplicated row.

122. Handling Missing Values

Dealing with missing values is crucial for accurate analysis.


# Checking for missing values
missing_values = df.isnull().sum()

# Dropping rows with missing values
df_no_missing = df.dropna()

# Filling missing values
df_filled = df.fillna(0)

Explanation:

Missing values can be identified using isnull(), and the sum of missing values in each column can be calculated using sum(). Rows with missing values can be dropped using dropna(), and missing values can be filled using fillna().

123. String Operations

Performing string operations on DataFrame columns.


# Changing case
df['Column'] = df['Column'].str.lower()

# Extracting substrings
df['Substring'] = df['Column'].str.extract(r'(\d{3})')

Explanation:

String operations can be performed using the str attribute of DataFrame columns. Changing case, extracting substrings, and applying regular expressions are some common string operations.

124. Grouping and Aggregating Data

Grouping data by one or more columns and applying aggregation functions.


# Grouping and summing
grouped_sum = df.groupby('Category')['Value'].sum()

# Grouping and calculating mean
grouped_mean = df.groupby('Category')['Value'].mean()

Explanation:

Grouping data allows you to perform aggregate calculations on subsets of the data. Common aggregation functions include sum, mean, count, and more. You can use the groupby() method to specify the grouping columns and then apply the desired aggregation function.
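
For the count mentioned above, a one-line sketch:


# Grouping and counting non-null values per category
grouped_count = df.groupby('Category')['Value'].count()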

125. Pivoting and Reshaping Data

Pivoting and reshaping data to transform its structure.


# Pivoting data
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')

# Melting data
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Category1', 'Category2'])

Explanation:

Pivoting reshapes data from long to wide format: unique values of one column become new columns, and pivot_table() aggregates any duplicate entries. Melting does the reverse, stacking multiple columns into a single 'variable'/'value' pair using the melt() function.

126. Time Series Analysis

Analyzing time series data using pandas.


# Converting to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Setting date as index
df.set_index('Date', inplace=True)

# Resampling
monthly_data = df.resample('M').sum()

Explanation:

Time series data analysis involves working with date and time data. Converting date strings to datetime objects, setting the date column as the index, and resampling data (e.g., aggregating daily data into monthly data) are common operations in time series analysis.

127. Plotting Data

Visualizing data using pandas plotting capabilities.


import matplotlib.pyplot as plt

# Line plot
df.plot(kind='line', x='Date', y='Value', title='Line Plot')

# Bar plot
df.plot(kind='bar', x='Category', y='Value', title='Bar Plot')

Explanation:

Pandas provides built-in plotting capabilities for visualizing data. Different types of plots, such as line plots, bar plots, histograms, and more, can be created using the plot() function. Matplotlib is commonly used as the backend for pandas plotting.

128. Handling Missing Data

Dealing with missing data in pandas DataFrames.


# Checking for missing values
missing_values = df.isnull().sum()

# Dropping rows with missing values
df_cleaned = df.dropna()

# Filling missing values
df_filled = df.fillna(0)

Explanation:

Handling missing data is crucial in data analysis. You can check for missing values using the isnull() function and then use methods like dropna() to remove rows with missing values or fillna() to fill missing values with a specified value.

129. Merging DataFrames

Combining multiple DataFrames using merge and join operations.


# Merging based on a common column
merged_df = pd.merge(df1, df2, on='ID')

# Inner join
inner_join_df = df1.merge(df2, on='ID', how='inner')

# Outer join
outer_join_df = df1.merge(df2, on='ID', how='outer')

Explanation:

Merging DataFrames involves combining them based on common columns. The merge() function can be used to perform different types of joins, such as inner join, outer join, left join, and right join. The how parameter specifies the type of join to perform.

130. Combining DataFrames

Concatenating multiple DataFrames vertically or horizontally.


# Concatenating vertically
concatenated_df = pd.concat([df1, df2])

# Concatenating horizontally
concatenated_horizontal_df = pd.concat([df1, df2], axis=1)

Explanation:

Combining DataFrames involves stacking them vertically or horizontally. The concat() function is used for this purpose. When concatenating horizontally, the axis parameter should be set to 1.

131. Grouping and Aggregation

Performing group-wise analysis and aggregations on DataFrame columns.


# Grouping by a column and calculating mean
grouped_df = df.groupby('Category')['Value'].mean()

# Applying multiple aggregations
aggregated_df = df.groupby('Category')['Value'].agg(['mean', 'sum', 'count'])

Explanation:

Grouping and aggregation are commonly used for summarizing data based on different categories. The groupby() function is used to group the DataFrame based on a specified column, and then aggregation functions like mean(), sum(), and count() can be applied to calculate statistics for each group.

132. Pivoting and Reshaping

Reshaping DataFrames using pivot tables and stacking/unstacking.


# Creating a pivot table
pivot_table = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')

# Stacking and unstacking
stacked_df = pivot_table.stack()
unstacked_df = stacked_df.unstack()

Explanation:

Pivot tables reshape data into a grid, with one column supplying the row index, another supplying the columns, and an aggregation function filling each cell. Stacking pivots the column labels into an inner level of the row index; unstacking is the inverse, so stack() followed by unstack() recovers the original table.

133. Time Series Analysis

Working with time series data and performing time-based operations.


# Converting column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Setting datetime column as index
df.set_index('Date', inplace=True)

# Resampling time series data
resampled_df = df.resample('W').mean()

Explanation:

Time series data involves working with data that is indexed by time. You can convert a column to a datetime format using pd.to_datetime() and then set it as the index of the DataFrame using set_index(). Resampling allows you to aggregate data based on a specified time frequency (e.g., weekly) using the resample() function.

134. Working with Text Data

Performing text-based operations on DataFrame columns.


# Converting column to uppercase
df['Name'] = df['Name'].str.upper()

# Extracting text using regular expressions
df['Digits'] = df['Text'].str.extract(r'(\d+)')

# Counting occurrences of a substring
df['Count'] = df['Text'].str.count('apple')

Explanation:

Text-based operations involve manipulating and extracting information from text data in DataFrame columns. You can convert text to uppercase using str.upper(), extract specific patterns using regular expressions and str.extract(), and count occurrences of substrings using str.count().

135. Working with Categorical Data

Dealing with categorical data and performing categorical operations.


# Converting column to categorical
df['Category'] = df['Category'].astype('category')

# Creating dummy variables
dummy_df = pd.get_dummies(df['Category'])

# Merging with original DataFrame
df = pd.concat([df, dummy_df], axis=1)

Explanation:

Categorical data represents data that belongs to a specific category. You can convert a column to a categorical type using astype() and create dummy variables using pd.get_dummies(). Dummy variables are used to represent categorical variables as binary columns, which can then be merged with the original DataFrame.
