🎓 Acharya Nagarjuna University – Distance Education

Diploma in Data Engineering

Industry-focused diploma bridging academic excellence with real-world skills

1 Year Duration
2 Semesters
35 Credits
12 Courses

Program Overview

Comprehensive training in data engineering with industry-standard tools and practices

🎯 Program Objectives

  • Industry-focused learning experiences
  • Hands-on projects and real-world applications
  • Modern tools and technologies
  • Professional skill development
  • Career-ready graduates

📚 Learning Outcomes

  • Master core concepts and fundamentals
  • Apply advanced techniques and methodologies
  • Build production-ready applications
  • Work with industry-standard tools
  • Complete comprehensive capstone projects

🛠️ Technologies Covered

  • Python, pandas, and NumPy for data processing
  • SQL with PostgreSQL, MySQL, and MongoDB
  • Apache Airflow for pipeline orchestration
  • Apache Spark, data lakes, and stream processing
  • Cloud data warehouses: Google BigQuery and Snowflake

Course Structure

2 Semesters • 12 Courses • 35 Total Credits

S.No | Course Code | Course Title | Credits
SEMESTER 1: Foundation & Fundamentals (16 Credits)
1 | DADE101 | Python for Data Engineering | 3
2 | DADE102 | Python for Data Engineering Lab | 2
3 | DADE103 | Data Analysis & Visualization | 3
4 | DADE104 | Data Analysis & Visualization Lab | 2
5 | DADE105 | Database Systems for Data Engineering | 3
6 | DADE106 | Capstone Project 1 | 3
SEMESTER 2: Advanced & Production Systems (19 Credits)
7 | DADE201 | ETL/ELT Pipeline Development | 3
8 | DADE202 | ETL/ELT Pipeline Development Lab | 2
9 | DADE203 | Big Data Processing & Data Lakes | 3
10 | DADE204 | Big Data Processing Lab | 2
11 | DADE205 | Cloud Data Platforms & Data Warehousing | 3
12 | DADE206 | Capstone Project 2 | 6

Detailed Syllabus

DADE101

Python for Data Engineering

📚 3 Credits 📅 Semester 1 📋 Prerequisite: Basic Python knowledge

Course Objectives

  1. To introduce Python programming fundamentals for data engineering tasks
  2. To teach essential Python libraries for data manipulation
  3. To develop skills in file handling and data processing
  4. To enable students to write efficient data processing scripts
  5. To build a strong foundation for advanced data engineering concepts
I

PYTHON FUNDAMENTALS FOR DATA ENGINEERING

Introduction to Python – Why Python for Data Engineering? – Python Installation and Setup – Python IDEs: Jupyter Notebook, VS Code, PyCharm – Running Python Scripts – Python Syntax Basics – Indentation and Code Structure – Comments and Documentation. Variables and Data Types – Python Variables – Numeric Types: int, float, complex – String Operations – Boolean Type – Type Conversion – Dynamic Typing – None Type. Data Structures – Lists: Creation, Indexing, Slicing, List Methods, List Comprehensions – Tuples: Immutability, Use Cases – Dictionaries: Key-Value Pairs, Dictionary Methods, Nested Dictionaries – Sets: Unique Elements, Set Operations.

📝 Illustrative Problems

Create lists and perform basic operations; Manipulate dictionaries for data storage; Use list comprehensions for data transformation; Handle nested data structures; Convert between different data types.
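
💻 Example Sketch

A minimal sketch of the Unit I data structures in a data-engineering setting; the records and field names below are invented for illustration.

```python
# Lists, dicts, sets, comprehensions, and type conversion in one pass.
records = [
    {"name": "alice", "dept": "sales", "salary": 52000},
    {"name": "bob", "dept": "ops", "salary": 48000},
]

# List comprehension: transform a nested structure into a flat list.
names = [r["name"].title() for r in records]

# Dictionary as a lookup table keyed by a unique field.
by_name = {r["name"]: r for r in records}

# Set comprehension deduplicates values; str() converts types.
depts = {r["dept"] for r in records}
salary_text = str(by_name["bob"]["salary"])

print(names, depts, salary_text)
```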

II

CONTROL FLOW AND FUNCTIONS

Control Flow Statements – Conditional Statements: if, elif, else – Comparison Operators – Logical Operators – Nested Conditionals – Looping Statements: for loops, while loops, range() function – Loop Control: break, continue, pass – Nested Loops. Functions – Function Definition and Calling – Function Parameters: Positional, Keyword, Default Arguments – Variable Scope: Local vs Global – Return Values – Lambda Functions – Map, Filter, Reduce Functions – Function Decorators (Introduction). Error Handling – Types of Errors: Syntax Errors, Runtime Errors, Logical Errors – Exception Handling: try, except, finally – Raising Exceptions – Custom Exceptions – Best Practices for Error Handling.

📝 Illustrative Problems

Write functions for data validation; Implement data filtering using filter() and list comprehensions; Handle file reading errors with exception handling; Create reusable data processing functions; Use lambda functions for quick transformations.
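
💻 Example Sketch

A sketch tying Unit II together: a named validation function guarded with try/except, filter() with that predicate, and a lambda for a quick transform. The row layout and rules are hypothetical.

```python
def is_valid_row(row):
    """Return True if the row has a non-empty id and a non-negative amount."""
    try:
        return bool(row["id"]) and float(row["amount"]) >= 0
    except (KeyError, ValueError, TypeError):
        return False  # treat malformed rows as invalid instead of crashing

rows = [
    {"id": "r1", "amount": "19.99"},
    {"id": "", "amount": "5"},       # fails: empty id
    {"id": "r3", "amount": "oops"},  # fails: non-numeric amount
]

clean = list(filter(is_valid_row, rows))                   # filter() with a predicate
amounts = list(map(lambda r: float(r["amount"]), clean))   # lambda for a quick transform
print(clean, amounts)
```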

III

FILE HANDLING AND DATA FORMATS

File Operations – Opening Files: Read, Write, Append Modes – Reading Files: read(), readline(), readlines() – Writing Files: write(), writelines() – File Context Managers (with statement) – File Path Handling: os.path, pathlib. Working with CSV Files – CSV Module – Reading CSV Files: csv.reader(), csv.DictReader() – Writing CSV Files: csv.writer(), csv.DictWriter() – Handling CSV with pandas (Introduction) – CSV Data Cleaning. Working with JSON Files – JSON Module – json.load()/json.loads(), json.dump()/json.dumps() – Reading and Writing JSON – Nested JSON Structures – JSON Data Validation. Working with XML Files – XML Parsing with xml.etree.ElementTree – Reading XML Files – Extracting Data from XML – XML to Dictionary Conversion.

📝 Illustrative Problems

Read and process CSV files; Convert JSON data to Python dictionaries; Parse XML files and extract specific data; Clean and validate data from different file formats; Write processed data to files.
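
💻 Example Sketch

A self-contained sketch of the Unit III file-handling flow: it writes a small CSV, reads it back with csv.DictReader inside a with block, and round-trips the rows through JSON. The file names are placeholders.

```python
import csv
import json
from pathlib import Path

Path("users.csv").write_text("id,name\n1,alice\n2,bob\n")   # toy input file

with open("users.csv", newline="") as f:        # context manager closes the file
    users = list(csv.DictReader(f))             # each row parsed into a dict

with open("users.json", "w") as f:
    json.dump(users, f, indent=2)               # serialize the rows as JSON

parsed = json.loads(Path("users.json").read_text())
print(parsed[0]["name"])                        # -> alice
```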

IV

ESSENTIAL PYTHON LIBRARIES

NumPy Fundamentals – Introduction to NumPy – Creating Arrays: array(), arange(), linspace(), zeros(), ones() – Array Operations: Indexing, Slicing, Reshaping – Array Mathematics: Element-wise Operations, Broadcasting – NumPy Functions: sum(), mean(), std(), min(), max(). Pandas Introduction – Introduction to pandas – Series: Creation, Indexing, Operations – DataFrames: Creation, Indexing, Selection – Reading Data: read_csv(), read_json(), read_excel() – Basic DataFrame Operations: head(), tail(), info(), describe(). Data Cleaning with Pandas – Handling Missing Values: isnull(), fillna(), dropna() – Data Type Conversion – Removing Duplicates – String Operations on DataFrames – Date and Time Handling.

📝 Illustrative Problems

Perform array operations using NumPy; Create DataFrames from various data sources; Clean datasets with missing values; Convert data types and handle date columns; Perform basic statistical operations.
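
💻 Example Sketch

A short NumPy/pandas sketch of Unit IV: array creation and math, then a DataFrame with type conversion, date parsing, and missing-value handling. The sample data is invented.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)          # create and reshape an array
print(arr.mean(), arr.sum(axis=0))         # overall mean, per-column sums

df = pd.DataFrame({
    "city": ["Guntur", "Vijayawada", None],
    "temp": ["31", "33", "30"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
})
df["temp"] = df["temp"].astype(float)      # type conversion
df["date"] = pd.to_datetime(df["date"])    # date parsing
df["city"] = df["city"].fillna("unknown")  # missing-value handling
print(df.describe())
df.info()
```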

V

DATA PROCESSING AND TRANSFORMATION

Data Transformation Techniques – Filtering Data: Boolean Indexing, Query Methods – Sorting Data: sort_values(), sort_index() – Grouping and Aggregation: groupby(), agg() – Merging and Joining: merge(), join(), concat() – Pivot Tables. Data Validation and Quality – Data Type Validation – Range Checking – Pattern Matching – Data Completeness Checks – Outlier Detection (Basic Concepts) – Data Quality Metrics. Working with Large Files – Chunking Large Files – Memory-Efficient Processing – Streaming Data Processing – File Compression: gzip, zipfile. Performance Optimization – Vectorization vs Loops – Using NumPy for Performance – Profiling Python Code – Best Practices for Data Processing.

📝 Illustrative Problems

Transform and aggregate data using pandas; Merge multiple datasets; Validate data quality; Process large files in chunks; Optimize data processing code for performance.
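
💻 Example Sketch

A sketch of the Unit V chunking pattern: reading a large CSV in fixed-size chunks and merging partial aggregates so memory stays bounded. sales.csv and its region/amount columns are hypothetical.

```python
import pandas as pd

totals = {}
# chunksize yields DataFrames of up to 100,000 rows at a time.
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    partial = chunk.groupby("region")["amount"].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + amount   # merge partial sums

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```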

📖 Textbooks

  1. Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media, 2016
  2. Wes McKinney, "Python for Data Analysis", 3rd Edition, O'Reilly Media, 2022
  3. Mark Lutz, "Learning Python", 5th Edition, O'Reilly Media, 2013
DADE102

Python for Data Engineering Lab

📚 2 Credits 📅 Semester 1 📋 Prerequisite: DADE101

Course Objectives

  1. To provide hands-on experience in Python programming for data engineering
  2. To practice file handling and data format operations
  3. To develop skills in data cleaning and transformation
  4. To implement data processing scripts

📖 Textbooks

  1. Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media, 2016
  2. Wes McKinney, "Python for Data Analysis", 3rd Edition, O'Reilly Media, 2022
DADE103

Data Analysis & Visualization

📚 3 Credits 📅 Semester 1 📋 Prerequisite: DADE101

Course Objectives

  1. To master data analysis techniques using pandas and NumPy
  2. To learn data visualization with matplotlib and seaborn
  3. To develop skills in exploratory data analysis (EDA)
  4. To understand statistical concepts for data analysis
  5. To create effective data visualizations
I

ADVANCED PANDAS FOR DATA ANALYSIS

DataFrame Operations – Advanced Indexing: loc, iloc – Boolean Indexing – MultiIndex DataFrames – Reshaping Data: pivot, melt, stack, unstack – Combining Data: concat, merge, join – Handling Time Series Data. Data Aggregation and Grouping – GroupBy Operations – Aggregation Functions: sum, mean, count, min, max – Custom Aggregation Functions – Transform and Filter Operations – Window Functions: rolling, expanding. Data Cleaning and Preprocessing – Handling Missing Data: Detection, Imputation Strategies – Removing Duplicates – Data Type Conversion – String Manipulation: str methods, regex – Date and Time Parsing – Categorical Data Handling.

📝 Illustrative Problems

Perform complex grouping and aggregation operations; Reshape data for analysis; Clean and preprocess messy datasets; Handle time series data; Transform categorical variables.
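
💻 Example Sketch

A compact sketch of Unit I on an invented dataset: pivot to wide form, melt back to long form, grouped aggregation, and a rolling window.

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "store": ["A", "B", "A", "B"],
    "sales": [100, 80, 120, 90],
})

wide = df.pivot(index="month", columns="store", values="sales")   # long -> wide
long_again = wide.reset_index().melt(id_vars="month",
                                     var_name="store",
                                     value_name="sales")          # wide -> long

stats = df.groupby("store")["sales"].agg(["sum", "mean"])         # grouped aggregation
rolling = df["sales"].rolling(window=2).mean()                    # window function
print(wide, stats, rolling, sep="\n\n")
```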

II

STATISTICAL ANALYSIS WITH PYTHON

Descriptive Statistics – Measures of Central Tendency: Mean, Median, Mode – Measures of Dispersion: Variance, Standard Deviation, Range, IQR – Skewness and Kurtosis – Percentiles and Quartiles – Correlation Analysis. Statistical Tests – Hypothesis Testing Concepts – t-tests – Chi-square Tests – ANOVA (Introduction) – Using scipy.stats for Statistical Tests – Interpreting p-values. Probability Distributions – Common Distributions: Normal, Binomial, Poisson – Distribution Fitting – Random Sampling – Using numpy.random for Simulations.

📝 Illustrative Problems

Calculate descriptive statistics for datasets; Perform correlation analysis; Conduct hypothesis tests; Fit probability distributions; Generate random samples from distributions.
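
💻 Example Sketch

A hedged sketch of Unit II: descriptive statistics, correlation, and a two-sample t-test with scipy.stats on simulated normal data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=50, scale=5, size=200)   # random samples from N(50, 5)
b = rng.normal(loc=52, scale=5, size=200)

print(a.mean(), a.std(), np.percentile(a, [25, 50, 75]))  # central tendency, spread
print(np.corrcoef(a, b)[0, 1])                            # correlation coefficient

t_stat, p_value = stats.ttest_ind(a, b)     # two-sample t-test
print(t_stat, p_value)                      # a small p-value suggests the means differ
```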

III

DATA VISUALIZATION WITH MATPLOTLIB

Matplotlib Fundamentals – Matplotlib Architecture: Figure, Axes, Plot – Creating Basic Plots: Line, Bar, Scatter, Histogram – Customizing Plots: Colors, Markers, Labels, Titles – Figure and Axes Customization – Saving Figures. Advanced Plotting – Subplots: Creating Multiple Plots – Plot Annotations: Text, Arrows, Shapes – Legends and Color Bars – Plot Styling: Themes, Grids – 3D Plotting (Introduction). Time Series Visualization – Plotting Time Series Data – Date Formatting – Multiple Time Series – Annotations and Highlights – Seasonal Decomposition Visualization.

📝 Illustrative Problems

Create various types of plots; Customize plot appearance; Create multi-panel visualizations; Visualize time series data; Add annotations and labels to plots.
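
💻 Example Sketch

A matplotlib sketch of Unit III: one figure with two axes, plot customization, an annotation, and saving to disk. The series is random demo data.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(12)
y = np.cumsum(np.random.default_rng(0).normal(size=12))   # fake monthly metric

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))      # two panels in one figure
ax1.plot(x, y, marker="o", color="tab:blue", label="metric")
ax1.set(title="Trend", xlabel="month", ylabel="value")
ax1.annotate("peak", xy=(int(np.argmax(y)), y.max()))     # annotate a point
ax1.legend()

ax2.hist(y, bins=5, color="tab:orange")                   # histogram panel
ax2.set_title("Distribution")

fig.tight_layout()
fig.savefig("report.png")                                 # save instead of show
```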

IV

DATA VISUALIZATION WITH SEABORN

Seaborn Introduction – Seaborn vs Matplotlib – Seaborn Style and Themes – Setting Aesthetic Parameters – Color Palettes. Statistical Visualizations – Distribution Plots: histplot, displot, kdeplot, rugplot – Categorical Plots: barplot, countplot, boxplot, violinplot – Relationship Plots: scatterplot, lineplot, relplot – Regression Plots: regplot, lmplot. Advanced Seaborn Plots – Heatmaps and Clustered Heatmaps – Pair Plots and Pair Grids – Facet Grids for Multi-dimensional Data – Customizing Seaborn Plots.

📝 Illustrative Problems

Create statistical visualizations with seaborn; Build heatmaps for correlation analysis; Create pair plots for exploratory analysis; Use facet grids for multi-dimensional visualization; Customize seaborn plots.
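
💻 Example Sketch

A seaborn sketch of Unit IV using its bundled tips example dataset (load_dataset fetches it over the network on first use): one categorical plot and one correlation heatmap.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")            # seaborn themes and aesthetics
tips = sns.load_dataset("tips")

sns.boxplot(data=tips, x="day", y="total_bill")     # categorical plot
plt.savefig("boxplot.png")
plt.clf()

corr = tips[["total_bill", "tip", "size"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")      # correlation heatmap
plt.savefig("heatmap.png")
```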

V

EXPLORATORY DATA ANALYSIS (EDA)

EDA Workflow – Understanding the Dataset – Data Profiling – Identifying Patterns and Anomalies – Feature Relationships – Data Quality Assessment. Visual EDA Techniques – Univariate Analysis: Distributions, Box Plots – Bivariate Analysis: Scatter Plots, Correlation Heatmaps – Multivariate Analysis: Pair Plots, Parallel Coordinates – Categorical Analysis: Count Plots, Cross-tabulations. Data Storytelling – Choosing Appropriate Visualizations – Creating Dashboards (Basic Concepts) – Presenting Insights – Best Practices for Data Visualization – Common Visualization Mistakes.

📝 Illustrative Problems

Perform complete EDA on a dataset; Create comprehensive visualizations; Identify data patterns and anomalies; Build a data analysis report; Present findings effectively.

📖 Textbooks

  1. Wes McKinney, "Python for Data Analysis", 3rd Edition, O'Reilly Media, 2022
  2. Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media, 2016
  3. Alberto Cairo, "The Functional Art", New Riders, 2012
DADE104

Data Analysis & Visualization Lab

📚 2 Credits 📅 Semester 1 📋 Prerequisite: DADE103

Course Objectives

  1. To practice data analysis techniques on real-world datasets
  2. To create various types of visualizations
  3. To perform exploratory data analysis
  4. To develop skills in data interpretation

📖 Textbooks

  1. Wes McKinney, "Python for Data Analysis", 3rd Edition, O'Reilly Media, 2022
  2. Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media, 2016
DADE105

Database Systems for Data Engineering

📚 3 Credits 📅 Semester 1 📋 Prerequisite: DADE101

Course Objectives

  1. To understand database systems for data engineering applications
  2. To master SQL for data querying and manipulation
  3. To learn database design principles
  4. To work with both relational and NoSQL databases
  5. To understand database optimization for data engineering
I

RELATIONAL DATABASE FUNDAMENTALS

Introduction to Databases – Database Management Systems – Relational Model – Tables, Rows, Columns, Keys – Primary Keys, Foreign Keys – Database Normalization Basics – ACID Properties. SQL Fundamentals – DDL: CREATE TABLE, ALTER TABLE, DROP TABLE – DML: INSERT, UPDATE, DELETE – DQL: SELECT Statement – WHERE Clause – ORDER BY, GROUP BY, HAVING – Aggregate Functions: COUNT, SUM, AVG, MAX, MIN. Advanced SQL – JOINs: INNER, LEFT, RIGHT, FULL OUTER – Subqueries: Scalar, Correlated – Window Functions: ROW_NUMBER, RANK, DENSE_RANK – Common Table Expressions (CTEs) – Views and Materialized Views.

📝 Illustrative Problems

Create database schema for data engineering use case; Write complex queries with joins and subqueries; Use window functions for analytical queries; Create views for data access; Optimize query performance.
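
💻 Example Sketch

A sketch of Unit I SQL (a CTE plus a window function) run through Python's built-in sqlite3 module, which supports window functions with SQLite 3.25+. The schema and data are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 120), ('alice', 80), ('bob', 200);
""")

sql = """
WITH ranked AS (                                        -- common table expression
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer
                        ORDER BY amount DESC) AS rnk    -- window function
    FROM orders
)
SELECT customer, amount FROM ranked WHERE rnk = 1;      -- top order per customer
"""
print(con.execute(sql).fetchall())
```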

II

DATABASE DESIGN FOR DATA ENGINEERING

Data Modeling – Entity-Relationship Modeling – Normalization: 1NF, 2NF, 3NF – Denormalization for Performance – Star Schema and Snowflake Schema – Fact and Dimension Tables. Indexing and Performance – Types of Indexes: B-tree, Hash, Bitmap – Creating Indexes – Query Optimization – EXPLAIN Plans – Index Selection Strategies. Partitioning – Table Partitioning Concepts – Range Partitioning – Hash Partitioning – List Partitioning – Partition Pruning.

📝 Illustrative Problems

Design star schema for analytics; Create appropriate indexes; Analyze query execution plans; Implement table partitioning; Optimize database performance.

III

WORKING WITH POSTGRESQL AND MYSQL

PostgreSQL Features – PostgreSQL Installation and Setup – Advanced Data Types: JSON, JSONB, Arrays – Full-Text Search – PostgreSQL-specific Functions – Extensions and Extensibility. MySQL Features – MySQL Installation and Setup – Storage Engines: InnoDB, MyISAM – MySQL-specific Functions – Performance Schema – MySQL Optimization. Python Database Connectivity – Connecting to Databases: psycopg2, mysql-connector-python – Executing Queries – Parameterized Queries – Connection Pooling – pandas Integration: read_sql, to_sql.

📝 Illustrative Problems

Set up PostgreSQL/MySQL databases; Use advanced data types; Connect Python applications to databases; Execute queries from Python; Integrate pandas with databases.
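
💻 Example Sketch

A hedged sketch of Unit III connectivity with psycopg2 and pandas. It assumes a reachable PostgreSQL server and an existing events table; host, credentials, and names are placeholders, and pandas will warn that it prefers an SQLAlchemy engine over a raw DBAPI connection.

```python
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="analytics",
                        user="de_user", password="secret")

with conn, conn.cursor() as cur:
    # Parameterized query: values are passed separately, never string-formatted.
    cur.execute("INSERT INTO events (name, payload) VALUES (%s, %s)",
                ("signup", '{"plan": "free"}'))

# pandas integration: pull a query result straight into a DataFrame.
df = pd.read_sql("SELECT name, COUNT(*) AS n FROM events GROUP BY name", conn)
print(df)
conn.close()
```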

IV

NOSQL DATABASES FOR DATA ENGINEERING

Introduction to NoSQL – NoSQL Database Types: Document, Key-Value, Column-Family, Graph – When to Use NoSQL – CAP Theorem – MongoDB Overview. MongoDB Fundamentals – MongoDB Installation – Documents and Collections – CRUD Operations – Query Operators – Aggregation Pipeline – Indexing in MongoDB. MongoDB with Python – PyMongo Library – Connecting to MongoDB – CRUD Operations from Python – Aggregation Pipeline in Python – Data Import/Export.

📝 Illustrative Problems

Design MongoDB schema; Perform CRUD operations; Write aggregation pipelines; Connect MongoDB with Python; Import/export data.
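
💻 Example Sketch

A PyMongo sketch of Unit IV CRUD and an aggregation pipeline; it assumes a MongoDB instance on localhost, and the database/collection names are invented.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

orders.insert_many([                                  # create
    {"customer": "alice", "amount": 120},
    {"customer": "bob", "amount": 200},
])
orders.update_one({"customer": "bob"},                # update
                  {"$set": {"amount": 210}})

pipeline = [                                          # aggregation pipeline
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for doc in orders.aggregate(pipeline):
    print(doc)
```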

V

DATABASE INTEGRATION AND DATA PIPELINES

ETL Patterns with Databases – Extract from Multiple Sources – Transform Data – Load to Target Database – Incremental Loading – Change Data Capture (CDC) Concepts. Data Quality and Validation – Data Validation Rules – Constraint Checking – Data Profiling – Data Quality Metrics – Handling Data Quality Issues. Database Backup and Recovery – Backup Strategies – Point-in-Time Recovery – Replication Concepts – High Availability Basics.

📝 Illustrative Problems

Build ETL pipeline with databases; Implement data validation; Handle incremental loads; Set up database backups; Design replication strategy.

📖 Textbooks

  1. Abraham Silberschatz, et al., "Database System Concepts", 7th Edition, McGraw Hill, 2020
  2. Alan Beaulieu, "Learning SQL", 3rd Edition, O'Reilly Media, 2020
  3. Kristina Chodorow, "MongoDB: The Definitive Guide", 3rd Edition, O'Reilly Media, 2019
DADE106

Capstone Project 1

📚 3 Credits 📅 Semester 1 📋 Prerequisite: DADE101, DADE103, DADE105

Course Objectives

  1. To apply Semester 1 concepts in a real-world data project
  2. To demonstrate mastery of Python, data analysis, and database operations
  3. To develop skills in end-to-end data project execution
  4. To practice data storytelling and presentation

📖 Textbooks

  1. Wes McKinney, "Python for Data Analysis", 3rd Edition, O'Reilly Media, 2022
  2. Jake VanderPlas, "Python Data Science Handbook", O'Reilly Media, 2016
DADE201

ETL/ELT Pipeline Development

📚 3 Credits 📅 Semester 2 📋 Prerequisite: DADE101, DADE103, DADE105

Course Objectives

  1. To understand ETL/ELT pipeline concepts and architecture
  2. To master Apache Airflow for workflow orchestration
  3. To learn data transformation techniques
  4. To implement error handling and data quality checks
  5. To build production-ready data pipelines
I

ETL/ELT FUNDAMENTALS

Introduction to ETL/ELT – What is ETL? – Extract, Transform, Load Process – ETL vs ELT – When to Use Each Approach – ETL Architecture Patterns – Data Pipeline Components. Extraction Phase – Data Sources: Databases, APIs, Files, Cloud Storage – Extraction Methods: Full Load, Incremental Load – Change Data Capture (CDC) Concepts – Handling Different Data Formats – Error Handling in Extraction. Transformation Phase – Data Cleaning Operations – Data Validation Rules – Data Type Conversions – Aggregations and Calculations – Data Enrichment – Joining and Merging Data.

📝 Illustrative Problems

Design ETL architecture for given scenario; Implement data extraction from multiple sources; Transform data according to business rules; Handle extraction errors; Validate transformed data.

II

BUILDING ETL PIPELINES WITH PYTHON

Python ETL Libraries – Introduction to ETL Tools – Building Custom ETL Scripts – Using pandas for Transformations – Handling Large Datasets – Memory Management. Pipeline Design Patterns – Sequential Pipelines – Parallel Processing – Pipeline Orchestration Basics – Dependency Management – Error Recovery Strategies. Data Quality in Pipelines – Data Quality Checks – Validation Rules – Data Profiling in Pipelines – Quality Metrics – Handling Quality Issues.

📝 Illustrative Problems

Build Python ETL script; Implement parallel processing; Add data quality checks; Handle pipeline errors; Optimize memory usage.
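
💻 Example Sketch

A compact sketch of a sequential Python ETL with a fail-fast quality gate, in the spirit of Unit II; the file names, columns, and rules are hypothetical.

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["order_id"])           # cleaning
    df["amount"] = df["amount"].astype(float)     # type conversion
    return df[df["amount"] > 0]                   # business rule

def quality_check(df: pd.DataFrame) -> None:
    if df["order_id"].duplicated().any():         # validation rule
        raise ValueError("duplicate order_id values found")

def load(df: pd.DataFrame, path: str) -> None:
    df.to_csv(path, index=False)

df = transform(extract("raw_orders.csv"))
quality_check(df)                                 # stop the pipeline on bad data
load(df, "clean_orders.csv")
```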

III

APACHE AIRFLOW FUNDAMENTALS

Introduction to Apache Airflow – What is Airflow? – Airflow Architecture: Scheduler, Executor, Webserver – DAGs (Directed Acyclic Graphs) – Tasks and Operators – Airflow Installation and Setup. Creating DAGs – DAG Definition – Task Definition – Task Dependencies – Scheduling DAGs – DAG Parameters and Configuration – DAG Best Practices. Airflow Operators – Built-in Operators: PythonOperator, BashOperator, SQLOperator – Custom Operators – Sensor Operators – Transfer Operators – Using Operators Effectively.

📝 Illustrative Problems

Create simple DAG; Define task dependencies; Schedule DAG execution; Use different operators; Handle task failures.
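
💻 Example Sketch

A minimal Airflow 2.x DAG sketch covering Unit III: a DAG definition, three tasks using different operators, and explicit dependencies. The IDs, schedule, and task bodies are placeholders (older 2.x releases spell the schedule argument schedule_interval).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_fn():
    print("transforming...")        # stand-in for real transformation logic

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # Airflow 2.4+; cron strings also work
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load    # task dependencies
```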

IV

ADVANCED AIRFLOW CONCEPTS

Task Dependencies and Control Flow – Complex Dependencies – Branching: BranchPythonOperator – Conditional Execution – Dynamic Task Generation – Task Groups. Variables, Connections, and Secrets – Airflow Variables – Managing Connections – Using Secrets – Environment Variables – Configuration Management. Monitoring and Logging – Airflow UI: Monitoring DAGs – Viewing Logs – Task Status Tracking – Alerting and Notifications – Performance Monitoring. Error Handling and Retries – Task Retries – Retry Policies – Error Notifications – Handling Failures – Recovery Strategies.

📝 Illustrative Problems

Implement complex task dependencies; Use variables and connections; Monitor pipeline execution; Configure retries; Set up alerting.

V

PRODUCTION PIPELINE PRACTICES

Pipeline Testing – Unit Testing DAGs – Integration Testing – Testing Data Transformations – Mocking External Dependencies – Test Data Management. Performance Optimization – Identifying Bottlenecks – Optimizing Transformations – Parallel Execution – Resource Management – Scaling Pipelines. Data Lineage and Documentation – Documenting Pipelines – Data Lineage Tracking – Pipeline Metadata – Version Control for DAGs – Best Practices. Deployment and CI/CD – Deploying DAGs – Version Control Integration – CI/CD for Data Pipelines – Environment Management – Rollback Strategies.

📝 Illustrative Problems

Write tests for pipelines; Optimize pipeline performance; Document pipeline architecture; Deploy pipelines; Implement CI/CD.

📖 Textbooks

  1. Bas Harenslak, Julian Rutger de Ruiter, "Data Pipelines with Apache Airflow", Manning Publications, 2021
  2. Marc Lamberti, "The Airflow Book", Leanpub, 2022
DADE202

ETL/ELT Pipeline Development Lab

📚 2 Credits 📅 Semester 2 📋 Prerequisite: DADE201

Course Objectives

  1. To implement ETL pipelines using Python
  2. To build workflows with Apache Airflow
  3. To practice data transformation techniques
  4. To implement error handling and monitoring

📖 Textbooks

  1. Bas Harenslak, Julian Rutger de Ruiter, "Data Pipelines with Apache Airflow", Manning Publications, 2021
DADE203

Big Data Processing & Data Lakes

📚 3 Credits 📅 Semester 2 📋 Prerequisite: DADE201

Course Objectives

  1. To understand big data concepts and challenges
  2. To master Apache Spark for distributed data processing
  3. To learn data lake architecture and implementation
  4. To process large-scale datasets efficiently
  5. To understand distributed computing concepts
I

BIG DATA FUNDAMENTALS

Introduction to Big Data – What is Big Data? – 3Vs: Volume, Velocity, Variety – Big Data Challenges – Big Data Use Cases – Big Data Technologies Overview. Distributed Computing Concepts – Why Distributed Computing? – Distributed Systems Architecture – Scalability and Fault Tolerance – Data Partitioning – Shuffle Operations. Hadoop Ecosystem Overview – HDFS (Hadoop Distributed File System) – MapReduce Concepts – YARN (Yet Another Resource Negotiator) – Hadoop Ecosystem Tools – When to Use Hadoop.

📝 Illustrative Problems

Identify big data use cases; Design distributed system architecture; Understand data partitioning strategies; Compare big data technologies; Plan data storage strategy.

II

APACHE SPARK FUNDAMENTALS

Introduction to Apache Spark – What is Spark? – Spark vs MapReduce – Spark Architecture: Driver, Executors, Cluster Manager – Spark Components: Spark Core, Spark SQL, Spark Streaming – Spark Installation and Setup. Spark RDDs (Resilient Distributed Datasets) – RDD Concepts – Creating RDDs – RDD Operations: Transformations and Actions – Lazy Evaluation – RDD Persistence and Caching. Spark DataFrames – DataFrame API – Creating DataFrames – DataFrame Operations – DataFrame vs RDD – Catalyst Optimizer.

📝 Illustrative Problems

Create RDDs from data sources; Perform transformations and actions; Work with DataFrames; Optimize Spark operations; Use caching effectively.
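
💻 Example Sketch

A PySpark sketch of Unit II: building a DataFrame, a lazy transformation followed by actions, and caching. It assumes a local pyspark installation; the data is invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 120), ("bob", 200), ("alice", 80)],
    ["customer", "amount"],
)

totals = df.groupBy("customer").agg(F.sum("amount").alias("total"))  # lazy transformation
totals.cache()                  # keep the result around for reuse
totals.show()                   # action: triggers execution
print(totals.count())           # second action reuses the cached result
spark.stop()
```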

III

SPARK SQL AND DATA PROCESSING

Spark SQL – Spark SQL Overview – Creating Tables and Views – SQL Queries in Spark – User-Defined Functions (UDFs) – Window Functions in Spark. Data Processing with Spark – Reading Data: CSV, JSON, Parquet, Avro – Writing Data – Data Transformations – Joins and Aggregations – Handling Large Datasets. Performance Optimization – Understanding Spark Execution Plan – Partitioning Strategies – Broadcast Variables – Accumulators – Tuning Spark Applications.

📝 Illustrative Problems

Write Spark SQL queries; Process large datasets; Optimize join operations; Use broadcast variables; Tune Spark performance.
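
💻 Example Sketch

A Spark SQL sketch of Unit III: registering a temp view, querying it with SQL, and hinting a broadcast join so the small dimension table avoids a shuffle. Table contents are invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sql_demo").getOrCreate()

facts = spark.createDataFrame([(1, 100), (2, 250)], ["region_id", "sales"])
dims = spark.createDataFrame([(1, "south"), (2, "north")], ["region_id", "region"])

facts.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region_id, SUM(sales) AS total FROM sales GROUP BY region_id"
).show()

# Broadcast the small table; explain() shows the resulting physical plan.
facts.join(broadcast(dims), "region_id").explain()
spark.stop()
```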

IV

DATA LAKE ARCHITECTURE

Data Lake Concepts – What is a Data Lake? – Data Lake vs Data Warehouse – Data Lake Architecture – Storage Layer: Object Storage (S3, GCS) – Metadata Management. Data Lake Zones – Raw Zone (Bronze) – Processed Zone (Silver) – Curated Zone (Gold) – Zone Design Patterns – Data Lake Best Practices. Data Lake Implementation – Building Data Lake on Cloud – Data Ingestion Patterns – Data Organization: Partitioning, Bucketing – Schema Evolution – Data Governance Basics.

📝 Illustrative Problems

Design data lake architecture; Implement data lake zones; Organize data in data lake; Handle schema evolution; Plan data governance.

V

SPARK STREAMING AND ADVANCED TOPICS

Spark Streaming Concepts – Stream Processing vs Batch Processing – Spark Streaming Architecture – DStreams (Discretized Streams) – Structured Streaming – Stream Processing Patterns. Working with Streaming Data – Reading from Streams – Transformations on Streams – Writing Streams – Window Operations – Watermarking – Handling Late Data. Spark Integration – Spark with Airflow – Spark with Cloud Storage – Spark with Databases – Spark Cluster Management – Monitoring Spark Applications.

📝 Illustrative Problems

Process streaming data; Implement window operations; Handle late-arriving data; Integrate Spark with other tools; Monitor Spark applications.
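
💻 Example Sketch

A Structured Streaming sketch of Unit V using Spark's built-in rate source: a tumbling window with a watermark for late data, written to the console for a few seconds. All settings are demo values.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream_demo").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = (stream
          .withWatermark("timestamp", "10 seconds")     # tolerate late rows
          .groupBy(window("timestamp", "5 seconds"))    # tumbling window
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination(timeout=20)   # run ~20 seconds for the demo
query.stop()
spark.stop()
```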

📖 Textbooks

  1. Jules Damji, et al., "Learning Spark", 2nd Edition, O'Reilly Media, 2020
  2. Bill Chambers, Matei Zaharia, "Spark: The Definitive Guide", O'Reilly Media, 2018
DADE204

Big Data Processing Lab

📚 2 Credits 📅 Semester 2 📋 Prerequisite: DADE203

Course Objectives

  1. To implement Spark applications for big data processing
  2. To work with data lakes
  3. To process large-scale datasets
  4. To optimize Spark performance

📖 Textbooks

  1. Jules Damji, et al., "Learning Spark", 2nd Edition, O'Reilly Media, 2020
DADE205

Cloud Data Platforms & Data Warehousing

📚 3 Credits 📅 Semester 2 📋 Prerequisite: DADE201

Course Objectives

  1. To understand cloud data platforms and services
  2. To master Google BigQuery for data warehousing
  3. To learn Snowflake data warehouse concepts
  4. To design data warehouse schemas
  5. To implement cloud-based data solutions
I

CLOUD DATA PLATFORMS OVERVIEW

Introduction to Cloud Data Platforms – Cloud Computing for Data – Major Cloud Providers: GCP, AWS, Azure – Cloud Data Services Overview – Benefits of Cloud Data Platforms. Google Cloud Platform Data Services – GCP Data Services Overview – Cloud Storage – Cloud SQL – Cloud Spanner – BigQuery Introduction – Cloud Dataflow – Pub/Sub. Data Platform Architecture – Cloud Data Architecture Patterns – Data Ingestion – Data Storage – Data Processing – Data Analytics – Cost Optimization.

📝 Illustrative Problems

Compare cloud data platforms; Design cloud data architecture; Choose appropriate cloud services; Plan data platform migration; Optimize cloud costs.

II

GOOGLE BIGQUERY FUNDAMENTALS

BigQuery Introduction – What is BigQuery? – BigQuery Architecture – Serverless Data Warehouse – BigQuery Console – BigQuery Pricing Model. BigQuery Basics – Creating Datasets and Tables – Loading Data into BigQuery – Querying Data: Standard SQL – BigQuery SQL Syntax – Query Results and Export. Advanced BigQuery – Partitioning Tables – Clustering Tables – BigQuery ML (Introduction) – User-Defined Functions – BigQuery BI Engine.

📝 Illustrative Problems

Create BigQuery datasets; Load data to BigQuery; Write complex queries; Optimize table structure; Use BigQuery ML.

III

BIGQUERY FOR DATA ENGINEERING

BigQuery Data Types – Supported Data Types – Nested and Repeated Data – JSON Data Handling – Array Operations – Struct Operations. BigQuery Performance Optimization – Query Optimization Techniques – Partitioning Strategies – Clustering Strategies – Query Caching – Best Practices. BigQuery Integration – BigQuery with Python – BigQuery API – Loading Data Programmatically – Streaming Inserts – BigQuery with Airflow.

📝 Illustrative Problems

Optimize BigQuery queries; Implement partitioning; Use BigQuery API; Integrate with Python; Load data from pipelines.
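
💻 Example Sketch

A hedged sketch of the BigQuery Python client from Unit III; it assumes Google Cloud credentials are configured, and the project/dataset/table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-demo-project")   # hypothetical project id

sql = """
    SELECT region, SUM(amount) AS total
    FROM `my-demo-project.analytics.orders`           -- hypothetical table
    GROUP BY region
    ORDER BY total DESC
"""
for row in client.query(sql).result():   # submit the job and wait for rows
    print(row["region"], row["total"])
```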

IV

SNOWFLAKE DATA WAREHOUSE

Snowflake Introduction – What is Snowflake? – Snowflake Architecture – Virtual Warehouses – Snowflake Pricing – Snowflake vs BigQuery. Snowflake Basics – Creating Accounts and Databases – Creating Tables – Loading Data – Querying Data – Snowflake SQL Syntax. Snowflake Advanced Features – Clustering Keys – Time Travel – Zero-Copy Cloning – Data Sharing – Snowflake Streams and Tasks.

📝 Illustrative Problems

Set up Snowflake account; Create tables and load data; Write queries; Use advanced features; Optimize performance.

V

DATA WAREHOUSING CONCEPTS

Data Warehouse Fundamentals – What is a Data Warehouse? – OLTP vs OLAP – Data Warehouse Architecture – ETL for Data Warehousing – Data Warehouse vs Data Lake. Dimensional Modeling – Star Schema – Snowflake Schema – Fact Tables – Dimension Tables – Slowly Changing Dimensions (SCDs). Data Warehouse Design – Designing Fact Tables – Designing Dimension Tables – Choosing Grain – Handling Historical Data – Best Practices.

📝 Illustrative Problems

Design star schema; Create fact and dimension tables; Handle slowly changing dimensions; Design data warehouse; Implement dimensional model.

📖 Textbooks

  1. Valliappa Lakshmanan, Jordan Tigani, "Google BigQuery: The Definitive Guide", O'Reilly Media, 2019
  2. Joyce Kay Avila, "Snowflake: The Definitive Guide", O'Reilly Media, 2022
  3. Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit", 3rd Edition, Wiley, 2013
DADE206

Capstone Project 2

📚 6 Credits 📅 Semester 2 📋 Prerequisite: All other Semester 2 courses (DADE201–DADE205)

Course Objectives

  1. To integrate all Semester 2 concepts in a comprehensive data engineering project
  2. To build production-ready data pipelines and data infrastructure
  3. To implement ETL pipelines, data lakes, and cloud data warehouses
  4. To work with industry mentors on real-world data challenges

📖 Textbooks

  1. Bas Harenslak, Julian Rutger de Ruiter, "Data Pipelines with Apache Airflow", Manning Publications, 2021
  2. Valliappa Lakshmanan, Jordan Tigani, "Google BigQuery: The Definitive Guide", O'Reilly Media, 2019