Data Warehousing Interview Questions

Shobha Bhagwat
Dec 31, 2021 · 8 min read

DWH topics to prepare for Data Engineering Interviews

(Image source: https://www.researchgate.net/figure/Data-warehouse-architecture_fig1_275068752)

I worked as a data engineer for almost 8 years and, during that span, attended interviews where I was asked about many data warehousing (DWH) concepts. Hence I thought I would jot down some of the DWH concepts that are important from a data engineering (DE) interview perspective, based on my past interview experience. To aid in DE interview prep, I have listed a few high-difficulty SQL questions in my previous post.

A. Data Warehousing definition and characteristics —

A DWH is a copy of transactional data specifically structured for query and analysis (Kimball's definition). Another definition, from Inmon, is that a DWH is a subject-oriented, integrated, time-variant and non-volatile collection of data to support management's decision-making process.

  1. Subject-oriented — A DWH can be used to analyze a particular subject area (E.g., sales)
  2. Integrated — DWH integrates data from different sources
  3. Time variant — Historical data is stored in the DWH, unlike transactional systems, where only the most recent data is kept.
  4. Non-volatile — Once data is in the DWH, it doesn't change

A DWH is useful for reporting, analysis, data mining, exploration of historical data, etc. A datamart is a subset of a data warehouse focused on a particular line of business, department, or subject area.

B. Approaches for implementing DWH —

  1. Top-down approach — In this type of implementation, data from various sources is loaded into the DWH. After that, subsets of data are extracted and loaded into the respective datamarts. The advantage of this approach is that the DWH is the single source of truth for all datamarts, and thus it is easier to enforce consistency & standardization. It is also easy to build new datamarts quickly from the DWH for specific use cases. The downside is that it may cost more and take more time to implement than the other approaches, since the first step involves setting up and building the entire DWH.
  2. Bottom-up approach — In this approach, data from various sources is first loaded into specific datamarts. Data from the various datamarts is then combined using conformed dimensions (i.e., dimensions which are shared by many facts in the different datamarts). The advantage is speed of delivery, since the DWH can be quickly integrated once the datamarts are built. The disadvantage is that all datamarts must use standard dimensions and facts to ensure integration and deliver a single version of the truth.
  3. Hybrid approach — This approach is a blend of the top-down & bottom-up approaches and tries to capture the advantages of both. The DWH is initially designed in third normal form, and the first few important datamarts are designed and loaded with data. Once these datamarts are loaded and the data is approved by business, the DWH is backfilled from them, thus reducing implementation time.
  4. Federated approach — This approach consists of integrating various DWHs and datamarts to immediately meet ever-changing business needs, without any specific emphasis on a standard structure.

C. Types of DWH —

  1. Global warehouse (Enterprise DWH) — The DWH is centralized and designed for the entire enterprise. This type of DWH is time-consuming and costly to implement.
  2. Independent datamart architecture — Stand-alone datamarts that are located on one infrastructure (database) but are controlled by different teams/work groups and built solely for their needs.
  3. Inter-connected DWH-datamart architecture — Datamarts are interconnected to provide an enterprise-wide view.

D. Facts —

Facts are quantitative measures or metrics of the business (e.g., sales, revenue). The different types of facts stored in a DWH are —

  1. Additive — facts that can be summed up across all the dimensions in the fact table (e.g., a sales fact)
  2. Semi-additive — facts that can be summed up across some of the dimensions in the fact table but not others (e.g., a daily balances fact can be summed up across the customer dimension but not across the time dimension)
  3. Non-additive — facts that can't be summed up across any of the dimensions present in the fact table (e.g., facts stored as ratios or percentages)
  4. Factless fact table — a fact table that contains no measures or facts but can be used to get aggregated counts (e.g., a fact table containing only sale_date and product_key doesn't contain any measure per se but can be used to count products sold over various periods of time)

Fact tables that contain aggregated facts are called summary tables.
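
To make the additivity distinction concrete, here is a minimal SQL sketch. The table and column names (sales_fact, daily_balance_fact, product_sale_fact, etc.) are hypothetical, not from any specific system:

```sql
-- Additive fact: sales_amount can be summed across any dimension.
SELECT product_key, SUM(sales_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_key;

-- Semi-additive fact: balances can be summed across customers for a
-- given day, but summing one customer's balance across days is
-- meaningless (the same money would be counted on every day).
SELECT balance_date, SUM(balance_amount) AS total_balance
FROM   daily_balance_fact
GROUP  BY balance_date;

-- Factless fact table: no measure columns at all, yet counting rows
-- still answers "how many products were sold per day?".
SELECT sale_date, COUNT(*) AS products_sold
FROM   product_sale_fact   -- columns: sale_date, product_key only
GROUP  BY sale_date;
```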

E. Dimensions —

Dimensions are descriptive attributes that give context to facts. The various types of dimensions are —

  1. Conformed dimension — a dimension that means the same thing with every fact across the DWH (e.g., customer_id, load_date)
  2. Junk dimension — a collection of low-cardinality attributes (flags, indicators) that are unrelated to any particular fact, combined into a single dimension table (e.g., a junk dimension holding all possible combinations of age band & marital status, with a code for each combination; the fact table then needs only one code column instead of two attribute columns)
  3. Degenerate dimension — a dimension that lives in the fact table itself and doesn't have its own dimension table (e.g., an invoice number or ticket number column)
  4. Role-playing dimension — a dimension that is used for multiple purposes within the same database (e.g., a date dimension can serve as both trade date and settlement date)
  5. Inferred dimension — while loading a fact record, the corresponding dimension record may not be ready yet. One solution is to insert a placeholder dimension row with a surrogate key and nulls for all other attributes, updating it later once the real data arrives. This is called an inferred dimension.
  6. Static dimension — a dimension that is not extracted from the original data source but is created within the context of the DWH and can be loaded manually (e.g., status codes)
  7. Shrunk dimension — a subset of a dimension's attributes that applies to a higher level of summary (e.g., a month dimension would be a shrunk dimension of the date dimension)
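
As an illustration of how a junk dimension collapses several columns into one key, here is a small hypothetical sketch (demographics_junk_dim and all column names are made up):

```sql
-- Hypothetical junk dimension: every combination of two low-cardinality
-- attributes gets one surrogate key.
CREATE TABLE demographics_junk_dim (
    demographics_key INT PRIMARY KEY,
    age_band         VARCHAR(10),  -- e.g. '18-25', '26-35'
    marital_status   VARCHAR(10)   -- e.g. 'single', 'married'
);

-- The fact table then carries a single key instead of two attribute
-- columns.
CREATE TABLE sales_fact (
    sale_date        DATE,
    product_key      INT,
    demographics_key INT REFERENCES demographics_junk_dim (demographics_key),
    sales_amount     DECIMAL(12, 2)
);
```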

Slowly changing dimensions (SCD) store both current and historical values for an attribute. Since the dimensions don’t change frequently (as compared to facts), SCD implementation deals with how history is maintained for dimensions (e.g. change in marital status). The various types of SCD are —

  1. Type I — No history is maintained; new information overwrites existing information.
  2. Type II — A new record is inserted for every changed dimension value. One way to implement Type II is with start and end dates indicating when each value was current; another way is with a flag that marks the latest record.
  3. Type III — The original record is modified to include a new column which contains the previous value, so only limited history (the current and previous value) is maintained.
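
Here is a minimal SQL sketch of a Type II implementation, using the start/end-date and current-flag techniques together. The table and column names (customer_dim, customer_sk, etc.) are hypothetical:

```sql
-- Hypothetical Type II customer dimension: every change inserts a new
-- row; old rows are kept for history.
CREATE TABLE customer_dim (
    customer_sk    INT PRIMARY KEY,  -- surrogate key, one per version
    customer_id    INT,              -- natural/business key
    marital_status VARCHAR(10),
    start_date     DATE,             -- when this version became current
    end_date       DATE,             -- '9999-12-31' while still current
    is_current     CHAR(1)           -- 'Y' on the latest version only
);

-- Applying a change: close out the old version, then insert the new one.
UPDATE customer_dim
SET    end_date = DATE '2021-12-30', is_current = 'N'
WHERE  customer_id = 42 AND is_current = 'Y';

INSERT INTO customer_dim
VALUES (1001, 42, 'married', DATE '2021-12-31', DATE '9999-12-31', 'Y');
```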

F. Normalization —

Database normalization is the process of structuring a database in accordance with a series of rules in order to reduce data redundancy and improve data integrity. The various normal forms are —

  1. First normal form (1NF) — Eliminate repeating groups. Make a separate table for each set of related attributes and give each table a primary key (PK). Each field contains at most one value.
(Figures: sample data before and after 1NF normalization.)
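
Since the original figures aren't reproduced here, a hypothetical sketch of the same idea (table and column names are made up):

```sql
-- Before 1NF (flattened): customer(customer_id, name, phones)
-- where phones holds a repeating group like '555-1234, 555-5678'.

-- After 1NF: each field holds one value; the repeating group moves
-- to its own table keyed by the parent's primary key.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(50)
);

CREATE TABLE customer_phone (
    customer_id INT REFERENCES customer (customer_id),
    phone       VARCHAR(15),
    PRIMARY KEY (customer_id, phone)
);
```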

2. Second normal form (2NF) — Eliminate redundant data. If an attribute depends on only part of a composite key, move it to a separate table.

(Figures: the original table and the two tables created after 2NF normalization.)
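
Again a hypothetical sketch, since the figures aren't reproduced (the order_line / product tables are made up):

```sql
-- Before 2NF: order_line(order_id, product_id, product_name, quantity)
-- where product_name depends only on product_id, i.e. on part of the
-- composite key (order_id, product_id).

-- After 2NF: the partially dependent attribute moves to its own table.
CREATE TABLE product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(50)
);

CREATE TABLE order_line (
    order_id   INT,
    product_id INT REFERENCES product (product_id),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);
```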

3. Third normal form (3NF) — Eliminate columns that are not dependent on the primary key (i.e., remove transitive dependencies).

(Figures: the original table structure and the structure after 3NF normalization.)

Transitive functional dependency — when changing a non-key column might cause another non-key column to change (e.g., changing the name in a table might change the salutation column's value).
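
A hypothetical sketch of removing a transitive dependency (the employee / department tables are made up):

```sql
-- Before 3NF: employee(employee_id, name, dept_id, dept_name)
-- where dept_name depends on dept_id, a non-key column (a transitive
-- dependency on the primary key employee_id).

-- After 3NF: every non-key column depends only on the primary key.
CREATE TABLE department (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50)
);

CREATE TABLE employee (
    employee_id INT PRIMARY KEY,
    name        VARCHAR(50),
    dept_id     INT REFERENCES department (dept_id)
);
```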

G. DWH Modelling Schemas —

Schemas are logical representations of how facts and dimensions are linked in the DWH.

  1. Star schema — It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central fact table. The fact table is connected to multiple dimension tables through primary key-foreign key relationships, and the dimension tables are not connected to each other. The dimension tables are de-normalized, so fewer joins are needed; this architecture is well suited for metrics/facts-heavy analysis.
  2. Snowflake schema — This is an extension of the star schema where the dimension tables are normalized into multiple tables. Thus there are relationships between the dimension tables, and data retrieval is slower because more joins are required. This structure is better suited for dimension-heavy analysis.
  3. Galaxy schema — This is a combination of the star and snowflake schemas, in which multiple fact tables share some common (conformed) dimensions.
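
A minimal star schema sketch in SQL (all table and column names are hypothetical):

```sql
-- Dimension tables: de-normalized, one table per dimension.
CREATE TABLE date_dim (
    date_key   INT PRIMARY KEY,
    full_date  DATE,
    month_name VARCHAR(10),
    year_num   INT
);

CREATE TABLE product_dim (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(30)
);

-- Central fact table: a foreign key to each dimension plus measures.
CREATE TABLE sales_fact (
    date_key     INT REFERENCES date_dim (date_key),
    product_key  INT REFERENCES product_dim (product_key),
    sales_amount DECIMAL(12, 2),
    quantity     INT
);

-- A typical star-join query: aggregate the facts, slice by dimensions.
SELECT d.year_num, p.category, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim    d ON f.date_key = d.date_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP  BY d.year_num, p.category;
```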

H. OLAP vs OLTP systems —

An online transaction processing (OLTP) system processes and stores transaction data in real time. The tables here don't maintain history, are smaller in size, and are highly normalized for quick insert, update and delete operations. These are the backend tables of online applications.

Online analytical processing (OLAP) systems process data in batches. They store history and are used to run complex queries for reporting and analysis; the tables here are de-normalized to support such complex querying.
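
A quick side-by-side of the two workloads (the tables are hypothetical):

```sql
-- OLTP: short, targeted, real-time DML against normalized tables.
UPDATE account
SET    balance = balance - 100.00
WHERE  account_id = 42;

-- OLAP: wide scans and aggregations over historical, de-normalized data.
SELECT d.year_num, c.region, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim     d ON f.date_key = d.date_key
JOIN   customer_dim c ON f.customer_key = c.customer_key
GROUP  BY d.year_num, c.region
ORDER  BY total_sales DESC;
```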

Another common interview question is designing a datamart/DWH for a specific use case. In a FAANG interview, I was once asked to design the data model for a redBus-style bus seat booking system. How to approach these types of questions is detailed in a Medium article I found online.

Some of the other questions that I have encountered are —

  1. how to load incremental data (maintaining time-based watermarks, checking source data changes and calculating the incremental delta for loading, etc.; see the watermark sketch after this list),
  2. how to load large files into a DWH through a batch system (common data cleaning steps, checking data integrity, null value handling, date formatting, staging data and then moving it to the master table based on a specific DML strategy, and maintaining PK-FK relationships while loading data),
  3. various ETL strategies (for a source file placed on an FTP server, sourcing data from transactional tables without locking, duplicate data handling, etc.),
  4. various data archival and staging strategies for different use cases,
  5. how to handle DDL operations while maintaining model integrity.
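
For the first item, here is a minimal watermark-based incremental load sketch. The tables (etl_watermark, source_orders, dwh_orders) and the updated_at column are hypothetical, and the watermark row is assumed to be seeded by the initial load:

```sql
-- Hypothetical watermark table: remembers the highest source timestamp
-- loaded so far, per target table.
CREATE TABLE etl_watermark (
    table_name     VARCHAR(50) PRIMARY KEY,
    last_loaded_at TIMESTAMP
);

-- Pull only the rows modified since the last successful load.
INSERT INTO dwh_orders
SELECT s.*
FROM   source_orders s
WHERE  s.updated_at > (SELECT last_loaded_at
                       FROM   etl_watermark
                       WHERE  table_name = 'dwh_orders');

-- Advance the watermark only after the load succeeds.
UPDATE etl_watermark
SET    last_loaded_at = (SELECT MAX(updated_at) FROM source_orders)
WHERE  table_name = 'dwh_orders';
```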

There are many more questions and topics under data warehousing and data modelling. I have covered only the ones I have been asked over the years. Hope this is helpful!


Shobha Bhagwat

Analytics Manager @ Gojek || Data Science & Analytics/ Data Engineering/ BI || Product Enthusiast || https://www.linkedin.com/in/shobha-bhagwat-6a463357/