Blog Post

Database Design Principles: Building Scalable Data Structures

Designing efficient, maintainable databases from the ground up

Poor database design leads to performance bottlenecks, data inconsistencies, and maintenance nightmares. Well-designed databases scale gracefully, maintain data integrity, and make development easier. Let's explore the principles that create robust, efficient data structures.

Start with Requirements Analysis

Before writing a single CREATE TABLE statement, understand your data:

  • What entities does your system manage? (users, products, orders)
  • What attributes describe each entity?
  • What relationships exist between entities?
  • What queries will be most common?
  • What are your scalability requirements?

Document these requirements before designing schemas. Understanding access patterns influences design decisions significantly.

Normalization: Organizing Data Efficiently

Normalization eliminates redundancy and prevents data anomalies. The most important normal forms:

First Normal Form (1NF):

  • Each column contains atomic (indivisible) values
  • Each column contains values of a single type
  • Each column has a unique name
  • Order doesn't matter
-- Bad: Multiple values in one column
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  phone_numbers VARCHAR(255)  -- "555-1234, 555-5678"
);

-- Good: Atomic values
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE phone_numbers (
  id INT PRIMARY KEY,
  user_id INT,
  phone VARCHAR(20),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Second Normal Form (2NF):

Must be in 1NF, plus: Non-key attributes must depend on the entire primary key (relevant for composite keys).

-- Bad: Partial dependency
CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  product_name VARCHAR(100),  -- Depends only on product_id
  quantity INT,
  PRIMARY KEY (order_id, product_id)
);

-- Good: Full dependency
CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  quantity INT,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (product_id) REFERENCES products(id)
);

CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

Third Normal Form (3NF):

Must be in 2NF, plus: No transitive dependencies (non-key attributes shouldn't depend on other non-key attributes).

-- Bad: Transitive dependency
CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT,
  department_name VARCHAR(100),  -- Depends on department_id
  department_location VARCHAR(100)  -- Depends on department_id
);

-- Good: No transitive dependencies
CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT,
  FOREIGN KEY (department_id) REFERENCES departments(id)
);

CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  location VARCHAR(100)
);

When to Denormalize

Normalization isn't always the answer. Denormalize strategically for performance:

  • Frequently joined data that rarely changes
  • Aggregate values accessed often (cached totals, counts)
  • Read-heavy applications where read speed trumps write complexity
-- Example: Denormalized order totals
CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT,
  total_amount DECIMAL(10,2),  -- Calculated but stored
  item_count INT,  -- Cached count
  created_at TIMESTAMP
);

Document denormalization decisions and maintain consistency through triggers or application logic.

Primary Keys: Choosing Wisely

Primary key decisions have long-term implications:

Auto-incrementing integers:

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) UNIQUE NOT NULL
);

Pros: Simple, efficient, small storage, good for joins
Cons: Sequential (predictable), not globally unique

UUIDs:

CREATE TABLE users (
  id CHAR(36) PRIMARY KEY,  -- or BINARY(16)
  email VARCHAR(255) UNIQUE NOT NULL
);

Pros: Globally unique, non-sequential, good for distributed systems
Cons: Larger storage, slower joins, index fragmentation

Natural keys:

CREATE TABLE countries (
  country_code CHAR(2) PRIMARY KEY,  -- 'US', 'CA', 'UK'
  name VARCHAR(100)
);

Use when: Natural, stable, unique identifiers exist
Avoid when: Values might change or aren't truly unique

Indexing Strategies

Indexes dramatically improve query performance but slow writes. Index strategically:

Always index:

  • Primary keys (automatic)
  • Foreign keys
  • Columns in WHERE clauses
  • Columns used for sorting (ORDER BY)
  • Columns in JOIN conditions
-- Single column index
CREATE INDEX idx_email ON users(email);

-- Composite index (order matters!)
CREATE INDEX idx_user_status ON orders(user_id, status);

-- Covering index includes extra columns
CREATE INDEX idx_user_orders 
  ON orders(user_id, created_at) 
  INCLUDE (total_amount);

-- Partial index (PostgreSQL)
CREATE INDEX idx_active_users 
  ON users(email) 
  WHERE is_active = true;

Index order in composite indexes:

  • Most selective column first (highest cardinality)
  • Equality conditions before range conditions
  • Consider query patterns

Data Types: Choose Appropriately

Correct data types save space and improve performance:

-- Use smallest adequate type
user_age TINYINT UNSIGNED  -- 0-255, not INT
is_active BOOLEAN  -- Not VARCHAR
price DECIMAL(10,2)  -- Not FLOAT for money
created_at TIMESTAMP  -- Not VARCHAR

-- VARCHAR vs TEXT
short_description VARCHAR(500)  -- Known max length
long_content TEXT  -- No practical limit

-- ENUM for fixed sets
status ENUM('pending', 'processing', 'completed', 'cancelled')

Relationships: Modeling Connections

One-to-Many: Most common relationship type

CREATE TABLE authors (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE books (
  id INT PRIMARY KEY,
  title VARCHAR(200),
  author_id INT,
  FOREIGN KEY (author_id) REFERENCES authors(id)
);

Many-to-Many: Requires junction table

CREATE TABLE students (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE courses (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

-- Junction table
CREATE TABLE enrollments (
  student_id INT,
  course_id INT,
  enrolled_date DATE,
  PRIMARY KEY (student_id, course_id),
  FOREIGN KEY (student_id) REFERENCES students(id),
  FOREIGN KEY (course_id) REFERENCES courses(id)
);

One-to-One: Less common, consider if separate table is necessary

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255)
);

-- Separate table for infrequently accessed data
CREATE TABLE user_profiles (
  user_id INT PRIMARY KEY,
  bio TEXT,
  avatar_url VARCHAR(500),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Constraints: Enforcing Data Integrity

Database-level constraints prevent invalid data:

CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(200) NOT NULL,
  price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
  stock INT NOT NULL DEFAULT 0 CHECK (stock >= 0),
  category_id INT NOT NULL,
  sku VARCHAR(50) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (category_id) REFERENCES categories(id)
    ON DELETE RESTRICT  -- Prevent deletion if products exist
    ON UPDATE CASCADE   -- Update references if category ID changes
);

Common constraints:

  • NOT NULL: Requires value
  • UNIQUE: No duplicates allowed
  • CHECK: Custom validation rules
  • DEFAULT: Automatic value if not provided
  • FOREIGN KEY: Maintains referential integrity

Soft Deletes vs Hard Deletes

Consider soft deletes for important data:

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255),
  deleted_at TIMESTAMP NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Queries exclude soft-deleted records
SELECT * FROM users WHERE deleted_at IS NULL;

-- Create view for active users
CREATE VIEW active_users AS
  SELECT * FROM users WHERE deleted_at IS NULL;

Benefits: Recovery possible, audit trail, referential integrity maintained
Drawbacks: Queries more complex, larger database size

Timestamps and Audit Fields

Track changes with standard audit columns:

CREATE TABLE base_table (
  id INT PRIMARY KEY,
  -- ... other columns ...
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP 
    ON UPDATE CURRENT_TIMESTAMP,
  created_by INT,
  updated_by INT,
  FOREIGN KEY (created_by) REFERENCES users(id),
  FOREIGN KEY (updated_by) REFERENCES users(id)
);

Naming Conventions

Consistent naming improves maintainability:

  • Tables: Plural nouns (users, orders)
  • Columns: Lowercase with underscores (first_name, created_at)
  • Primary keys: id (simple and consistent)
  • Foreign keys: singular_table_id (user_id, product_id)
  • Indexes: idx_table_column (idx_users_email)
  • Constraints: type_table_column (fk_orders_user_id)

Common Design Mistakes

  • No primary key: Every table needs one
  • Using reserved words: Avoid names like order, user, date
  • Over-normalizing: Don't split every attribute into separate tables
  • Missing indexes: Foreign keys and frequent WHERE clauses need indexes
  • Wrong data types: Using VARCHAR for numbers, INT for prices
  • No constraints: Relying solely on application-level validation
  • Ignoring NULL semantics: NULL means "unknown," not zero or empty string

Documentation and Maintenance

Document your schema:

  • Create ER diagrams showing relationships
  • Document business rules and constraints
  • Maintain migration scripts for all changes
  • Use database comments for complex logic
-- Add comments to document purpose
COMMENT ON TABLE users IS 'Application users with authentication credentials';
COMMENT ON COLUMN users.last_login IS 'Timestamp of most recent successful login';

Good database design is about balance—normalize to prevent anomalies, denormalize for performance, index for common queries, and constrain for data integrity. Plan for scale from the start, but don't over-engineer for problems you don't have yet.