Database Design Principles: Building Scalable Data Structures

Blog Post

Database Design Principles: Building Scalable Data Structures

Designing efficient, maintainable databases from the ground up

Poor database design leads to performance bottlenecks, data inconsistencies, and maintenance nightmares. Well-designed databases scale gracefully, maintain data integrity, and make development easier. Let's explore the principles that create robust, efficient data structures.

Start with Requirements Analysis

Before writing a single CREATE TABLE statement, understand your data:

What entities does your system manage? (users, products, orders)
What attributes describe each entity?
What relationships exist between entities?
What queries will be most common?
What are your scalability requirements?

Document these requirements before designing schemas. Understanding access patterns influences design decisions significantly.

Normalization: Organizing Data Efficiently

Normalization eliminates redundancy and prevents data anomalies. The most important normal forms:

First Normal Form (1NF):

Each column contains atomic (indivisible) values
Each column contains values of a single type
Each column has a unique name
Order doesn't matter

-- Bad: Multiple values in one column
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  phone_numbers VARCHAR(255)  -- "555-1234, 555-5678"
);

-- Good: Atomic values
CREATE TABLE users (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE phone_numbers (
  id INT PRIMARY KEY,
  user_id INT,
  phone VARCHAR(20),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Second Normal Form (2NF):

Must be in 1NF, plus: Non-key attributes must depend on the entire primary key (relevant for composite keys).

-- Bad: Partial dependency
CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  product_name VARCHAR(100),  -- Depends only on product_id
  quantity INT,
  PRIMARY KEY (order_id, product_id)
);

-- Good: Full dependency
CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  quantity INT,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (product_id) REFERENCES products(id)
);

CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

Third Normal Form (3NF):

Must be in 2NF, plus: No transitive dependencies (non-key attributes shouldn't depend on other non-key attributes).

-- Bad: Transitive dependency
CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT,
  department_name VARCHAR(100),  -- Depends on department_id
  department_location VARCHAR(100)  -- Depends on department_id
);

-- Good: No transitive dependencies
CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT,
  FOREIGN KEY (department_id) REFERENCES departments(id)
);

CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  location VARCHAR(100)
);

When to Denormalize

Normalization isn't always the answer. Denormalize strategically for performance:

Frequently joined data that rarely changes
Aggregate values accessed often (cached totals, counts)
Read-heavy applications where read speed trumps write complexity

-- Example: Denormalized order totals
CREATE TABLE orders (
  id INT PRIMARY KEY,
  user_id INT,
  total_amount DECIMAL(10,2),  -- Calculated but stored
  item_count INT,  -- Cached count
  created_at TIMESTAMP
);

Document denormalization decisions and maintain consistency through triggers or application logic.

Primary Keys: Choosing Wisely

Primary key decisions have long-term implications:

Auto-incrementing integers:

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) UNIQUE NOT NULL
);

Pros: Simple, efficient, small storage, good for joins
Cons: Sequential (predictable), not globally unique

UUIDs:

CREATE TABLE users (
  id CHAR(36) PRIMARY KEY,  -- or BINARY(16)
  email VARCHAR(255) UNIQUE NOT NULL
);

Pros: Globally unique, non-sequential, good for distributed systems
Cons: Larger storage, slower joins, index fragmentation

Natural keys:

CREATE TABLE countries (
  country_code CHAR(2) PRIMARY KEY,  -- 'US', 'CA', 'UK'
  name VARCHAR(100)
);

Use when: Natural, stable, unique identifiers exist
Avoid when: Values might change or aren't truly unique

Indexing Strategies

Indexes dramatically improve query performance but slow writes. Index strategically:

Always index:

Primary keys (automatic)
Foreign keys
Columns in WHERE clauses
Columns used for sorting (ORDER BY)
Columns in JOIN conditions

-- Single column index
CREATE INDEX idx_email ON users(email);

-- Composite index (order matters!)
CREATE INDEX idx_user_status ON orders(user_id, status);

-- Covering index includes extra columns
CREATE INDEX idx_user_orders 
  ON orders(user_id, created_at) 
  INCLUDE (total_amount);

-- Partial index (PostgreSQL)
CREATE INDEX idx_active_users 
  ON users(email) 
  WHERE is_active = true;

Index order in composite indexes:

Most selective column first (highest cardinality)
Equality conditions before range conditions
Consider query patterns

Data Types: Choose Appropriately

Correct data types save space and improve performance:

-- Use smallest adequate type
user_age TINYINT UNSIGNED  -- 0-255, not INT
is_active BOOLEAN  -- Not VARCHAR
price DECIMAL(10,2)  -- Not FLOAT for money
created_at TIMESTAMP  -- Not VARCHAR

-- VARCHAR vs TEXT
short_description VARCHAR(500)  -- Known max length
long_content TEXT  -- No practical limit

-- ENUM for fixed sets
status ENUM('pending', 'processing', 'completed', 'cancelled')

Relationships: Modeling Connections

One-to-Many: Most common relationship type

CREATE TABLE authors (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE books (
  id INT PRIMARY KEY,
  title VARCHAR(200),
  author_id INT,
  FOREIGN KEY (author_id) REFERENCES authors(id)
);

Many-to-Many: Requires junction table

CREATE TABLE students (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE courses (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

-- Junction table
CREATE TABLE enrollments (
  student_id INT,
  course_id INT,
  enrolled_date DATE,
  PRIMARY KEY (student_id, course_id),
  FOREIGN KEY (student_id) REFERENCES students(id),
  FOREIGN KEY (course_id) REFERENCES courses(id)
);

One-to-One: Less common, consider if separate table is necessary

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255)
);

-- Separate table for infrequently accessed data
CREATE TABLE user_profiles (
  user_id INT PRIMARY KEY,
  bio TEXT,
  avatar_url VARCHAR(500),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

Constraints: Enforcing Data Integrity

Database-level constraints prevent invalid data:

CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(200) NOT NULL,
  price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
  stock INT NOT NULL DEFAULT 0 CHECK (stock >= 0),
  category_id INT NOT NULL,
  sku VARCHAR(50) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (category_id) REFERENCES categories(id)
    ON DELETE RESTRICT  -- Prevent deletion if products exist
    ON UPDATE CASCADE   -- Update references if category ID changes
);

Common constraints:

NOT NULL: Requires value
UNIQUE: No duplicates allowed
CHECK: Custom validation rules
DEFAULT: Automatic value if not provided
FOREIGN KEY: Maintains referential integrity

Soft Deletes vs Hard Deletes

Consider soft deletes for important data:

CREATE TABLE users (
  id INT PRIMARY KEY,
  email VARCHAR(255),
  deleted_at TIMESTAMP NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Queries exclude soft-deleted records
SELECT * FROM users WHERE deleted_at IS NULL;

-- Create view for active users
CREATE VIEW active_users AS
  SELECT * FROM users WHERE deleted_at IS NULL;

Benefits: Recovery possible, audit trail, referential integrity maintained
Drawbacks: Queries more complex, larger database size

Timestamps and Audit Fields

Track changes with standard audit columns:

CREATE TABLE base_table (
  id INT PRIMARY KEY,
  -- ... other columns ...
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP 
    ON UPDATE CURRENT_TIMESTAMP,
  created_by INT,
  updated_by INT,
  FOREIGN KEY (created_by) REFERENCES users(id),
  FOREIGN KEY (updated_by) REFERENCES users(id)
);

Naming Conventions

Consistent naming improves maintainability:

Tables: Plural nouns (users, orders)
Columns: Lowercase with underscores (first_name, created_at)
Primary keys: id (simple and consistent)
Foreign keys: singular_table_id (user_id, product_id)
Indexes: idx_table_column (idx_users_email)
Constraints: type_table_column (fk_orders_user_id)

Common Design Mistakes

No primary key: Every table needs one
Using reserved words: Avoid names like order, user, date
Over-normalizing: Don't split every attribute into separate tables
Missing indexes: Foreign keys and frequent WHERE clauses need indexes
Wrong data types: Using VARCHAR for numbers, INT for prices
No constraints: Relying solely on application-level validation
Ignoring NULL semantics: NULL means "unknown," not zero or empty string

Documentation and Maintenance

Document your schema:

Create ER diagrams showing relationships
Document business rules and constraints
Maintain migration scripts for all changes
Use database comments for complex logic

-- Add comments to document purpose
COMMENT ON TABLE users IS 'Application users with authentication credentials';
COMMENT ON COLUMN users.last_login IS 'Timestamp of most recent successful login';

Good database design is about balance—normalize to prevent anomalies, denormalize for performance, index for common queries, and constrain for data integrity. Plan for scale from the start, but don't over-engineer for problems you don't have yet.

Back to Home

Elizabeth Mosley