Database Design Principles: Building Scalable Data Structures
Designing efficient, maintainable databases from the ground up
Poor database design leads to performance bottlenecks, data inconsistencies, and maintenance nightmares. Well-designed databases scale gracefully, maintain data integrity, and make development easier. Let's explore the principles that create robust, efficient data structures.
Start with Requirements Analysis
Before writing a single CREATE TABLE statement, understand your data:
- What entities does your system manage? (users, products, orders)
- What attributes describe each entity?
- What relationships exist between entities?
- What queries will be most common?
- What are your scalability requirements?
Document these requirements before designing schemas. Understanding access patterns influences design decisions significantly.
Normalization: Organizing Data Efficiently
Normalization eliminates redundancy and prevents data anomalies. The most important normal forms:
First Normal Form (1NF):
- Each column contains atomic (indivisible) values
- Each column contains values of a single type
- Each column has a unique name
- Order doesn't matter
-- Bad: Multiple values in one column
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100),
phone_numbers VARCHAR(255) -- "555-1234, 555-5678"
);
-- Good: Atomic values
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE phone_numbers (
id INT PRIMARY KEY,
user_id INT,
phone VARCHAR(20),
FOREIGN KEY (user_id) REFERENCES users(id)
);
Second Normal Form (2NF):
Must be in 1NF, plus: Non-key attributes must depend on the entire primary key (relevant for composite keys).
-- Bad: Partial dependency
CREATE TABLE order_items (
order_id INT,
product_id INT,
product_name VARCHAR(100), -- Depends only on product_id
quantity INT,
PRIMARY KEY (order_id, product_id)
);
-- Good: Full dependency
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (product_id) REFERENCES products(id)
);
CREATE TABLE products (
id INT PRIMARY KEY,
name VARCHAR(100)
);
Third Normal Form (3NF):
Must be in 2NF, plus: No transitive dependencies (non-key attributes shouldn't depend on other non-key attributes).
-- Bad: Transitive dependency
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
department_id INT,
department_name VARCHAR(100), -- Depends on department_id
department_location VARCHAR(100) -- Depends on department_id
);
-- Good: No transitive dependencies
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(id)
);
CREATE TABLE departments (
id INT PRIMARY KEY,
name VARCHAR(100),
location VARCHAR(100)
);
When to Denormalize
Normalization isn't always the answer. Denormalize strategically for performance:
- Frequently joined data that rarely changes
- Aggregate values accessed often (cached totals, counts)
- Read-heavy applications where read speed trumps write complexity
-- Example: Denormalized order totals
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT,
total_amount DECIMAL(10,2), -- Calculated but stored
item_count INT, -- Cached count
created_at TIMESTAMP
);
Document denormalization decisions and maintain consistency through triggers or application logic.
Primary Keys: Choosing Wisely
Primary key decisions have long-term implications:
Auto-incrementing integers:
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL
);
Pros: Simple, efficient, small storage, good for joins
Cons: Sequential (predictable), not globally unique
UUIDs:
CREATE TABLE users (
id CHAR(36) PRIMARY KEY, -- or BINARY(16)
email VARCHAR(255) UNIQUE NOT NULL
);
Pros: Globally unique, non-sequential, good for distributed systems
Cons: Larger storage, slower joins, index fragmentation
Natural keys:
CREATE TABLE countries (
country_code CHAR(2) PRIMARY KEY, -- 'US', 'CA', 'UK'
name VARCHAR(100)
);
Use when: Natural, stable, unique identifiers exist
Avoid when: Values might change or aren't truly unique
Indexing Strategies
Indexes dramatically improve query performance but slow writes. Index strategically:
Always index:
- Primary keys (automatic)
- Foreign keys
- Columns in WHERE clauses
- Columns used for sorting (ORDER BY)
- Columns in JOIN conditions
-- Single column index
CREATE INDEX idx_email ON users(email);
-- Composite index (order matters!)
CREATE INDEX idx_user_status ON orders(user_id, status);
-- Covering index includes extra columns
CREATE INDEX idx_user_orders
ON orders(user_id, created_at)
INCLUDE (total_amount);
-- Partial index (PostgreSQL)
CREATE INDEX idx_active_users
ON users(email)
WHERE is_active = true;
Index order in composite indexes:
- Most selective column first (highest cardinality)
- Equality conditions before range conditions
- Consider query patterns
Data Types: Choose Appropriately
Correct data types save space and improve performance:
-- Use smallest adequate type
user_age TINYINT UNSIGNED -- 0-255, not INT
is_active BOOLEAN -- Not VARCHAR
price DECIMAL(10,2) -- Not FLOAT for money
created_at TIMESTAMP -- Not VARCHAR
-- VARCHAR vs TEXT
short_description VARCHAR(500) -- Known max length
long_content TEXT -- No practical limit
-- ENUM for fixed sets
status ENUM('pending', 'processing', 'completed', 'cancelled')
Relationships: Modeling Connections
One-to-Many: Most common relationship type
CREATE TABLE authors (
id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE books (
id INT PRIMARY KEY,
title VARCHAR(200),
author_id INT,
FOREIGN KEY (author_id) REFERENCES authors(id)
);
Many-to-Many: Requires junction table
CREATE TABLE students (
id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE courses (
id INT PRIMARY KEY,
name VARCHAR(100)
);
-- Junction table
CREATE TABLE enrollments (
student_id INT,
course_id INT,
enrolled_date DATE,
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(id),
FOREIGN KEY (course_id) REFERENCES courses(id)
);
One-to-One: Less common, consider if separate table is necessary
CREATE TABLE users (
id INT PRIMARY KEY,
email VARCHAR(255)
);
-- Separate table for infrequently accessed data
CREATE TABLE user_profiles (
user_id INT PRIMARY KEY,
bio TEXT,
avatar_url VARCHAR(500),
FOREIGN KEY (user_id) REFERENCES users(id)
);
Constraints: Enforcing Data Integrity
Database-level constraints prevent invalid data:
CREATE TABLE products (
id INT PRIMARY KEY,
name VARCHAR(200) NOT NULL,
price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
stock INT NOT NULL DEFAULT 0 CHECK (stock >= 0),
category_id INT NOT NULL,
sku VARCHAR(50) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (category_id) REFERENCES categories(id)
ON DELETE RESTRICT -- Prevent deletion if products exist
ON UPDATE CASCADE -- Update references if category ID changes
);
Common constraints:
NOT NULL: Requires valueUNIQUE: No duplicates allowedCHECK: Custom validation rulesDEFAULT: Automatic value if not providedFOREIGN KEY: Maintains referential integrity
Soft Deletes vs Hard Deletes
Consider soft deletes for important data:
CREATE TABLE users (
id INT PRIMARY KEY,
email VARCHAR(255),
deleted_at TIMESTAMP NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Queries exclude soft-deleted records
SELECT * FROM users WHERE deleted_at IS NULL;
-- Create view for active users
CREATE VIEW active_users AS
SELECT * FROM users WHERE deleted_at IS NULL;
Benefits: Recovery possible, audit trail, referential integrity maintained
Drawbacks: Queries more complex, larger database size
Timestamps and Audit Fields
Track changes with standard audit columns:
CREATE TABLE base_table (
id INT PRIMARY KEY,
-- ... other columns ...
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP,
created_by INT,
updated_by INT,
FOREIGN KEY (created_by) REFERENCES users(id),
FOREIGN KEY (updated_by) REFERENCES users(id)
);
Naming Conventions
Consistent naming improves maintainability:
- Tables: Plural nouns (
users,orders) - Columns: Lowercase with underscores (
first_name,created_at) - Primary keys:
id(simple and consistent) - Foreign keys:
singular_table_id(user_id,product_id) - Indexes:
idx_table_column(idx_users_email) - Constraints:
type_table_column(fk_orders_user_id)
Common Design Mistakes
- No primary key: Every table needs one
- Using reserved words: Avoid names like
order,user,date - Over-normalizing: Don't split every attribute into separate tables
- Missing indexes: Foreign keys and frequent WHERE clauses need indexes
- Wrong data types: Using VARCHAR for numbers, INT for prices
- No constraints: Relying solely on application-level validation
- Ignoring NULL semantics: NULL means "unknown," not zero or empty string
Documentation and Maintenance
Document your schema:
- Create ER diagrams showing relationships
- Document business rules and constraints
- Maintain migration scripts for all changes
- Use database comments for complex logic
-- Add comments to document purpose
COMMENT ON TABLE users IS 'Application users with authentication credentials';
COMMENT ON COLUMN users.last_login IS 'Timestamp of most recent successful login';
Good database design is about balance—normalize to prevent anomalies, denormalize for performance, index for common queries, and constrain for data integrity. Plan for scale from the start, but don't over-engineer for problems you don't have yet.