DISTINCT Operations

This document explains DISTINCT operations in Stoolap based on the implementation and test files.

Overview

Based on test files (/test/distinct_test.go), Stoolap supports the DISTINCT keyword to eliminate duplicate rows from query results. This feature is essential for retrieving unique values or combinations of values from tables.

Syntax

Stoolap supports these DISTINCT syntax patterns:

-- Basic DISTINCT on single column
SELECT DISTINCT column FROM table;

-- DISTINCT on multiple columns
SELECT DISTINCT column1, column2, ... FROM table;

-- DISTINCT with ORDER BY
SELECT DISTINCT column FROM table ORDER BY column;

-- COUNT with DISTINCT
SELECT COUNT(DISTINCT column) FROM table;

Examples from Test Files

Basic DISTINCT on Single Column

-- Create a test table
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT,
    price FLOAT
);

-- Insert sample data with duplicate categories
INSERT INTO products (id, name, category, price) VALUES
(1, 'Laptop', 'Electronics', 999.99),
(2, 'Smartphone', 'Electronics', 499.99),
(3, 'Headphones', 'Electronics', 99.99),
(4, 'T-shirt', 'Clothing', 19.99),
(5, 'Jeans', 'Clothing', 49.99),
(6, 'Novel', 'Books', 14.99),
(7, 'Textbook', 'Books', 79.99);

-- Select distinct categories
SELECT DISTINCT category FROM products;

-- Result: 3 rows
-- Electronics
-- Clothing
-- Books

DISTINCT on Multiple Columns

-- Create a table with region information
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    product TEXT,
    category TEXT,
    region TEXT,
    amount FLOAT
);

-- Insert data with duplicate combinations
INSERT INTO sales (id, product, category, region, amount) VALUES
(1, 'Laptop', 'Electronics', 'North', 999.99),
(2, 'Smartphone', 'Electronics', 'South', 499.99),
(3, 'Headphones', 'Electronics', 'North', 99.99),
(4, 'T-shirt', 'Clothing', 'East', 19.99),
(5, 'Jeans', 'Clothing', 'West', 49.99),
(6, 'T-shirt', 'Clothing', 'North', 19.99),
(7, 'Novel', 'Books', 'South', 14.99);

-- Select distinct category and region combinations
SELECT DISTINCT category, region FROM sales;

-- Result: 6 rows (unique combinations)
-- Electronics, North
-- Electronics, South
-- Clothing, East
-- Clothing, West
-- Clothing, North
-- Books, South

DISTINCT with ORDER BY

-- Select distinct regions ordered alphabetically
SELECT DISTINCT region FROM sales ORDER BY region;

-- Result: 4 rows in alphabetical order
-- East
-- North
-- South
-- West

COUNT with DISTINCT

-- Count distinct categories
SELECT COUNT(DISTINCT category) FROM sales;

-- Result: 3 (Electronics, Clothing, Books)

How DISTINCT Works in Stoolap

Stoolap implements DISTINCT operations through the following mechanism:

The query executor detects the DISTINCT keyword during SQL parsing
After retrieving the base result set, but before applying ORDER BY or LIMIT, the result is filtered for uniqueness
A map tracks unique rows to eliminate duplicates
For multiple columns, the uniqueness is based on the combination of all column values

DISTINCT with NULL Values

NULL values are considered distinct values in DISTINCT operations:

-- Create a table with NULL values
CREATE TABLE null_test (
    id INTEGER PRIMARY KEY,
    value TEXT
);

-- Insert data with NULLs
INSERT INTO null_test (id, value) VALUES
(1, 'A'),
(2, 'B'),
(3, NULL),
(4, 'A'),
(5, NULL);

-- Select distinct values
SELECT DISTINCT value FROM null_test;

-- Result: 3 rows
-- A
-- B
-- NULL

Performance Considerations

Based on the implementation:

Memory Usage: DISTINCT operations require memory to track unique rows
Large Result Sets: For very large tables, DISTINCT can be memory-intensive
Column Count: DISTINCT on multiple columns requires more processing than single columns
Data Cardinality: Performance depends on the ratio of unique values to total rows

Best Practices

Be Selective: Only use DISTINCT when you actually need to remove duplicates
Consider Alternatives: For counting unique values, unique indexes may be more efficient
Column Order: For multi-column DISTINCT, put highest-cardinality columns first if you’re also using ORDER BY
Index Usage: Properly indexed columns can improve DISTINCT operations

Implementation Details

Internally, Stoolap:

Creates a map to track unique rows
Generates a unique key for each row based on its values
Only passes through rows that haven’t been seen before
Optimizes COUNT(DISTINCT) operations with specialized processing

These optimizations ensure that DISTINCT operations are both correct and reasonably efficient for most use cases.