Author(s): Sainath Udata
Originally published on Towards AI.
Writing SQL is an art, but not everyone enjoys the craft. For many people, the syntax of a complex JOIN or a nested subquery is a hindrance to getting the data they need. In our fast-paced world, not everyone has time to become a database administrator, and with the power of large language models (LLMs), you don't have to.
Instead of teaching you how to write queries, I'm going to show you how to build a tool that writes them for you. We'll use a Python-based extractor to feed your database schema to an LLM, which will transform a simple English question into a functional SQL statement.
The Tech Stack
We will use MS SQL Server for this walkthrough, but the logic applies to any relational database (Postgres, MySQL, etc.) as long as you use the correct connector library.
- dotenv: To securely load your environment variables and API keys.
- mssql-python: Our bridge to MS SQL Server.
- openai: To connect to the LLM of your choice (for example, GPT-4).
- sqlparse: To clean and format the SQL generated by the LLM.
- uv: A fast Python package manager to handle our environment.
Setting Up Your Environment
To get started, initialize your project using uv or your favorite package manager.
# pyproject.toml
[project]
name = "simple_sql_query_generator"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "dotenv>=0.9.9",
    "mssql-python>=1.3.0",
    "openai>=2.21.0",
]
Create the pyproject.toml file with the same content as above and run uv sync to install these dependencies and create your virtual environment.
Also create a .env file containing all your keys and environment details:
# .env
DB_HOST=
DB_PORT=
DB_NAME=
DB_USER=
DB_PASSWORD=
OLLAMA_API_KEY=ollama
OLLAMA_BASE_URL="http://localhost:11434/v1"
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GROK_API_KEY=
GROQ_API_KEY=
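For reference, these .env values can be folded into a connection dict like the ones used later in the article. A minimal sketch; the helper name and the defaults are my own, not from the repo:

```python
import os
from typing import Mapping, Optional

# Hypothetical helper: build a DB connection dict from environment-style
# variables. Key names mirror the .env file above; defaults are assumptions.
def db_params_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a connection-parameter dict from env vars (or a given mapping)."""
    env = env if env is not None else os.environ
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": env.get("DB_PORT", "1433"),
        "database": env.get("DB_NAME", "master"),
        "user": env.get("DB_USER", ""),
        "password": env.get("DB_PASSWORD", ""),
    }
```

Passing a mapping explicitly (rather than always reading os.environ) keeps the helper easy to test without touching the real environment.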
Preparing a Playground
No LLM can guess what's in your database. It needs context: table names, columns, and relationships. To demonstrate, let's create a classic customers-and-orders schema.
-- sql_script.sql
-- 1. Create Customers Table
CREATE TABLE customers (
customer_id INT NOT NULL PRIMARY KEY,
name NVARCHAR(255) NULL,
email NVARCHAR(255) NULL,
created_at DATETIME2 NULL DEFAULT GETDATE()
);

-- 2. Create Orders Table
CREATE TABLE orders (
order_id INT NOT NULL PRIMARY KEY,
customer_id INT NULL,
order_date DATE NULL,
total_amount DECIMAL(18, 2) NULL,
-- Define Foreign Key relationship
CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
REFERENCES customers(customer_id)
);
-- 3. Insert Sample Records
INSERT INTO customers (customer_id, name, email, created_at)
VALUES
(1, 'Alice Johnson', 'alice@example.com', '2025-01-15 10:30:00'),
(2, 'Bob Smith', 'bob@example.com', '2025-02-20 14:45:00'),
(3, 'Charlie Brown', 'charlie@example.com', '2025-03-05 09:15:00'),
(4, 'Diana Prince', 'diana@themyscira.com', '2025-07-12 08:00:00'),
(5, 'Edward Nigma', 'riddler@gotham.com', '2025-08-01 11:20:00'),
(6, 'Fiona Gallagher', 'fiona@southside.com', '2025-09-14 16:30:00'),
(7, 'George Costanza', 'art_vandelay@latex.com', '2025-10-10 12:00:00'),
(8, 'Hannah Abbott', 'hannah@hufflepuff.edu', '2025-11-22 09:45:00'),
(9, 'Ian Malcolm', 'chaos@jurassic.org', '2025-12-05 13:10:00'),
(10, 'Julia Child', 'cooking@french.com', '2026-01-10 15:55:00');
INSERT INTO orders (order_id, customer_id, order_date, total_amount)
VALUES
(101, 1, '2025-04-01', 150.50),
(102, 1, '2025-04-15', 89.99),
(103, 2, '2025-05-20', 210.00),
(104, 3, '2025-06-10', 45.00),
(105, 4, '2025-07-20', 500.00),
(106, 5, '2025-08-05', 12.50),
(107, 5, '2025-08-15', 35.75),
(108, 6, '2025-09-20', 99.99),
(109, 1, '2025-10-01', 25.00), -- Customer 1 again
(110, 2, '2025-11-05', 300.25), -- Customer 2 again
(111, 7, '2025-12-01', 15.00),
(112, 4, '2025-12-15', 750.00), -- High value order
(113, 8, '2026-01-05', 62.40),
(114, 10, '2026-01-20', 120.00),
(115, 1, '2026-02-01', 10.00), -- Recent order
(116, 5, '2026-02-10', 55.00);
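If you would rather run this script from Python than from SSMS, here is a hedged sketch of a loader. The split_batches and load_script helpers are illustrative names of my own, and the connection parameters are assumptions that mirror the driver usage later in the article:

```python
# Hypothetical loader for sql_script.sql. Splitting on GO lines is only
# needed if the script uses batch separators; this one does not, but the
# helper keeps the loader general.
def split_batches(sql_text: str) -> list:
    """Split a T-SQL script into batches on lines containing only GO."""
    batches, current = [], []
    for line in sql_text.splitlines():
        if line.strip().upper() == "GO":
            batches.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        batches.append("\n".join(current))
    return [b for b in batches if b.strip()]

def load_script(path: str, conn_params: dict) -> None:
    """Run each batch of a script against the database (needs a live server)."""
    from mssql_python import connect  # same driver used later in the article
    conn = connect(**conn_params)
    cursor = conn.cursor()
    with open(path, encoding="utf-8") as f:
        for batch in split_batches(f.read()):
            cursor.execute(batch)
    conn.commit()
    conn.close()
```

split_batches is pure and testable offline; load_script only touches the database when you call it.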
The "Secret Sauce": Automated Schema Extraction
For an LLM to generate valid SQL, it needs more than just table names; it must understand the "connective tissue" of your database: primary keys, foreign keys, and data types.
Although this is not a SQL class, we use a few standard T-SQL queries to "teach" the LLM our database structure. By querying INFORMATION_SCHEMA, we can build a Markdown map of the database:
- Tables: We fetch only BASE TABLE entries to avoid system views.
- Columns: We capture types and nullability to help the LLM understand what data it is dealing with.
- Keys: We identify primary and foreign keys, so the LLM knows exactly how to JOIN tables without guessing.
The resulting map looks like this:
## Table: customers
Columns:
- customer_id: int NOT NULL (PRIMARY KEY)
- name: nvarchar NULL
- email: nvarchar NULL
- created_at: datetime2 NULL
## Table: orders
Columns:
- order_id: int NOT NULL (PRIMARY KEY)
- customer_id: int NULL
- order_date: date NULL
- total_amount: decimal NULL
Foreign Keys:
- customer_id → customers.customer_id
We wrap the queries in a Python class that does the heavy lifting. This class connects to your DB, runs the metadata query, and outputs a clean Markdown string.
# schema_extractor.py
from typing import Dict, List, Optional

from mssql_python import connect


class SchemaExtractor:
    def __init__(self, connection_string):
        # Using the mssql-python connect method
        if isinstance(connection_string, dict):
            # Unpacks dict keys into: connect(SERVER='...', DATABASE='...')
            self.conn = connect(**connection_string)
        else:
            # Treats it as a standard connection string
            self.conn = connect(connection_string)
        self.cursor = self.conn.cursor()

    def get_tables(self, include_tables: Optional[List[str]] = None) -> List[str]:
        """Get specific or all table names in the database."""
        if include_tables:
            # Format list for T-SQL IN clause: 'table1', 'table2'
            placeholders = ", ".join(f"'{t}'" for t in include_tables)
            filter_clause = f"AND table_name IN ({placeholders})"
        else:
            filter_clause = ""
        query = f"""
            SELECT table_name
            FROM information_schema.tables
            WHERE table_type = 'BASE TABLE'
              AND table_schema = 'dbo'
              {filter_clause}
            ORDER BY table_name;
        """
        self.cursor.execute(query)
        return [row[0] for row in self.cursor.fetchall()]

    def get_table_schema(self, table_name: str) -> Dict:
        """Get detailed schema for a specific MSSQL table."""
        # 1. Get Columns
        query = """
            SELECT column_name, data_type, is_nullable, column_default
            FROM information_schema.columns
            WHERE table_name = ?
            ORDER BY ordinal_position;
        """
        self.cursor.execute(query, (table_name,))
        columns = []
        for row in self.cursor.fetchall():
            columns.append({
                'name': row[0],  # Accessing by index
                'type': row[1],
                'nullable': row[2] == 'YES',
                'default': row[3]
            })
        # 2. Get Primary Keys (Improved Query)
        pk_query = """
            SELECT COLUMN_NAME
            FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
            WHERE OBJECTPROPERTY(OBJECT_ID(CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME), 'IsPrimaryKey') = 1
              AND TABLE_NAME = ?;
        """
        self.cursor.execute(pk_query, (table_name,))
        # fetchall() returns a list of tuples like [('id',), ('other_pk',)]
        primary_keys = [row[0] for row in self.cursor.fetchall()]
        return {
            'table_name': table_name,
            'columns': columns,
            'primary_keys': primary_keys
        }

    def get_foreign_keys(self, table_name: str) -> List[Dict]:
        """Get foreign key relationships in MSSQL."""
        query = """
            SELECT
                cp.name AS column_name,
                tr.name AS referenced_table,
                cr.name AS referenced_column
            FROM sys.foreign_keys AS fk
            INNER JOIN sys.foreign_key_columns AS fkc ON fk.object_id = fkc.constraint_object_id
            INNER JOIN sys.tables AS tp ON fkc.parent_object_id = tp.object_id
            INNER JOIN sys.columns AS cp ON fkc.parent_object_id = cp.object_id AND fkc.parent_column_id = cp.column_id
            INNER JOIN sys.tables AS tr ON fkc.referenced_object_id = tr.object_id
            INNER JOIN sys.columns AS cr ON fkc.referenced_object_id = cr.object_id AND fkc.referenced_column_id = cr.column_id
            WHERE tp.name = ?;
        """
        self.cursor.execute(query, (table_name,))
        # Accessing each column by its index in the tuple
        return [{
            'column': row[0],
            'references_table': row[1],
            'references_column': row[2]
        } for row in self.cursor.fetchall()]

    def format_schema_for_llm(self, target_tables: Optional[List[str]] = None) -> str:
        """Format the filtered schema for LLM consumption."""
        tables = self.get_tables(include_tables=target_tables)
        schema_description = "# Database Schema (MSSQL)\n\n"
        for table in tables:
            schema = self.get_table_schema(table)
            fks = self.get_foreign_keys(table)
            schema_description += f"## Table: {table}\n"
            schema_description += "Columns:\n"
            for col in schema['columns']:
                pk_marker = " (PRIMARY KEY)" if col['name'] in schema['primary_keys'] else ""
                nullable = "NULL" if col['nullable'] else "NOT NULL"
                schema_description += f"- {col['name']}: {col['type']} {nullable}{pk_marker}\n"
            if fks:
                schema_description += "\nForeign Keys:\n"
                for fk in fks:
                    schema_description += f"- {fk['column']} → {fk['references_table']}.{fk['references_column']}\n"
            schema_description += "\n"
        return schema_description

    def close(self):
        """Close database connection."""
        self.cursor.close()
        self.conn.close()


if __name__ == "__main__":
    # 1. Define your connection details
    # Change 'YourServerName' and 'YourDatabaseName' to your actual server/database
    db_config = {
        "SERVER": "YourServerName",
        "DATABASE": "YourDatabaseName",
        "Trusted_Connection": "yes",
        "Encrypt": "no"
    }
    # 2. Define the specific tables you want to extract; this is optional,
    # skip it to get details of all tables.
    MY_TABLES = ["customers", "orders"]
    extractor = None
    try:
        # 3. Initialize and run
        extractor = SchemaExtractor(db_config)
        schema_text = extractor.format_schema_for_llm(target_tables=MY_TABLES)
        print(schema_text)
    except Exception as e:
        print(f"Error: {e}")
    finally:
        if extractor:
            extractor.close()
Why Markdown?
You may wonder why we convert the schema to Markdown. LLMs are trained heavily on documentation and code repositories (like GitHub), so they are remarkably good at parsing Markdown headers and lists, which makes it one of the most token-efficient ways to hand the schema to the model.
Fill in your own database configuration details; the examples here use a local DB, so if your database lives on a server, adjust the connection settings accordingly.
"The Brain": Orchestrating the LLM Logic
If the schema is our "secret sauce", this class is the "brain" that knows how to cook with it.
SimpleSQLQueryGenerator doesn't just shuttle text back and forth; it acts as a translator. It takes your unstructured, human question and your structured schema, then applies a strict set of T-SQL rules to make sure the output actually runs on your server.
What makes this "brain" different:
- Provider agnostic: While I use Ollama here for local privacy, the class speaks the OpenAI API standard, meaning you can swap to GPT-4 or another cloud model with one line of code.
- Zero-shot precision: We use a temperature of 0.1. In creative writing you want a high temperature; in SQL generation you want the LLM to be as "boring" and predictable as possible.
- Syntax guardrails: We explicitly teach the LLM to avoid common pitfalls, like using LIMIT (Postgres/MySQL) when it should use TOP (MS SQL Server).
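As a concrete illustration of the guardrail idea, a small post-generation check can flag non-T-SQL constructs before the query ever reaches the server. This is a hypothetical helper of my own, not part of the article's classes:

```python
import re

# Patterns for constructs that are valid in Postgres/MySQL but not T-SQL.
# The list is illustrative, not exhaustive.
NON_TSQL_PATTERNS = {
    "LIMIT": r"\bLIMIT\s+\d+",               # use TOP / OFFSET-FETCH instead
    "ILIKE": r"\bILIKE\b",                    # use LIKE (collation-dependent)
    "TO_TIMESTAMP": r"\bTO_TIMESTAMP\s*\(",  # use CONVERT/CAST instead
}

def find_dialect_violations(sql: str) -> list:
    """Return the names of non-T-SQL constructs found in the query."""
    return [name for name, pattern in NON_TSQL_PATTERNS.items()
            if re.search(pattern, sql, re.IGNORECASE)]
```

If this returns a non-empty list, you can either re-prompt the model with an extra rule or reject the query outright.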
# simple_query_generator.py
import os

from openai import OpenAI


class SimpleSQLQueryGenerator:
    def __init__(self, provider: str = "ollama", model: str = None):
        """
        Initialize the query generator.

        Args:
            provider: "ollama"
            model: Model name (optional, uses defaults)
        """
        self.provider = provider
        if provider == "ollama":
            self.client = OpenAI(api_key=os.getenv("OLLAMA_API_KEY"),
                                 base_url=os.getenv("OLLAMA_BASE_URL"))
            self.model = model or "llama3.2"
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def create_system_prompt(self, schema: str, dialect: str = "mssql") -> str:
        """Create a detailed system prompt for the LLM."""
        return f"""You are an expert T-SQL (Microsoft SQL Server) query generator. Your task is to convert natural language questions into accurate, high-performance MSSQL queries.

Database Schema:
{schema}

SQL Dialect: {dialect} (T-SQL / SQL Server)

Rules:
1. Generate ONLY the SQL query. No explanations, no markdown code blocks, and no backticks.
2. Use proper JOIN syntax (INNER, LEFT) and always use table aliases (e.g., `customers AS c`).
3. Use TOP for limiting results instead of LIMIT (e.g., `SELECT TOP 10 ...`).
4. For date filtering/extraction, use T-SQL functions like `GETDATE()`, `DATEPART()`, `DATEDIFF()`, and `FORMAT()`.
5. Use `COALESCE` to handle NULL values in calculations or concatenations.
6. When comparing strings, use the `LIKE` operator with `%` wildcards if partial matches are implied.
7. Use `ISNULL()` or `COALESCE()` for NULL-safe aggregations.
8. Follow T-SQL best practices: use `QUOTED_IDENTIFIER` logic (square brackets `[ ]`) if table or column names contain spaces or are reserved keywords.
9. For pagination/offset, use `OFFSET 0 ROWS FETCH NEXT N ROWS ONLY` if `TOP` is not suitable.
10. Ensure all column names and table names match the provided schema exactly.

Important:
- MS SQL Server is the target; do NOT use `LIMIT`, `ILIKE`, or `TO_TIMESTAMP`.
- If the question is ambiguous, assume the most common business logic.
- If the schema lacks necessary information, provide a brief comment starting with `--` explaining why.
- For temporal queries (e.g., "this year"), use `YEAR(order_date) = YEAR(GETDATE())`.
- Always group by all non-aggregated columns when using `GROUP BY`.
"""

    def generate_query(self, question: str, schema: str, dialect: str = "mssql") -> str:
        """
        Generate SQL query from natural language.

        Args:
            question: Natural language question
            schema: Database schema
            dialect: SQL dialect (e.g., mssql)

        Returns:
            Generated SQL query
        """
        system_prompt = self.create_system_prompt(schema, dialect)
        if self.provider == "anthropic":
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=system_prompt,
                messages=[
                    {"role": "user", "content": question}
                ]
            )
            query = response.content[0].text.strip()
        elif self.provider == "openai":
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question}
                ],
                temperature=0.1  # Lower temperature for more consistent output
            )
            query = response.choices[0].message.content.strip()
        elif self.provider == "ollama":
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question}
                ],
                temperature=0.1
            )
            query = response.choices[0].message.content.strip()
        # Clean up the query (remove markdown if present)
        query = self._clean_query(query)
        return query

    def _clean_query(self, query: str) -> str:
        """Remove markdown formatting and extra whitespace."""
        # Remove SQL markdown blocks
        if query.startswith("```sql"):
            query = query[6:]
        if query.startswith("```"):
            query = query[3:]
        if query.endswith("```"):
            query = query[:-3]
        return query.strip()
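The fence-stripping step in _clean_query can be exercised in isolation. Here is a standalone version of the same idea, handy for quick sanity checks:

```python
def strip_markdown_fences(query: str) -> str:
    """Remove ```sql / ``` fences an LLM may wrap around its answer."""
    query = query.strip()
    if query.startswith("```sql"):
        query = query[6:]
    elif query.startswith("```"):
        query = query[3:]
    if query.endswith("```"):
        query = query[:-3]
    return query.strip()
```

Even with rule 1 in the prompt, smaller local models occasionally wrap their answer in fences anyway, so keeping this cleanup step is cheap insurance.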
Putting It All Together
Here is the final execution script. It connects to your local DB, extracts the schema, sends your question to the LLM, and prints the executable SQL.
# main.py
import os

from dotenv import load_dotenv

from schema_extractor import SchemaExtractor
from simple_query_generator import SimpleSQLQueryGenerator

# Load environment variables
load_dotenv(override=True)


def main():
    # 1. Config & Connection
    # If running locally
    db_params = {
        "SERVER": r"(localdb)\MSSQLLocalDB",
        "DATABASE": "master",
        "Trusted_Connection": "yes",
        "Encrypt": "no"
    }
    """
    # If using environment variables (e.g., for production)
    db_params = {
        'host': os.getenv('DB_HOST'),
        'port': os.getenv('DB_PORT'),
        'database': os.getenv('DB_NAME'),
        'user': os.getenv('DB_USER'),
        'password': os.getenv('DB_PASSWORD')
    }
    """
    # 2. Extract Schema
    my_tables = ["customers", "orders"]
    extractor = SchemaExtractor(db_params)
    schema_md = extractor.format_schema_for_llm(my_tables)

    # 3. Generate Query
    generator = SimpleSQLQueryGenerator()
    user_question = "Show me the top 5 customers by total order amount."
    sql_query = generator.generate_query(user_question, schema_md)

    print(f"--- Database Schema ---\n{schema_md}")
    print(f"User Question: {user_question}\n")
    print(f"--- Generated SQL ---\n{sql_query}")


if __name__ == "__main__":
    main()
Pro-Tips for Production: Tuning Your “Brain”
Before you press "Publish" on your generator, there are two golden rules to remember when working with LLMs and SQL:
1. Model choice matters (but size isn’t everything)
The quality of your SQL depends heavily on the model you choose.
- Large models (GPT-4, Claude 3.5): Excel at complex logic and deeply nested subqueries.
- Local models (Llama 3.2, Mistral): Blazing fast and private. While small, they are surprisingly “SQL-literate” if your schema is clean.
2. Prompt Engineering > Model Switching
If the generated SQL is not quite right, do not switch models right away. Most problems can be fixed by tuning your system prompt.
If the LLM makes a mistake, try adding a specific rule to the create_system_prompt method:
- Is it using LIMIT instead of TOP? Add: "Never use LIMIT; always use SELECT TOP X."
- Is it hallucinating columns? Add: "Use only columns explicitly listed in the given schema."
- Is it failing on dates? Add: "Use DATEDIFF or DATEPART for all temporal comparisons."
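One lightweight way to apply such fixes is to append the correction rules to the base prompt rather than editing it inline. A sketch; the helper name and the header text are illustrative, not from the repo:

```python
# Hypothetical correction rules collected from observed model mistakes.
EXTRA_RULES = [
    "Never use LIMIT; always use SELECT TOP X.",
    "Use only columns explicitly listed in the given schema.",
]

def extend_system_prompt(base_prompt: str, extra_rules: list) -> str:
    """Append correction rules to a base prompt under a separate header."""
    lines = [base_prompt.rstrip(), "", "Additional rules:"]
    lines += [f"- {rule}" for rule in extra_rules]
    return "\n".join(lines)
```

Keeping the corrections in a list like this also lets you version them per model, since a rule Llama 3.2 needs may be unnecessary for GPT-4.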
The Outcome
By combining the "secret sauce" (your metadata) with the "brain" (the LLM logic), you turn a simple prompt like "Show me the top 5 customers by total order amount" into a multi-join T-SQL statement, instantly.
C:\sainathudata\projects\simple_sql_query_generator>uv run main.py
--- Database Schema ---
# Database Schema (MSSQL)

## Table: customers
Columns:
- customer_id: int NOT NULL (PRIMARY KEY)
- name: nvarchar NULL
- email: nvarchar NULL
- created_at: datetime2 NULL

## Table: orders
Columns:
- order_id: int NOT NULL (PRIMARY KEY)
- customer_id: int NULL
- order_date: date NULL
- total_amount: decimal NULL

Foreign Keys:
- customer_id → customers.customer_id

User Question: Show me the top 5 customers by total order amount.
--- Generated SQL ---
SELECT TOP 5
    c.customer_id,
    c.name,
    COALESCE(SUM(o.total_amount), 0) AS total_order_amount
FROM
    customers AS c
LEFT JOIN
    orders AS o ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id, c.name
ORDER BY
    total_order_amount DESC;
GitHub
GitHub – sainathudata/simple_sql_query_generator_llm
Contribute to sainathudata/simple_sql_query_generator_llm development by creating an account on GitHub.
github.com
In the next article, we will look at how to validate the generated SQL and execute it.
Published via Towards AI
