In this tutorial, we build a complete pipeline for single-cell RNA sequencing analysis scanpy. We start by installing the required libraries and loading the PBMC 3k dataset, then perform quality control, filtering, and normalization to prepare the data for downstream analysis. We then identify highly variable genes, perform PCA for dimensionality reduction, and construct a neighborhood graph to generate UMAP embeddings and Leiden clusters. Through marker gene discovery and visualization, we explore how clusters match biological cell populations and apply a simple rule-based annotation strategy to predict cell types.
import sys
import subprocess
import importlib
def pip_install(*packages):
subprocess.check_call((sys.executable, "-m", "pip", "install", "-q", *packages))
required = (
"scanpy",
"anndata",
"leidenalg",
"igraph",
"harmonypy",
"seaborn"
)
pip_install(*required)
import os
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as ad
sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
np.random.seed(42)
print("Scanpy version:", sc.__version__)
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print("nInitial AnnData:")
print(adata)
adata.layers("counts") = adata.X.copy()
adata.var("mt") = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=("mt"), percent_top=None, log1p=False, inplace=True)
print("nQC summary:")
display(
adata.obs(("n_genes_by_counts", "total_counts", "pct_counts_mt")).describe().T
)
We install all the required dependencies and import the core scientific computing libraries required for the analysis. We configure ScanPi settings, initialize the environment, and load the PBMC 3k single-cell RNA-seq dataset. We then calculate quality-control metrics, including mitochondrial gene percentage, total count, and number of genes detected for each cell.
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.violin(adata, ("n_genes_by_counts"), jitter=0.4, ax=axs(0), show=False)
sc.pl.violin(adata, ("total_counts"), jitter=0.4, ax=axs(1), show=False)
sc.pl.violin(adata, ("pct_counts_mt"), jitter=0.4, ax=axs(2), show=False)
plt.tight_layout()
plt.show()
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")
adata = adata(adata.obs("n_genes_by_counts") >= 200).copy()
adata = adata(adata.obs("n_genes_by_counts") <= 5000).copy()
adata = adata(adata.obs("pct_counts_mt") < 10).copy()
sc.pp.filter_genes(adata, min_cells=3)
print("nAfter filtering:")
print(adata)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(
adata,
flavor="seurat",
min_mean=0.0125,
max_mean=3,
min_disp=0.5
)
print("nHighly variable genes selected:", int(adata.var("highly_variable").sum()))
sc.pl.highly_variable_genes(adata)
adata = adata(:, adata.var("highly_variable")).copy()
We visualize quality control metrics by using plots to examine the distribution of gene counts and mitochondrial content. We apply a filtering step to remove low-quality cells and genes that do not meet basic expression thresholds. We then normalize the data, apply log transformation, and identify highly variable genes for downstream analysis.
sc.pp.regress_out(adata, ("total_counts", "pct_counts_mt"))
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True)
sc.pl.pca(adata, color=None)
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
sc.tl.umap(adata, min_dist=0.35, spread=1.0)
sc.tl.leiden(adata, resolution=0.6, key_added="leiden")
print("nCluster counts:")
display(adata.obs("leiden").value_counts().sort_index().rename("cells_per_cluster").to_frame())
sc.pl.umap(adata, color=("leiden"), legend_loc="on data", title="PBMC 3k - Leiden clusters")
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
marker_table = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop marker rows:")
display(marker_table.head(20))
We remove technical confounders and scale the dataset to prepare for dimensionality reduction. We perform principal component analysis to capture the key variance structure of the dataset. We then construct neighborhood graphs, compute UMAP embeddings, perform Leiden clustering, and identify marker genes for each cluster.
top_markers_per_cluster = (
marker_table.groupby("group")
.head(10)
.loc(:, ("group", "names", "logfoldchanges", "pvals_adj"))
.reset_index(drop=True)
)
print("nTop 10 markers per cluster:")
display(top_markers_per_cluster)
candidate_markers = (
"IL7R", "LTB", "MALAT1", "CCR7",
"NKG7", "GNLY", "PRF1",
"MS4A1", "CD79A", "CD79B",
"LYZ", "S100A8", "FCER1A", "CST3",
"PPBP", "FCGR3A", "LGALS3", "CTSS",
"CD3D", "TRBC1", "TRAC"
)
candidate_markers = (g for g in candidate_markers if g in adata.var_names)
if candidate_markers:
sc.pl.dotplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
sc.pl.matrixplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
cluster_marker_reference = {
"T_cells": ("IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"),
"NK_cells": ("NKG7", "GNLY", "PRF1"),
"B_cells": ("MS4A1", "CD79A", "CD79B"),
"Monocytes": ("LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"),
"Dendritic_cells": ("FCER1A", "CST3"),
"Platelets": ("PPBP")
}
We examine the most significant marker genes found for each cluster and summarize the top markers. We visualized gene expression patterns in groups using dot plots and matrix plots for known immune cell markers. We also define reference mapping of marker genes associated with major immune cell types for subsequent annotation.
available_reference = {
celltype: (g for g in genes if g in adata.var_names)
for celltype, genes in cluster_marker_reference.items()
}
available_reference = {k: v for k, v in available_reference.items() if len(v) > 0}
for celltype, genes in available_reference.items():
sc.tl.score_genes(adata, gene_list=genes, score_name=f"{celltype}_score", use_raw=False)
score_cols = (f"{ct}_score" for ct in available_reference.keys())
cluster_scores = adata.obs.groupby("leiden")(score_cols).mean()
display(cluster_scores)
cluster_to_celltype = {}
for cluster in cluster_scores.index:
best = cluster_scores.loc(cluster).idxmax().replace("_score", "")
cluster_to_celltype(cluster) = best
adata.obs("cell_type") = adata.obs("leiden").map(cluster_to_celltype).astype("category")
print("nCluster to cell-type mapping:")
display(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=("assigned_cell_type")))
sc.pl.umap(
adata,
color=("leiden", "cell_type"),
legend_loc="on data",
wspace=0.45
)
sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)
celltype_markers = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop markers by annotated cell type:")
display(
celltype_markers.groupby("group").head(8)(("group", "names", "logfoldchanges", "pvals_adj"))
)
cluster_prop = (
adata.obs("cell_type")
.value_counts(normalize=True)
.mul(100)
.round(2)
.rename("percent")
.to_frame()
)
print("nCell-type proportions (%):")
display(cluster_prop)
plt.figure(figsize=(7, 4))
cluster_prop("percent").sort_values().plot(kind="barh")
plt.xlabel("Percent of cells")
plt.ylabel("Cell type")
plt.title("Estimated cell-type composition")
plt.tight_layout()
plt.show()
output_dir = "scanpy_pbmc3k_outputs"
os.makedirs(output_dir, exist_ok=True)
adata.write(os.path.join(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
marker_table.to_csv(os.path.join(output_dir, "cluster_markers.csv"), index=False)
celltype_markers.to_csv(os.path.join(output_dir, "celltype_markers.csv"), index=False)
cluster_scores.to_csv(os.path.join(output_dir, "cluster_score_matrix.csv"))
print(f"nSaved outputs to: {output_dir}")
print("Files:")
for f in sorted(os.listdir(output_dir)):
print(" -", f)
summary = {
"n_cells_final": int(adata.n_obs),
"n_genes_final": int(adata.n_vars),
"n_clusters": int(adata.obs("leiden").nunique()),
"clusters": sorted(adata.obs("leiden").unique().tolist()),
"cell_types": sorted(adata.obs("cell_type").unique().tolist()),
}
print("nAnalysis summary:")
for k, v in summary.items():
print(f"{k}: {v}")
We score each cell using known marker gene sets and assign potential cell types to groups based on expression patterns. We visualize annotated cell types on UMAP embeddings and perform differential gene expression analysis in predicted cell populations. Additionally, we calculate cell-type proportions, generate summary visualizations, and save the processed datasets and analysis outputs for further research.
In conclusion, we developed a complete end-to-end workflow to analyze single-cell transcriptomic data using ScanPy. We performed preprocessing, clustering, marker-gene analysis and cell-type annotation, and visualized the data structure using UMAP and gene expression plots. By saving the processed nData objects and analysis output, we created a reusable dataset for further biological interpretation and advanced modeling. This workflow demonstrates how ScanPy enables scalable, reproducible single-cell analysis through a structured, modular Python pipeline.
check it out full code here. Also, feel free to follow us Twitter And don’t forget to join us 120k+ ml subreddit and subscribe our newsletter. wait! Are you on Telegram? Now you can also connect with us on Telegram.
The post A Coding Guide to Building a Complete Single Cell RNA Sequencing Analysis Pipeline Using ScanPy for Clustering Visualization and Cell Type Annotation appeared first on MarkTechPost.
