A production-style NetworKit 11.2.1 coding tutorial for large-scale graph analytics: communities, cores, and sparsification


In this tutorial, we implement a production-grade, large-scale graph analytics pipeline with NetworKit, focusing on speed, memory efficiency, and version-safe APIs in NetworKit 11.2.1. We generate a large scale-free network, extract the largest connected component, and then compute structural backbone signals through k-core decomposition and centrality ranking. We also detect communities with PLM and quantify their quality using modularity; estimate distance structure using effective and estimated diameters; and, finally, sparsify the graph to reduce cost while preserving key properties. We export the sparsified graph as an edge list so that we can reuse it in downstream workflows, benchmarking, or graph ML preprocessing.

!pip -q install networkit pandas numpy psutil


import gc, time, os
import numpy as np
import pandas as pd
import psutil
import networkit as nk


print("NetworKit:", nk.__version__)
nk.setNumberOfThreads(min(2, nk.getMaxNumberOfThreads()))
nk.setSeed(7, False)


def ram_gb():
   p = psutil.Process(os.getpid())
   return p.memory_info().rss / (1024**3)


def tic():
   return time.perf_counter()


def toc(t0, msg):
   print(f"{msg}: {time.perf_counter()-t0:.3f}s | RAM~{ram_gb():.2f} GB")


def report(G, name):
   print(f"n({name}) nodes={G.numberOfNodes():,} edges={G.numberOfEdges():,} directed={G.isDirected()} weighted={G.isWeighted()}")


def force_cleanup():
   gc.collect()


PRESET = "LARGE"


if PRESET == "LARGE":
   N = 120_000
   M_ATTACH = 6
   AB_EPS = 0.12
   ED_RATIO = 0.9
elif PRESET == "XL":
   N = 250_000
   M_ATTACH = 6
   AB_EPS = 0.15
   ED_RATIO = 0.9
else:
   N = 80_000
   M_ATTACH = 6
   AB_EPS = 0.10
   ED_RATIO = 0.9


print(f"nPreset={PRESET} | N={N:,} | m={M_ATTACH} | approx-betweenness epsilon={AB_EPS}")

We set up the Colab environment with NetworKit and monitoring utilities, and we lock in a stable random seed. We configure thread usage to match the runtime and define timing and RAM-tracking helpers for each key stage. We choose a scale preset that controls graph size and approximation knobs so that the pipeline remains large but manageable.

t0 = tic()
G = nk.generators.BarabasiAlbertGenerator(M_ATTACH, N).generate()
toc(t0, "Generated BA graph")
report(G, "G")


t0 = tic()
cc = nk.components.ConnectedComponents(G)
cc.run()
toc(t0, "ConnectedComponents")
print("components:", cc.numberOfComponents())


if cc.numberOfComponents() > 1:
   t0 = tic()
   G = nk.graphtools.extractLargestConnectedComponent(G, compactGraph=True)
   toc(t0, "Extracted LCC (compactGraph=True)")
   report(G, "LCC")


force_cleanup()

We generate a large Barabási-Albert graph and immediately log its size and runtime footprint. We compute connected components to understand fragmentation and quickly diagnose the topology. We then extract the largest connected component and compact its node IDs to improve the performance and reliability of the rest of the pipeline.
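The Barabási-Albert model grows a network by preferential attachment: each new node links to m existing nodes with probability proportional to their current degree. As intuition for what NetworKit's generator implements in optimized form, here is a minimal pure-Python sketch (the helper name `barabasi_albert_edges` is ours, not part of the NetworKit API):

```python
import random

def barabasi_albert_edges(n, m, seed=7):
    """Toy preferential attachment: return the edge list of a BA graph."""
    rng = random.Random(seed)
    targets = list(range(m))  # the first new node links to the m seed nodes
    repeated = []             # each node appears once per incident edge
    edges = []
    for source in range(m, n):
        edges.extend((source, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([source] * m)
        # sample m distinct targets for the next node, degree-proportionally
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges

edges = barabasi_albert_edges(1000, 3)
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
print("edges:", len(edges), "| max degree:", max(deg.values()))
```

The heavy-tailed degree distribution (a max degree far above the mean of roughly 2m) is exactly what makes the k-core and centrality analysis below interesting.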

t0 = tic()
core = nk.centrality.CoreDecomposition(G)
core.run()
toc(t0, "CoreDecomposition")
core_vals = np.array(core.scores(), dtype=np.int32)
print("degeneracy (max core):", int(core_vals.max()))
print("core stats:", pd.Series(core_vals).describe(percentiles=(0.5, 0.9, 0.99)).to_dict())


k_thr = int(np.percentile(core_vals, 97))


t0 = tic()
nodes_backbone = [u for u in range(G.numberOfNodes()) if core_vals[u] >= k_thr]
G_backbone = nk.graphtools.subgraphFromNodes(G, nodes_backbone)
toc(t0, f"Backbone subgraph (k>={k_thr})")
report(G_backbone, "Backbone")


force_cleanup()
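CoreDecomposition assigns each node the largest k such that it survives in the k-core, the maximal subgraph in which every node has degree at least k. For intuition, here is a tiny reference implementation of the peeling algorithm on a toy graph (our own quadratic-time helper; NetworKit's version runs in linear time):

```python
def core_numbers(adj):
    """Peel minimum-degree nodes; the running max of removal degrees
    gives each node's core number."""
    adj = {u: set(vs) for u, vs in adj.items()}  # local mutable copy
    core, k = {}, 0
    while adj:
        u = min(adj, key=lambda x: len(adj[x]))  # current min-degree node
        k = max(k, len(adj[u]))
        core[u] = k
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
    return core

# triangle {0, 1, 2} with a pendant node 3 attached to node 0
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(core_numbers(toy))  # pendant lands in the 1-core, triangle in the 2-core
```

The maximum value returned is the graph's degeneracy, the same quantity printed from `core_vals.max()` above.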


t0 = tic()
pr = nk.centrality.PageRank(G, damp=0.85, tol=1e-8)
pr.run()
toc(t0, "PageRank")


pr_scores = np.array(pr.scores(), dtype=np.float64)
top_pr = np.argsort(-pr_scores)[:15]
print("Top PageRank nodes:", top_pr.tolist())
print("Top PageRank scores:", pr_scores[top_pr].tolist())


t0 = tic()
abw = nk.centrality.ApproxBetweenness(G, epsilon=AB_EPS)
abw.run()
toc(t0, "ApproxBetweenness")


abw_scores = np.array(abw.scores(), dtype=np.float64)
top_abw = np.argsort(-abw_scores)[:15]
print("Top ApproxBetweenness nodes:", top_abw.tolist())
print("Top ApproxBetweenness scores:", abw_scores[top_abw].tolist())


force_cleanup()

We compute the core decomposition to measure the graph's degeneracy and identify its dense backbone. We extract a backbone subgraph using a 97th-percentile core-number threshold to focus on structurally important nodes. We then run PageRank and approximate betweenness to rank nodes by global influence and bridge-like behavior.
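One quick way to check how much the two centrality rankings agree is the Jaccard overlap of their top-k node sets. A small NumPy helper (`topk_jaccard` is a name we introduce here, not a NetworKit function):

```python
import numpy as np

def topk_jaccard(scores_a, scores_b, k=15):
    """Jaccard similarity of the top-k index sets of two score vectors."""
    top_a = set(np.argsort(-np.asarray(scores_a))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(scores_b))[:k].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

a = np.array([0.90, 0.80, 0.10, 0.05, 0.02])
b = np.array([0.90, 0.10, 0.80, 0.05, 0.02])
print(topk_jaccard(a, b, k=2))  # top-2 sets {0, 1} vs {0, 2} -> 1/3
```

Applied to the pipeline, `topk_jaccard(pr_scores, abw_scores)` gives a rough sense of whether high-PageRank hubs are also the bridge-like nodes that approximate betweenness surfaces.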

t0 = tic()
plm = nk.community.PLM(G, refine=True, gamma=1.0, par="balanced")
plm.run()
toc(t0, "PLM community detection")


part = plm.getPartition()
num_comms = part.numberOfSubsets()
print("communities:", num_comms)


t0 = tic()
Q = nk.community.Modularity().getQuality(part, G)
toc(t0, "Modularity")
print("modularity Q:", Q)


sizes = np.array(list(part.subsetSizeMap().values()), dtype=np.int64)
print("community size stats:", pd.Series(sizes).describe(percentiles=(0.5, 0.9, 0.99)).to_dict())


t0 = tic()
eff = nk.distance.EffectiveDiameter(G, ED_RATIO)
eff.run()
toc(t0, f"EffectiveDiameter (ratio={ED_RATIO})")
print("effective diameter:", eff.getEffectiveDiameter())


t0 = tic()
diam = nk.distance.Diameter(G, algo=nk.distance.DiameterAlgo.EstimatedRange, error=0.1)
diam.run()
toc(t0, "Diameter (estimated range)")
print("estimated diameter (lower, upper):", diam.getDiameter())


force_cleanup()

We detect communities with PLM and record how many the large graph yields. We compute modularity and summarize the community-size distribution to validate the structure rather than relying on the partition count alone. We estimate global distance behavior using the effective diameter and an estimated diameter range, in a version-safe way for NetworKit 11.2.1.
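The effective diameter at ratio 0.9 is the smallest distance d such that at least 90% of connected node pairs lie within distance d. For intuition about the definition NetworKit evaluates, here is a brute-force BFS version on a toy graph (our own helper, far too slow for the large graph itself):

```python
from collections import deque

def effective_diameter(adj, ratio=0.9):
    """Smallest d covering >= ratio of reachable ordered node pairs."""
    dist_counts = {}
    for s in adj:
        seen = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        for d in seen.values():
            if d > 0:
                dist_counts[d] = dist_counts.get(d, 0) + 1
    total = sum(dist_counts.values())
    cum = 0
    for d in sorted(dist_counts):
        cum += dist_counts[d]
        if cum / total >= ratio:
            return d

# path graph 0-1-2-3-4: diameter is 4, but 90% of pairs sit within distance 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(effective_diameter(path))  # -> 3
```

Because it ignores the farthest few pairs, the effective diameter is a more robust "typical distance" signal than the exact diameter, which a single stray path can inflate.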

t0 = tic()
G.indexEdges()  # sparsifiers require indexed edges
sp = nk.sparsification.LocalSimilaritySparsifier()
G_sparse = sp.getSparsifiedGraphOfSize(G, 0.7)
toc(t0, "LocalSimilarity sparsification (keep ~70% of edges)")
report(G_sparse, "Sparse")


t0 = tic()
pr2 = nk.centrality.PageRank(G_sparse, damp=0.85, tol=1e-8)
pr2.run()
toc(t0, "PageRank on sparse")
pr2_scores = np.array(pr2.scores(), dtype=np.float64)
print("Top PR nodes (sparse):", np.argsort(-pr2_scores)(:15).tolist())


t0 = tic()
plm2 = nk.community.PLM(G_sparse, refine=True, gamma=1.0, par="balanced")
plm2.run()
toc(t0, "PLM on sparse")
part2 = plm2.getPartition()
Q2 = nk.community.Modularity().getQuality(part2, G_sparse)
print("communities (sparse):", part2.numberOfSubsets(), "| modularity (sparse):", Q2)


t0 = tic()
eff2 = nk.distance.EffectiveDiameter(G_sparse, ED_RATIO)
eff2.run()
toc(t0, "EffectiveDiameter on sparse")
print("effective diameter (orig):", eff.getEffectiveDiameter(), "| (sparse):", eff2.getEffectiveDiameter())


force_cleanup()


out_path = "/content/networkit_large_sparse.edgelist"
t0 = tic()
nk.graphio.EdgeListWriter("\t", 0).write(G_sparse, out_path)
toc(t0, "Wrote edge list")
print("Saved:", out_path)


print("nAdvanced large-graph pipeline complete.")

We sparsify the graph with local similarity to reduce the edge count while preserving structure useful for downstream analytics. We re-run PageRank, PLM, and effective diameter on the sparsified graph to check whether the main signals remain consistent. We export the sparsified graph as an edge list so that we can reuse it across sessions, tools, or additional experiments.
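Beyond comparing top-15 lists, a fuller stability check is the Spearman rank correlation between the original and sparsified PageRank vectors. A minimal tie-unaware NumPy version (the `spearman` helper is ours):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation via Pearson on ranks (assumes no tied scores)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

print(spearman([0.1, 0.2, 0.3], [1.0, 2.0, 3.0]))  # 1.0  (same order)
print(spearman([0.1, 0.2, 0.3], [3.0, 2.0, 1.0]))  # -1.0 (reversed order)
```

In the pipeline, `spearman(pr_scores, pr2_scores)` is meaningful because sparsification removes edges but keeps the node set, so node IDs line up between `G` and `G_sparse`; a value near 1 indicates the ranking survived sparsification.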

Finally, we have built an end-to-end, scalable NetworKit workflow that mirrors real large-network analysis: we started from generation, stabilized the topology with LCC extraction, characterized structure through cores and centralities, discovered communities and validated them with modularity, and captured global distance behavior through diameter estimates. We then applied sparsification to shrink the graph while keeping it analytically meaningful, and saved it for repeatable pipelines. The tutorial provides a practical template we can reuse on real datasets by replacing the generator with nk.graphio.EdgeListReader while keeping the same analysis, performance-tracking, and export steps.



