Tutorial 03: Dataset hierarchy and HTML reports
This notebook shows the main “chemical space browser” workflow:
- create a list of ligands
- build a hierarchy: denticity → topology → skeleton → ligand
- export a single self-contained HTML that you can share
Prerequisites
RDKit and cTopo must be installed in the active environment.
try:
    from rdkit import Chem
except ImportError as e:
    raise ImportError('RDKit is required: `conda install -c conda-forge rdkit`.') from e
from IPython.display import HTML
from ctopo import ligand_from_smiles
from ctopo.trees import build_ligand_tree, tree_to_html
1. A tiny example dataset
In real projects you’d load ligands from a file or extract them from complexes. Here we just build a small set by hand.
examples = {
# bidentate N-N
'en': '[NH2:1]CC[NH2:2]',
'en-Me': '[NH2:1]C(C)C[NH2:2]',
'en-Me2': '[NH2:1]C(C)C(C)[NH2:2]',
'en-cHex': '[NH2:1]C(CCCC1)C1[NH2:2]',
'biph': 'c1ccc[n:1]c1-c1[n:2]cccc1',
'biph-OMe': 'c1cc(OC)c[n:1]c1-c1[n:2]cc(OC)cc1',
'phen': 'c1cc2ccc3ccc[n:1]c3c2[n:2]c1',
'pn': '[NH2:1]CCC[NH2:2]',
'pn-Et': '[NH2:1]CC(CC)C[NH2:2]',
# bidentate N-O
'gly': '[NH2:1]CC(=O)[O-:2]',
'ala': '[NH2:1]C(C)C(=O)[O-:2]',
'pyCOOH': 'c1ccc[n:1]c1C(=O)[O-:2]',
# bidentate O-O
'ox': '[O-:1]C(=O)C(=O)[O-:2]',
'catechol': '[O-:1]c(cccc1)c1[O-:2]',
'acac': '[O-:1]C(C)=CC(C)=[O:2]',
# linear tridentate
'NNhN': 'C[N:1](C)CC[NH:2]CC[N:3](C)C',
'PNhN': 'C[P:1](C)CC[NH:2]CC[N:3](C)C',
'PNhP': 'C[P:1](C)CC[NH:2]CC[P:3](C)C',
'SNhN': 'C[S:1]CC[NH:2]CC[N:3](C)C',
'SNhP': 'C[S:1]CC[NH:2]CC[P:3](C)C',
'SNhS': 'C[S:1]CC[NH:2]CC[S:3]C',
'NNmeN': 'C[N:1](C)CC[N:2](C)CC[N:3](C)C',
'PNmeN': 'C[P:1](C)CC[N:2](C)CC[N:3](C)C',
'PNmeP': 'C[P:1](C)CC[N:2](C)CC[P:3](C)C',
'SNmeN': 'C[S:1]CC[N:2](C)CC[N:3](C)C',
'SNmeP': 'C[S:1]CC[N:2](C)CC[P:3](C)C',
'SNmeS': 'C[S:1]CC[N:2](C)CC[S:3]C',
'NNpyN': 'C[N:1](C)Cc(ccc1)[n:2]c1C[N:3](C)C',
'PNpyN': 'C[P:1](C)Cc(ccc1)[n:2]c1C[N:3](C)C',
'PNpyP': 'C[P:1](C)Cc(ccc1)[n:2]c1C[P:3](C)C',
'SNpyN': 'C[S:1]Cc(ccc1)[n:2]c1C[N:3](C)C',
'SNpyP': 'C[S:1]Cc(ccc1)[n:2]c1C[P:3](C)C',
'SNpyS': 'C[S:1]Cc(ccc1)[n:2]c1C[S:3]C',
# tripod tridentate
'tripod_N_N3': 'N(CC[NH2:1])(CC[NH2:2])CC[NH2:3]',
'tripod_N_N3_long1': 'N(CC[NH2:1])(CC[NH2:2])CCC[NH2:3]',
'tripod_N_N3_long3': 'N(CCC[NH2:1])(CCC[NH2:2])CCC[NH2:3]',
'tripod_C_N3': 'C(CC[NH2:1])(CC[NH2:2])CC[NH2:3]',
'tripod_C_N3_long1': 'C(CC[NH2:1])(CC[NH2:2])CCC[NH2:3]',
'tripod_C_N3_long3': 'C(CCC[NH2:1])(CCC[NH2:2])CCC[NH2:3]',
'tripod_B_N3': '[BH-](CC[NH2:1])(CC[NH2:2])CC[NH2:3]',
'tripod_B_N3_py': '[BH-](c1[n:1]cccc1)(c1[n:2]cccc1)c1[n:3]cccc1',
}
ligands = [ligand_from_smiles(smi) for smi in examples.values()]
ligand_ids = list(examples.keys())
print(f'Total number of ligands: {len(ligands)}')
Total number of ligands: 41
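For the "load ligands from a file" case mentioned above, a minimal sketch using only the standard library might look like the following. The helper read_smiles_table and the column names id and smiles are illustrative assumptions, not part of cTopo; the resulting dict has the same shape as examples, so its values could be passed to ligand_from_smiles in the same way.

```python
import csv
import io


def read_smiles_table(handle, id_column="id", smiles_column="smiles"):
    """Return an ordered {ligand_id: smiles} dict from a CSV file handle.

    Hypothetical helper: the column names are assumptions for this sketch.
    """
    table = {}
    for row in csv.DictReader(handle):
        ligand_id = (row.get(id_column) or "").strip()
        smiles = (row.get(smiles_column) or "").strip()
        if not ligand_id or not smiles:
            continue  # skip rows with a missing id or SMILES
        if ligand_id in table:
            raise ValueError(f"duplicate ligand id: {ligand_id!r}")
        table[ligand_id] = smiles
    return table


# In-memory stand-in for a real file, e.g. open("ligands.csv", newline="")
sample = io.StringIO(
    "id,smiles\n"
    "en,[NH2:1]CC[NH2:2]\n"
    "pn,[NH2:1]CCC[NH2:2]\n"
    "gly,[NH2:1]CC(=O)[O-:2]\n"
)
loaded = read_smiles_table(sample)
print(list(loaded))  # ['en', 'pn', 'gly']
```

Rejecting duplicate ids early keeps the id list and the ligand list aligned when both are later derived from the same dict.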
2. Build the tree
build_ligand_tree groups ligands by a sequence of “levels”.
The default levels are intended for large datasets (more than 100 ligands):
- denticity
- topology
- skeleton
- skeleton (+bond types)
- skeleton (+donor labels +bond types)
- full ligand
You can provide your own level sequence depending on the problem under consideration. Two practical patterns:
- Fast overview: stop at topology or skeleton
- Strict canonical: include bond types and donor labels before the final ligand
levels = ('denticity', ('topo',), ('skeleton',), "ligand")
tree = build_ligand_tree(ligands, ligand_ids=ligand_ids, levels=levels)
tree.number_of_nodes(), tree.number_of_edges()
(54, 53)
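To make the idea of grouping by a sequence of levels concrete, here is a toy sketch in plain Python. It is not how cTopo is implemented: build_tree, the records, and the key functions are all hypothetical. Each node is the path of keys from the root, so every child refines its parent, and the result is a tree with exactly one edge fewer than nodes, the same relation as in the (54, 53) output above.

```python
from collections import defaultdict


def build_tree(records, levels):
    """Group records by successive key functions (hypothetical helper).

    Returns (nodes, edges). Each node is a tuple of keys leading from the
    root to that node, so a child always extends its parent's path.
    """
    root = ()
    nodes = [root]
    edges = []
    frontier = {root: list(records)}
    for key_func in levels:
        next_frontier = {}
        for parent, members in frontier.items():
            groups = defaultdict(list)
            for rec in members:
                groups[key_func(rec)].append(rec)
            for key, group in groups.items():
                child = parent + (key,)
                nodes.append(child)
                edges.append((parent, child))
                next_frontier[child] = group
        frontier = next_frontier
    return nodes, edges


# Toy records: (name, donor count, bridge length). These are made-up
# stand-ins for real level keys such as denticity or skeleton.
toy = [
    ("lig_a", 2, 2), ("lig_b", 2, 3), ("lig_c", 2, 2),
    ("lig_d", 3, 2), ("lig_e", 3, 2),
]
toy_levels = [
    lambda rec: rec[1],  # coarse: donor count
    lambda rec: rec[2],  # finer: bridge length
    lambda rec: rec[0],  # leaf: the record itself
]
nodes, edges = build_tree(toy, toy_levels)
print(len(nodes), len(edges))  # 11 10
```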
3. Export HTML
The HTML report is self-contained (inline SVG). That makes it easy to email or attach as supplementary material.
html = tree_to_html(tree)
# Preview in the notebook (works best locally)
HTML(html)
root
  DA=2 [1/2]
    topo [1/1]
      skeleton [1/2]
      skeleton [2/2]
  DA=3 [2/2]
    topo [1/2]
      skeleton [1/1]
    topo [2/2]
      skeleton [1/4]
      skeleton [2/4]
      skeleton [3/4]
      skeleton [4/4]
Save to a file
from pathlib import Path
out = Path("ligand_tree_demo.html")
out.write_text(html, encoding="utf-8")
out.resolve()
WindowsPath('D:/Work/Science/_Codes/ctopo/docs/tutorials/ligand_tree_demo.html')
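Before sharing a report, it can be useful to sanity-check that it really is self-contained. The sketch below is one possible check using the standard library's html.parser; it is not a cTopo feature. It assumes that "external" means a src attribute or a stylesheet link pointing at an http(s) or protocol-relative URL, and the names ExternalRefFinder and external_references are hypothetical.

```python
from html.parser import HTMLParser

REMOTE_PREFIXES = ("http://", "https://", "//")


class ExternalRefFinder(HTMLParser):
    """Collect (tag, url) pairs for resources loaded from a remote URL."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # src on any tag, or href on <link> (stylesheets, icons)
            is_resource = name == "src" or (tag == "link" and name == "href")
            if is_resource and value and value.startswith(REMOTE_PREFIXES):
                self.external.append((tag, value))


def external_references(html_text):
    """Return the list of remote resources referenced by html_text."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external


inline_page = '<html><body><svg width="10" height="10"><circle r="4"/></svg></body></html>'
linked_page = (
    '<html><head><link rel="stylesheet" href="https://example.org/style.css"></head>'
    '<body><img src="https://example.org/a.png"></body></html>'
)
print(external_references(inline_page))  # []
print(len(external_references(linked_page)))  # 2
```

Applied to the exported report, external_references(html) returning an empty list would be consistent with the inline-SVG design, under the assumptions stated above.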
4. Customizing levels
By default, build_ligand_tree() groups a dataset using a hierarchy that goes from coarse to fine representations:
denticity → topology → skeleton → ligand
The order of these families is fixed because each step adds information. In other words, keys are inherited: every node at a deeper level must refine its parent node at the previous level.
4.1 Available level families
Levels belong to four conceptual groups:
- Denticity. Number of donor atoms in a ligand. This is usually the first split, because denticity defines the coordination mode and strongly constrains possible skeletons/topologies.
- Topology has two variants:
  - bare topology: connectivity pattern only
  - topology with donor atoms (DAs): donors shown explicitly as DA nodes, so topologies are comparable across donor-element types
- Skeleton has three independent toggles:
  - donors: keep donor atom identities (instead of marked dummy nodes)
  - bonds: keep bond types inside the skeleton (instead of default single bonds)
  - skeleton: keep skeleton atom identities (instead of dummy nodes)
  - Inheritance rule: skeleton levels must be ordered from less detailed to more detailed (a more detailed skeleton key must refine a less detailed skeleton key).
- Ligand nodes are the leaves of the tree and correspond to actual ligands. If your dataset contains repeats (e.g. ligands extracted from many complexes), leaf nodes contain a count.
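The ordering and inheritance rules above can be expressed as a small pre-check on a level sequence. The following is only a sketch of the idea and not cTopo's own validation: FAMILY_ORDER, normalize, check_levels, and is_valid are hypothetical names, and the sketch assumes that "refines" means a strict superset of toggles within the same family. It does not check toggle consistency across families.

```python
FAMILY_ORDER = {"denticity": 0, "topo": 1, "skeleton": 2, "ligand": 3}


def normalize(level):
    """Turn 'denticity' or ('skeleton', 'donors') into (family, toggles)."""
    if isinstance(level, str):
        return level, frozenset()
    family, *toggles = level
    return family, frozenset(toggles)


def check_levels(levels):
    """Raise ValueError if the sequence breaks the coarse-to-fine rules."""
    previous_family, previous_toggles = None, frozenset()
    for level in levels:
        family, toggles = normalize(level)
        if family not in FAMILY_ORDER:
            raise ValueError(f"unknown level family: {family!r}")
        if previous_family is not None:
            if FAMILY_ORDER[family] < FAMILY_ORDER[previous_family]:
                raise ValueError(f"{family!r} cannot follow {previous_family!r}")
            # Within one family, a later level must add toggles (assumption).
            if family == previous_family and not toggles > previous_toggles:
                raise ValueError(
                    f"{sorted(toggles)} does not refine {sorted(previous_toggles)}"
                )
        previous_family, previous_toggles = family, toggles


def is_valid(levels):
    """Return True if check_levels accepts the sequence."""
    try:
        check_levels(levels)
    except ValueError:
        return False
    return True


print(is_valid(["denticity", ("topo",), ("skeleton",), ("skeleton", "donors"), "ligand"]))  # True
print(is_valid(["denticity", ("skeleton", "donors"), ("skeleton", "bonds"), "ligand"]))  # False
```

The second call fails because the bonds-only skeleton drops the donors toggle, so it is not a refinement of the level before it.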
4.2 Good examples
# topo + DA => skeleton + DA + bonds
levels = ['denticity', ('topo', 'donors'), ('skeleton', 'donors', 'bonds'), 'ligand']
tree = build_ligand_tree(ligands, levels=levels)
# topo => skeleton => skeleton + DA
levels = ['denticity', ('topo',), ('skeleton',), ('skeleton', 'donors'), 'ligand']
tree = build_ligand_tree(ligands, levels=levels)
4.3 Bad examples
levels = ['denticity', ('skeleton',), ('topo',), 'ligand']  # skeleton before topology
try:
    tree = build_ligand_tree(ligands, levels=levels)
except Exception as e:
    print('Invalid level specification: skeleton used before topology')
Invalid level specification: skeleton used before topology
levels = ['denticity', ('skeleton', 'donors'), ('skeleton', 'bonds'), 'ligand']  # lost DA toggle
try:
    tree = build_ligand_tree(ligands, levels=levels)
except Exception as e:
    print('Invalid level specification: lost DA toggle')
Invalid level specification: lost DA toggle
Takeaways
- Hierarchical grouping produces an interpretable map of ligand space.
- The HTML report is an easy “deliverable” for collaborators.
Next: Tutorial 04 demonstrates role-aware fingerprints and similarity on skeleton-only vs full-ligand views.