Tutorial 04 — Role-aware fingerprints (skeleton vs full ligand)¶
This notebook demonstrates how cTopo fingerprints can be configured to focus on different parts of a ligand:
- original (full ligand)
- skeleton (donors + cage-defining connectors)
- substituent (everything outside the skeleton)
Prerequisites¶
RDKit + cTopo installed. Fingerprinting itself is RDKit-free, but tutorials use RDKit for molecule parsing.
try:
    from rdkit import Chem
except ImportError as e:
    raise ImportError('RDKit is required for these tutorials.') from e
from ctopo import ligand_from_smiles
from ctopo.descriptors import MorganSpec, DEFAULT_PROPERTIES, make_fingerprinter
from ctopo.distances import tanimoto_similarity_bits
1. Two ligands that share the same skeleton but differ in substituents¶
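We take two ethylenediamine derivatives that share the N-C-C-N chelate backbone: butane-1,2-diamine (an ethyl group on a backbone carbon) and N,N'-dimethylethylenediamine (a methyl group on each nitrogen). The ethyl and methyl groups are classified as substituents, not part of the skeleton.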
from IPython.display import SVG, display
lig_a = ligand_from_smiles('[NH2:1]C(CC)C[NH2:2]')
lig_b = ligand_from_smiles('C[NH:1]CC[NH:2]C')
display(SVG(lig_a.visualize_ligand().svg))
display(SVG(lig_b.visualize_ligand().svg))
2. Configure Morgan fingerprints¶
We create three fingerprinters:
- fp_full: fingerprints the full ligand graph
- fp_skel: fingerprints the skeleton subgraph only
- fp_sub: fingerprints the substituents subgraph only
All emit bit fingerprints (sets of on-bit indices) so we can compute Tanimoto similarity.
fp_full = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='original',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)
fp_skel = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='skeleton',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)
fp_sub = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='substituent',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)
fA_full, fB_full = fp_full(lig_a), fp_full(lig_b)
fA_skel, fB_skel = fp_skel(lig_a), fp_skel(lig_b)
fA_sub, fB_sub = fp_sub(lig_a), fp_sub(lig_b)
sim_full = tanimoto_similarity_bits(fA_full, fB_full)
sim_skel = tanimoto_similarity_bits(fA_skel, fB_skel)
sim_sub = tanimoto_similarity_bits(fA_sub, fB_sub)
print(f'Full similarity: {sim_full:.3f}')
print(f'Skeleton similarity: {sim_skel:.3f}')
print(f'Substituents similarity: {sim_sub:.3f}')
Full similarity: 0.267
Skeleton similarity: 1.000
Substituents similarity: 0.333
Interpretation:
- sim_full captures everything, so the differing substituents reduce similarity.
- sim_skel is 1.000 here because the cage-defining N-C-C-N part is identical in both ligands.
- sim_sub isolates only substituent information, so it responds only to how the substituents differ.
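For readers who want to see what a Tanimoto score on bit fingerprints measures, here is a minimal pure-Python sketch. It is not the implementation behind tanimoto_similarity_bits; the function name tanimoto_on_bit_sets, the toy bit indices, and the convention for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto_on_bit_sets(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Toy fingerprints with made-up bit indices (not real cTopo output)
fp_x = {3, 17, 42, 101}
fp_y = {3, 42, 77}

# 2 shared bits out of 5 distinct bits
print(f"{tanimoto_on_bit_sets(fp_x, fp_y):.3f}")  # 0.400
```

A score of 1.000, as for the skeleton fingerprints above, means the two sets of on-bits are identical.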
3. Fingerprinter settings: what exists and when to use what¶
The high-level entry point is:
from ctopo.descriptors import make_fingerprinter, MorganSpec, AtomPairsSpec, DEFAULT_PROPERTIES
Internally, a fingerprinter does (in this order):
- builds a graph view (optional subgraph + relabeling)
- writes fingerprint bond codes depending on bond_mode
- computes atomic invariants (hashed from selected node attributes)
- resolves emit_from (which atoms are allowed to emit environments)
- runs the core algorithm (Morgan or AtomPairs)
- formats output (sparse_counts, folded_counts, bits)
3.1 Fingerprint kind: morgan vs atompairs¶
Morgan (ECFP-style)¶
- emits local environments around atoms up to a radius (MorganSpec.radius)
- sensitive to local substituents and "what is attached where"
- good default for diverse ligand sets or when local chemistry matters
Use Morgan when:
- donors/skeleton motifs vary across the dataset
- substituent chemistry affects your property (sterics, electronics, solubility proxies)
- you expect local modifications to matter more than global rearrangements
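To make "local environments up to a radius" concrete, the following is a simplified, pure-Python sketch of a Morgan-style iteration. It is an illustration only, not cTopo's or RDKit's algorithm: it ignores bond codes, does not remove redundant environments, and uses nested tuples instead of hashed integers as identifiers. The function name morgan_like_features and the toy graphs are assumptions.

```python
def morgan_like_features(adjacency, invariants, radius):
    """Count Morgan-style environments of radius 0..radius.

    adjacency:  {atom: [neighbour atoms]}
    invariants: {atom: sortable label}, e.g. an element symbol
    Returns {identifier: count}; identifiers are nested tuples, not hashes.
    """
    # Radius 0: each atom is described by its own invariant only.
    current = {atom: (0, invariants[atom]) for atom in adjacency}
    counts = {}
    for layer in range(radius + 1):
        for identifier in current.values():
            counts[identifier] = counts.get(identifier, 0) + 1
        if layer < radius:
            # Grow every environment by one bond: combine the atom's own
            # identifier with the sorted identifiers of its neighbours.
            current = {
                atom: (layer + 1, current[atom],
                       tuple(sorted(current[n] for n in adjacency[atom])))
                for atom in adjacency
            }
    return counts


# Heavy-atom graphs: N-C-C-N, and the same chain with a methyl carbon on atom 0
adj_a = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
inv_a = {0: "N", 1: "C", 2: "C", 3: "N"}
adj_b = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
inv_b = {0: "N", 1: "C", 2: "C", 3: "N", 4: "C"}

feats_a = morgan_like_features(adj_a, inv_a, radius=1)
feats_b = morgan_like_features(adj_b, inv_b, radius=1)

# Environments that appear only after methylation: the substituted N and the methyl C
print(len(set(feats_b) - set(feats_a)))  # 2
```

At radius 1 the unsubstituted nitrogen keeps the same environment in both graphs, which is the sense in which Morgan features are sensitive to local changes.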
AtomPairs¶
- emits features for all atom pairs: (distance, inv_i, inv_j)
- very sensitive to global arrangements and connectivity distances
- often useful when ligands share the same fragments but differ by how fragments are connected
Use AtomPairs when:
- dataset contains many “same building blocks, different wiring”
- you want to emphasize overall shape/topological distances over local environments
- you want a simpler model (fewer knobs: distance range only)
Caveat: cTopo’s AtomPairs is "ctopo-native": it's not meant to be RDKit-identical.
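The (distance, inv_i, inv_j) feature can be sketched in a few lines of pure Python. This is an illustration of the general atom-pair idea under simple assumptions (topological distance by breadth-first search, element symbols as invariants, unordered pairs); it is not cTopo's implementation, and the names shortest_path_lengths and atom_pair_features are hypothetical.

```python
from collections import Counter, deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: bond-count distance from source to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def atom_pair_features(adjacency, invariants):
    """Count (distance, inv_i, inv_j) over all unordered, connected atom pairs."""
    counts = Counter()
    atoms = sorted(adjacency)
    for pos, i in enumerate(atoms):
        dist = shortest_path_lengths(adjacency, i)
        for j in atoms[pos + 1:]:
            if j in dist:
                # Sort the two invariants so (i, j) and (j, i) give the same feature.
                inv_lo, inv_hi = sorted((invariants[i], invariants[j]))
                counts[(dist[j], inv_lo, inv_hi)] += 1
    return counts


chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
wiring_1 = atom_pair_features(chain, {0: "N", 1: "C", 2: "C", 3: "N"})  # N-C-C-N
wiring_2 = atom_pair_features(chain, {0: "N", 1: "C", 2: "N", 3: "C"})  # N-C-N-C

print(wiring_1[(3, "N", "N")], wiring_2[(3, "N", "N")])  # 1 0
print(wiring_1[(2, "N", "N")], wiring_2[(2, "N", "N")])  # 0 1
```

Both toy graphs contain the same atoms, yet the N...N distance feature differs, which is the "same building blocks, different wiring" case described above.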
3.2 graph_view: physically cut the graph before fingerprinting¶
graph_view selects which atoms are kept in the graph (then nodes are relabeled to 0..n-1, while original indices are preserved as node attribute idx).
Available options:
- "original": keep everything (except metals if keep_metals=False).
- "skeleton": keep DONOR + SKELETON atoms only.
- "substituent": keep SUBSTITUENT atoms only.
- "skeleton_alpha_substituents": keep skeleton plus one-bond "alpha" substituents attached to it.
- "substituent_alpha_skeleton": keep substituents plus one-bond "alpha" skeleton attached to them.
You can also pass an iterable of original atom indices. This is rarely practical: the indices generally differ from ligand to ligand, so you would need to build a separate fingerprinter for each ligand.
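The cut-and-relabel behaviour described at the start of this section can be pictured with a small pure-Python sketch. The data layout (a role per atom, bonds as index pairs), the role strings, and the function name cut_and_relabel are assumptions for illustration; cTopo's internal graph objects are not shown here.

```python
def cut_and_relabel(roles, bonds, keep_roles):
    """Keep atoms whose role is in keep_roles and relabel them 0..n-1.

    roles: {original_index: role string}
    bonds: list of (original_index, original_index) pairs
    Returns (nodes, edges); each node stores its original index under "idx".
    """
    kept = sorted(i for i, role in roles.items() if role in keep_roles)
    new_index = {orig: new for new, orig in enumerate(kept)}
    nodes = {new_index[i]: {"idx": i, "role": roles[i]} for i in kept}
    # A bond survives only if both of its atoms survive the cut.
    edges = [(new_index[i], new_index[j]) for i, j in bonds
             if i in new_index and j in new_index]
    return nodes, edges


# Toy heavy-atom graph: C-N-C-C-N-C with methyl carbons at both ends
roles = {0: "substituent", 1: "donor", 2: "skeleton",
         3: "skeleton", 4: "donor", 5: "substituent"}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

skel_nodes, skel_edges = cut_and_relabel(roles, bonds, {"donor", "skeleton"})
sub_nodes, sub_edges = cut_and_relabel(roles, bonds, {"substituent"})

print(skel_edges)                               # [(0, 1), (1, 2), (2, 3)]
print([n["idx"] for n in skel_nodes.values()])  # [1, 2, 3, 4]
print(sub_edges)                                # []
```

In this toy case the substituent view contains two disconnected atoms, because every bond that touched the skeleton is removed by the cut.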
When to use graph views
- Use "skeleton" to focus on the coordination cage-defining part.
- Use "skeleton_alpha_substituents" if you want a compromise: cage + immediate substitution pattern.
- Use "substituent" only if you intentionally want to compare substituent chemistry independently of the cage.
3.3 emit_from: restrict emission without cutting the graph¶
emit_from controls which atoms are allowed to emit features, while the algorithm can still "see" the whole graph (depending on the algorithm and radius).
Accepted values:
- None or "all": emit from all atoms
- named regimes (based on atom_type):
  - "skeleton" (includes donors by convention)
  - "substituent"
  - "donor"
  - "center" (metals; complexes only)
As with graph views, you can pass an iterable of original atom indices, with the same drawback: the indices are ligand-specific, so this is rarely the best solution.
Why this is useful
- Compared to cutting (graph_view="skeleton"), emission restriction can keep contextual information.
- Example: emit only from donors/skeleton, but still let substituents influence environments (Morgan radius will “reach out” if large enough).
Rule of thumb
- If you want to ignore some atoms completely, use graph_view.
- If you want those atoms to remain context but not be emission centers, use emit_from.
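The difference between the two rules can be seen on a toy graph. The sketch below is pure Python and is not cTopo code: it describes each emitting atom by a radius-1 environment (its element plus the sorted elements of its neighbours), and the helper name radius1_environments is hypothetical. For brevity the cut graph keeps the original atom indices instead of relabeling them.

```python
def radius1_environments(adjacency, elements, emitters):
    """Radius-1 environment (element, sorted neighbour elements) for each emitter."""
    return sorted(
        (elements[a], tuple(sorted(elements[n] for n in adjacency[a])))
        for a in emitters
    )


# Toy graph C-N-C-C-N: atom 0 is a methyl substituent on donor atom 1
elements = {0: "C", 1: "N", 2: "C", 3: "C", 4: "N"}
donors = [1, 4]

# emit_from-style: the whole graph stays visible, only donors emit
full_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
restricted = radius1_environments(full_adjacency, elements, donors)

# graph_view-style: the substituent is removed before fingerprinting
cut_adjacency = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
cut = radius1_environments(cut_adjacency, elements, donors)

print(restricted)  # [('N', ('C',)), ('N', ('C', 'C'))]
print(cut)         # [('N', ('C',)), ('N', ('C',))]
```

With emission restriction the methylated donor still "sees" its substituent; after cutting, both donors look identical.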
3.4 atomic_properties: atomic invariant construction (the most important knob)¶
Atomic invariants are hashed from node attributes. Allowed keys are:
atom_type, Z, degree, heavy_degree, num_pi_electrons, num_hs, charge, in_ring, aromatic
You can pass either:
- a single ordered list (applied to all atom types), or
- a dict mapping atom_type -> list[str] (fallback is DEFAULT_PROPERTIES)
Three practical presets (matching common RDKit-ish philosophies) with atom_type added:
# RDKit-like “low info” invariant (closer to classic AtomPairs setups)
INV_SHORT = ('atom_type', 'Z', 'heavy_degree', 'num_pi_electrons')
# RDKit-like “atom-centric” invariant (closer to classic Morgan setups)
INV_CLASSIC = ('atom_type', 'Z', 'degree', 'num_hs', 'charge', 'in_ring')
# “physics-ish / valence-ish” invariant (this is close to cTopo DEFAULT_PROPERTIES)
INV_VALENCE = ('atom_type', 'Z', 'heavy_degree', 'num_pi_electrons', 'num_hs', 'in_ring')
Atom-type–specific atomic properties (advanced)¶
In cTopo, atomic_properties can be provided not only as a single list/tuple (applied to all atoms), but also as a mapping from atom_type to a property list. This lets you emphasize different information for different ligand roles.
Typical use cases:
- keep a richer invariant for DONOR atoms (include charge, aromatic)
- keep a compact invariant for SUBSTITUENT atoms (avoid over-fragmenting by minor variations)
- keep a metal-aware invariant for CENTER atoms in complexes (include charge)
from ctopo.core.atom_types import AtomType
from ctopo.descriptors import make_fingerprinter, MorganSpec, DEFAULT_PROPERTIES
# Atom-type–specific invariant recipes
props_by_type = {
    # Donors: keep more chemistry (distinguish anionic vs neutral, aromatic vs aliphatic donors)
    int(AtomType.DONOR): ('atom_type', 'Z', 'degree', 'num_hs', 'charge', 'aromatic', 'in_ring'),
    # Skeleton: keep moderate detail (enough to distinguish aromatic vs aliphatic cages)
    int(AtomType.SKELETON): ('atom_type', 'Z', 'degree', 'num_hs', 'num_pi_electrons', 'in_ring'),
    # Substituents: intentionally coarse (avoid exploding feature space on peripheral variations)
    int(AtomType.SUBSTITUENT): ('atom_type', 'Z'),
}
# Any atom types not listed fall back to DEFAULT_PROPERTIES
fp = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_bond_types=True),
    atomic_properties=props_by_type,
    graph_view='original',
    emit_from='skeleton',
    bond_mode='skeleton_only',
    output='folded_counts',
    fp_size=2048,
)
This pattern is especially useful when you want role-aware fingerprints: donors and cage atoms carry chemically meaningful information, while substituents are treated more coarsely (or vice versa, depending on your task).
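The "list for everyone, or dict with fallback" behaviour can be pictured as a small lookup. This is a sketch under stated assumptions, not cTopo's code: FALLBACK_PROPERTIES is a made-up stand-in for DEFAULT_PROPERTIES, the integer role codes are hypothetical, and resolve_properties is an illustrative name.

```python
# Hypothetical stand-in for DEFAULT_PROPERTIES; the real default may differ.
FALLBACK_PROPERTIES = ("atom_type", "Z", "degree")

# Hypothetical integer role codes, used only in this sketch.
DONOR, SKELETON, SUBSTITUENT = 1, 2, 3


def resolve_properties(atomic_properties, atom_type, fallback=FALLBACK_PROPERTIES):
    """Return the ordered property keys to use for an atom of the given type."""
    if isinstance(atomic_properties, dict):
        # Per-type recipe: unlisted atom types use the fallback list.
        return tuple(atomic_properties.get(atom_type, fallback))
    # A single list or tuple applies to every atom type.
    return tuple(atomic_properties)


per_type = {
    DONOR: ("atom_type", "Z", "charge"),
    SUBSTITUENT: ("atom_type", "Z"),
}

print(resolve_properties(per_type, DONOR))             # ('atom_type', 'Z', 'charge')
print(resolve_properties(per_type, SKELETON))          # ('atom_type', 'Z', 'degree')
print(resolve_properties(["atom_type", "Z"], DONOR))   # ('atom_type', 'Z')
```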
Advice
- Always include atom_type for your use case (ligand roles + complexes).
- Add charge when you care about anionic donors vs neutral donors.
- Add aromatic if aromaticity is meaningful in your dataset (e.g., pyridine vs amine donors).
Important implementation note
- The property list is ordered and the hash depends on that order. So keep it fixed for reproducibility.
- Changing order is not “wrong”, it’s just a different invariant definition.
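The order dependence is easy to demonstrate with any ordered hash. The sketch below uses SHA-256 over a "key=value" string purely for illustration; how cTopo actually hashes node attributes is not shown in this tutorial, and invariant_from_properties and the toy node are assumptions.

```python
import hashlib


def invariant_from_properties(node, keys):
    """Hash the selected node attributes, in the given order, to a 64-bit integer."""
    payload = "|".join(f"{key}={node[key]}" for key in keys)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Toy node attributes (values are made up)
node = {"atom_type": 1, "Z": 7, "degree": 3, "charge": 0}

inv_1 = invariant_from_properties(node, ("atom_type", "Z", "degree"))
inv_2 = invariant_from_properties(node, ("Z", "atom_type", "degree"))
inv_1_again = invariant_from_properties(node, ("atom_type", "Z", "degree"))

print(inv_1 == inv_1_again)  # True: same keys, same order, same invariant
print(inv_1 == inv_2)        # False: same keys, different order
```

Fingerprints built with different key orders are each internally consistent, but they should not be compared with one another.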
3.5 Bond handling: bond_mode + Morgan’s use_bond_types¶
bond_mode writes a dedicated edge attribute used by Morgan as “bond code”:
- "all": keep all bond types
- "all_single": treat all bonds as single
- "skeleton_only": keep bond types only when both atoms are not substituents
- "substituent_only": keep bond types only when at least one atom is a substituent
This mainly matters if:
- you use Morgan, and
- MorganSpec.use_bond_types=True (default)
Advice
- Want the cage chemistry but don’t want substituent bond-detail noise? Use bond_mode="skeleton_only"
- Want pure topology-ish local environments (ignore bond orders everywhere)? Use bond_mode="all_single"
- AtomPairs doesn’t use bond types (it uses distances), so bond_mode is mostly irrelevant there.
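The four modes can be summarized as a single decision function. This is a sketch, not cTopo's code: it assumes that a bond whose type is not kept is coded as a single bond, it represents bond types as plain numbers, and the name bond_code and the role strings are hypothetical.

```python
SINGLE = 1  # assumed code for "bond type not kept"


def bond_code(mode, bond_order, role_i, role_j):
    """Return the bond code a Morgan-style algorithm would see for one bond."""
    touches_substituent = "substituent" in (role_i, role_j)
    if mode == "all":
        return bond_order
    if mode == "all_single":
        return SINGLE
    if mode == "skeleton_only":
        # Keep the type only when neither atom is a substituent.
        return SINGLE if touches_substituent else bond_order
    if mode == "substituent_only":
        # Keep the type only when at least one atom is a substituent.
        return bond_order if touches_substituent else SINGLE
    raise ValueError(f"unknown bond_mode: {mode!r}")


# A double bond inside the skeleton vs a double bond to a substituent atom
for mode in ("all", "all_single", "skeleton_only", "substituent_only"):
    inside = bond_code(mode, 2, "skeleton", "donor")
    to_sub = bond_code(mode, 2, "skeleton", "substituent")
    print(f"{mode:17s} skeleton bond -> {inside}, substituent bond -> {to_sub}")
```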
3.6 Output formats: sparse_counts vs folded_counts vs bits¶
cTopo’s canonical internal representation is sparse counts: {feature_id: count}.
- output="sparse_counts": best for analysis, debugging, and if you want to do your own vectorization.
- output="folded_counts": hash-folds features to a fixed size (fp_size) and keeps counts. Has two formats:
  - folded_format="dict": bit -> count dictionary;
  - "numpy": dense array.
- output="bits": hash-folds to fp_size and keeps only presence/absence. Has three formats:
  - bits_format="set": (fast for Tanimoto);
  - "dict": bit -> presence/absence dictionary;
  - "numpy": dense 0/1 array.
Advice
- Similarity search / clustering: use bits (and Tanimoto)
- Linear ML models that like counts: use folded_counts (often numpy)
- Inspectability / research prototyping: use sparse_counts
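The relationship between the three outputs can be sketched in pure Python. The folding rule used here (feature_id modulo fp_size) is an assumption chosen for simplicity; cTopo's actual folding hash is not specified in this tutorial, and the function names and toy feature ids are hypothetical.

```python
def fold_counts(sparse_counts, fp_size):
    """Fold {feature_id: count} into {bit: count}; colliding features add up."""
    folded = {}
    for feature_id, count in sparse_counts.items():
        bit = feature_id % fp_size  # assumed folding rule for this sketch
        folded[bit] = folded.get(bit, 0) + count
    return folded


def fold_bits(sparse_counts, fp_size):
    """Fold to a set of on-bit indices (presence/absence only)."""
    return {feature_id % fp_size for feature_id in sparse_counts}


def to_dense(folded, fp_size):
    """Expand a {bit: count} dict into a dense list of length fp_size."""
    dense = [0] * fp_size
    for bit, count in folded.items():
        dense[bit] = count
    return dense


# Toy sparse counts with made-up feature ids; 5 and 1029 collide at fp_size=1024
sparse = {5: 2, 1029: 1, 2000: 3}

folded = fold_counts(sparse, fp_size=1024)
bits = fold_bits(sparse, fp_size=1024)

print(folded)        # {5: 3, 976: 3}
print(sorted(bits))  # [5, 976]
print(sum(to_dense(folded, 1024)))  # 6
```

The example also shows why sparse counts are the most inspectable form: after folding, features 5 and 1029 can no longer be told apart.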
3.7 What else matters (often overlooked)¶
- keep_metals (Complexes): if you want metal identity to participate in fingerprints, use graph_view="original" and keep_metals=True and include emit_from="center" or allow all. If you want “ligand-only” features, set keep_metals=False.
- Morgan-specific knobs (MorganSpec)
  - radius: biggest effect after invariants
  - use_chirality: usually irrelevant for topology/skeleton-centric use, but useful for chiral ligands
  - include_redundant_environments: normally keep default (False)
- bit_info (Morgan provenance): you can pass a dict to collect “which atoms/radius produced bit X”; this is very useful for debugging, but the dict can become large.
4. Where this becomes useful¶
For dataset analysis and ML you can build feature sets like:
- skeleton-only fingerprint (geometry / cage tendency)
- substituent-only fingerprint (tuning solubility, sterics, electronics)
- concatenation of both
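One simple way to concatenate two bit fingerprints is to shift the second one by fp_size so the two blocks never share indices. The sketch below is a generic illustration with made-up bit indices, not a cTopo function; concatenate_bit_sets is a hypothetical name.

```python
def concatenate_bit_sets(skeleton_bits, substituent_bits, fp_size):
    """Combine two bit sets into one of length 2 * fp_size.

    Skeleton bits keep their indices; substituent bits are shifted by fp_size.
    """
    return set(skeleton_bits) | {fp_size + bit for bit in substituent_bits}


# Toy on-bit sets (made-up indices, not real cTopo output)
skeleton_bits = {3, 10}
substituent_bits = {3, 500}

combined = concatenate_bit_sets(skeleton_bits, substituent_bits, fp_size=1024)
print(sorted(combined))  # [3, 10, 1027, 1524]
```

Bit 3 is set in both inputs but stays distinguishable after concatenation (3 vs 1027), so a downstream model can weight skeleton and substituent information separately.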
Takeaways¶
- cTopo can fingerprint different structural roles separately.
- Skeleton-focused features are natural for coordination chemistry.
Next: Tutorial 05 covers complexes (dative bond convention) and extracting ligands from complex datasets.