Tutorial 04 — Role-aware fingerprints (skeleton vs full ligand)¶

This notebook demonstrates how cTopo fingerprints can be configured to focus on different parts of a ligand:

original (full ligand)
skeleton (donors + cage-defining connectors)
substituent (everything outside the skeleton)

[Figure: Fingerprint views]

Prerequisites¶

RDKit + cTopo installed. Fingerprinting itself is RDKit-free, but tutorials use RDKit for molecule parsing.

In [1]:

try:
    from rdkit import Chem
except ImportError as e:
    raise ImportError('RDKit is required for these tutorials.') from e

from ctopo import ligand_from_smiles
from ctopo.descriptors import MorganSpec, DEFAULT_PROPERTIES, make_fingerprinter
from ctopo.distances import tanimoto_similarity_bits

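1. Two ligands that share the same skeleton but differ in substituents¶

We take two ethylenediamine derivatives: butane-1,2-diamine (an ethyl group on the carbon backbone) and N,N'-dimethylethylenediamine (a methyl group on each nitrogen). The ethyl and methyl groups are classified as substituents, not as part of the skeleton.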

In [2]:

from IPython.display import SVG, display

lig_a = ligand_from_smiles('[NH2:1]C(CC)C[NH2:2]')
lig_b = ligand_from_smiles('C[NH:1]CC[NH:2]C')

display(SVG(lig_a.visualize_ligand().svg))
display(SVG(lig_b.visualize_ligand().svg))

2. Configure Morgan fingerprints¶

We create three fingerprinters:

fp_full: fingerprints the full ligand graph
fp_skel: fingerprints the skeleton subgraph only
fp_sub : fingerprints the substituents subgraph only

All emit bit fingerprints (sets of on-bit indices) so we can compute Tanimoto similarity.

In [3]:

fp_full = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='original',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)

fp_skel = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='skeleton',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)

fp_sub = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_chirality=False),
    atomic_properties=['atom_type', 'Z', 'degree', 'num_pi_electrons'],
    graph_view='substituent',
    bond_mode='all',
    output='bits',
    fp_size=1024,
)

In [4]:

fA_full, fB_full = fp_full(lig_a), fp_full(lig_b)
fA_skel, fB_skel = fp_skel(lig_a), fp_skel(lig_b)
fA_sub,  fB_sub  = fp_sub(lig_a),  fp_sub(lig_b)

sim_full = tanimoto_similarity_bits(fA_full, fB_full)
sim_skel = tanimoto_similarity_bits(fA_skel, fB_skel)
sim_sub  = tanimoto_similarity_bits(fA_sub,  fB_sub)

print(f'Full similarity: {sim_full:.3f}')
print(f'Skeleton similarity: {sim_skel:.3f}')
print(f'Substituents similarity: {sim_sub:.3f}')

Full similarity: 0.267
Skeleton similarity: 1.000
Substituents similarity: 0.333

Interpretation:

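sim_full (0.267) compares the whole ligand graph, so the differing substituents (ethyl on carbon vs methyl on nitrogen) lower the similarity.
sim_skel (1.000) is unaffected because the donor and connector atoms are the same in both ligands.
sim_sub (0.333) compares substituent atoms only, so it reflects nothing but the difference between the substituent groups.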
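
For reference, Tanimoto similarity on bit fingerprints stored as sets of on-bit indices is the size of the intersection divided by the size of the union. The sketch below is a plain-Python illustration, not the code behind tanimoto_similarity_bits; the function name and the convention for two empty fingerprints are assumptions.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    if not a and not b:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


fp_a = {3, 17, 42, 101}
fp_b = {3, 42, 250}

# 2 shared bits out of 5 distinct bits
print(f"{tanimoto(fp_a, fp_b):.3f}")  # 0.400
```

A value of 1.000, as for the skeleton view above, means the two on-bit sets are identical.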

3. Fingerprinter settings: what exists and when to use what¶

The high-level entry point is:

from ctopo.descriptors import make_fingerprinter, MorganSpec, AtomPairsSpec, DEFAULT_PROPERTIES

Internally, a fingerprinter performs the following steps, in this order:

builds a graph view (optional subgraph + relabeling)
writes fingerprint bond codes depending on bond_mode
computes atomic invariants (hashed from selected node attributes)
resolves emit_from (which atoms are allowed to emit environments)
runs the core algorithm (Morgan or AtomPairs)
formats output (sparse_counts, folded_counts, bits)

3.1 Fingerprint kind: `morgan` vs `atompairs`¶

Morgan (ECFP-style)¶

emits local environments around atoms up to a radius (MorganSpec.radius)
sensitive to local substituents and "what is attached where"
good default for diverse ligand sets or when local chemistry matters

Use Morgan when:

donors/skeleton motifs vary across the dataset
substituent chemistry affects your property (sterics, electronics, solubility proxies)
you expect local modifications to matter more than global rearrangements

AtomPairs¶

emits features for all atom pairs: (distance, inv_i, inv_j)
very sensitive to global arrangements and connectivity distances
often useful when ligands share the same fragments but differ by how fragments are connected

Use AtomPairs when:

dataset contains many “same building blocks, different wiring”
you want to emphasize overall shape/topological distances over local environments
you want a simpler model (fewer knobs: distance range only)

Caveat: cTopo’s AtomPairs is "ctopo-native": it's not meant to be RDKit-identical.
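
To make the (distance, inv_i, inv_j) feature concrete, here is a toy atom-pairs counter in plain Python. It is only a sketch of the general idea, not cTopo's implementation: the function names are hypothetical, invariants are plain strings instead of hashed integers, and the two invariants are sorted so that each unordered pair gives one feature.

```python
from collections import Counter, deque


def shortest_path_lengths(adj, source):
    """BFS distances (in bonds) from source to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def atom_pair_features(adj, invariants, max_distance=None):
    """Count (distance, inv_i, inv_j) triples over all unordered atom pairs."""
    counts = Counter()
    for i in adj:
        for j, d in shortest_path_lengths(adj, i).items():
            if j <= i:
                continue  # visit each unordered pair once
            if max_distance is not None and d > max_distance:
                continue
            lo, hi = sorted((invariants[i], invariants[j]))
            counts[(d, lo, hi)] += 1
    return counts


# Toy graph: an N-C-C-N chain (heavy atoms only)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
invariants = {0: "N", 1: "C", 2: "C", 3: "N"}
features = atom_pair_features(adj, invariants)
print(features[(3, "N", "N")])  # 1: the two nitrogens are three bonds apart
```

Rewiring the same atoms changes the distances and therefore the features, which is why this kind of fingerprint responds to "same building blocks, different wiring".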

3.2 `graph_view`: physically cut the graph before fingerprinting¶

graph_view selects which atoms are kept in the graph (then nodes are relabeled to 0..n-1, while original indices are preserved as node attribute idx).

Available options:

"original": keep everything (except metals if keep_metals=False).
"skeleton": keep DONOR + SKELETON atoms only.
"substituent": keep SUBSTITUENT atoms only.
"skeleton_alpha_substituents": keep skeleton plus one-bond "alpha" substituents attached to it.
"substituent_alpha_skeleton": keep substituents plus one-bond "alpha" skeleton attached to them.

You can also pass an iterable of original atom indices. This is rarely practical: the indices generally differ from ligand to ligand, so you would need to build a separate fingerprinter for each ligand.
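
The "cut, then relabel" behaviour can be pictured with a small plain-Python sketch. Everything here is illustrative: the induced_view helper, the dict-based graph, and the string role labels are assumptions and do not reflect cTopo's internal data structures.

```python
def induced_view(atoms, bonds, keep_roles):
    """Keep atoms whose role is in keep_roles and relabel them to 0..n-1.

    The original index of each kept atom is stored under the key "idx".
    """
    kept = [i for i in sorted(atoms) if atoms[i]["role"] in keep_roles]
    new_index = {old: new for new, old in enumerate(kept)}
    view_atoms = {new_index[i]: {**atoms[i], "idx": i} for i in kept}
    view_bonds = [
        (new_index[u], new_index[v])
        for u, v in bonds
        if u in new_index and v in new_index
    ]
    return view_atoms, view_bonds


# Heavy atoms of N,N'-dimethylethylenediamine: C-N-C-C-N-C
atoms = {
    0: {"Z": 6, "role": "SUBSTITUENT"},
    1: {"Z": 7, "role": "DONOR"},
    2: {"Z": 6, "role": "SKELETON"},
    3: {"Z": 6, "role": "SKELETON"},
    4: {"Z": 7, "role": "DONOR"},
    5: {"Z": 6, "role": "SUBSTITUENT"},
}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

skel_atoms, skel_bonds = induced_view(atoms, bonds, {"DONOR", "SKELETON"})
print(skel_bonds)            # [(0, 1), (1, 2), (2, 3)]
print(skel_atoms[0]["idx"])  # 1: new node 0 was original atom 1
```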

When to use graph views

Use "skeleton" to focus on the coordination cage-defining part.
Use "skeleton_alpha_substituents" if you want a compromise: cage + immediate substitution pattern.
Use "substituent" only if you intentionally want to compare substituent chemistry independently of the cage.

3.3 `emit_from`: restrict emission without cutting the graph¶

emit_from controls which atoms are allowed to emit features, while the algorithm can still "see" the whole graph (depending on the algorithm and radius).

Accepted values:

None or "all": emit from all atoms
named regimes (based on atom_type):
- "skeleton" (includes donors by convention)
- "substituent"
- "donor"
- "center" (metals; complexes only)

As with graph_view, you can pass an iterable of original atom indices, but for the same reason this is rarely the best choice.

Why this is useful

Compared to cutting (graph_view="skeleton"), emission restriction can keep contextual information.
Example: emit only from donors/skeleton, but still let substituents influence environments (Morgan radius will “reach out” if large enough).

Rule of thumb

If you want to ignore some atoms completely, use graph_view.
If you want those atoms to remain context but not be emission centers, use emit_from.
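
The difference between cutting and restricting emission can be shown with a toy Morgan-like iteration. This is a simplified sketch under stated assumptions, not cTopo's algorithm: toy_morgan, cut, and the string labels are hypothetical, bond codes are ignored, and no duplicate-environment removal is done.

```python
import hashlib


def h(*parts):
    """Deterministic hash of a tuple of parts to an int (illustrative only)."""
    digest = hashlib.sha256(repr(parts).encode()).hexdigest()
    return int(digest[:12], 16)


def toy_morgan(adj, labels, radius, emit_from=None):
    """Return the set of environment ids emitted by the atoms in emit_from."""
    emitters = set(adj) if emit_from is None else set(emit_from)
    ids = {a: h(labels[a]) for a in adj}
    features = {ids[a] for a in emitters}
    for _ in range(radius):
        # Every atom is updated from its neighbours, emitter or not.
        ids = {a: h(ids[a], tuple(sorted(ids[n] for n in adj[a]))) for a in adj}
        features |= {ids[a] for a in emitters}
    return features


def cut(adj, keep):
    """Induced subgraph on the atoms in keep (indices unchanged)."""
    keep = set(keep)
    return {a: [n for n in adj[a] if n in keep] for a in adj if a in keep}


plain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}                # N-C-C-N
plain_labels = {0: "N", 1: "C", 2: "C", 3: "N"}
methylated = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}  # CH3 on atom 0
meth_labels = {**plain_labels, 4: "CH3"}
skeleton = [0, 1, 2, 3]

ref_fp = toy_morgan(plain, plain_labels, radius=1)
cut_fp = toy_morgan(cut(methylated, skeleton), plain_labels, radius=1)
ctx_fp = toy_morgan(methylated, meth_labels, radius=1, emit_from=skeleton)

print(cut_fp == ref_fp)  # True: cutting removes every trace of the methyl group
print(ctx_fp == ref_fp)  # False: the methyl group still shapes atom 0's environment
```

In the emission-restricted case the methyl carbon never emits a feature of its own, yet the radius-1 environment of the nitrogen it is attached to differs from the unsubstituted case.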

3.4 `atomic_properties`: atomic invariant construction (the most important knob)¶

Atomic invariants are hashed from node attributes. Allowed keys are:

atom_type, Z, degree, heavy_degree, num_pi_electrons, num_hs, charge, in_ring, aromatic

You can pass either:

a single ordered list (applied to all atom types), or
a dict mapping atom_type -> list[str] (fallback is DEFAULT_PROPERTIES)
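
One way to picture the two accepted forms is a small resolver that returns the ordered property list for a given atom type. This is a hypothetical sketch: resolve_properties and TOY_DEFAULT are made-up names, the string atom types are a simplification, and the real DEFAULT_PROPERTIES in cTopo may be structured differently.

```python
# Hypothetical stand-in; not the actual contents of cTopo's DEFAULT_PROPERTIES.
TOY_DEFAULT = ("atom_type", "Z", "heavy_degree")


def resolve_properties(atomic_properties, atom_type, default=TOY_DEFAULT):
    """Pick the ordered property list that applies to one atom type."""
    if isinstance(atomic_properties, dict):
        # Per-type mapping: unlisted atom types fall back to the default.
        return tuple(atomic_properties.get(atom_type, default))
    # Single ordered list: the same properties apply to every atom type.
    return tuple(atomic_properties)


shared = ["atom_type", "Z", "degree"]
per_type = {"DONOR": ["atom_type", "Z", "charge"]}

print(resolve_properties(shared, "DONOR"))         # ('atom_type', 'Z', 'degree')
print(resolve_properties(per_type, "DONOR"))       # ('atom_type', 'Z', 'charge')
print(resolve_properties(per_type, "SUBSTITUENT")) # ('atom_type', 'Z', 'heavy_degree')
```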

Three practical presets (matching common RDKit-ish philosophies) with atom_type added:

In [5]:

# RDKit-like “low info” invariant (closer to classic AtomPairs setups)
INV_SHORT = ('atom_type', 'Z', 'heavy_degree', 'num_pi_electrons')

# RDKit-like “atom-centric” invariant (closer to classic Morgan setups)
INV_CLASSIC = ('atom_type', 'Z', 'degree', 'num_hs', 'charge', 'in_ring')

# “physics-ish / valence-ish” invariant (this is close to cTopo DEFAULT_PROPERTIES)
INV_VALENCE = ('atom_type', 'Z', 'heavy_degree', 'num_pi_electrons', 'num_hs', 'in_ring')

Atom-type–specific atomic properties (advanced)¶

In cTopo, atomic_properties can be provided not only as a single list/tuple (applied to all atoms), but also as a mapping from atom_type to a property list. This lets you emphasize different information for different ligand roles.

Typical use cases:

keep a richer invariant for DONOR atoms (include charge, aromatic)
keep a compact invariant for SUBSTITUENT atoms (avoid over-fragmenting by minor variations)
keep a metal-aware invariant for CENTER atoms in complexes (include charge)

In [6]:

from ctopo.core.atom_types import AtomType
from ctopo.descriptors import make_fingerprinter, MorganSpec, DEFAULT_PROPERTIES

# Atom-type–specific invariant recipes
props_by_type = {
    # Donors: keep more chemistry (distinguish anionic vs neutral, aromatic vs aliphatic donors)
    int(AtomType.DONOR): ('atom_type', 'Z', 'degree', 'num_hs', 'charge', 'aromatic', 'in_ring'),

    # Skeleton: keep moderate detail (enough to distinguish aromatic vs aliphatic cages)
    int(AtomType.SKELETON): ('atom_type', 'Z', 'degree', 'num_hs', 'num_pi_electrons', 'in_ring'),

    # Substituents: intentionally coarse (avoid exploding feature space on peripheral variations)
    int(AtomType.SUBSTITUENT): ('atom_type', 'Z'),
}

# Any atom types not listed fall back to DEFAULT_PROPERTIES
fp = make_fingerprinter(
    kind='morgan',
    spec=MorganSpec(radius=2, use_bond_types=True),
    atomic_properties=props_by_type,
    graph_view='original',
    emit_from='skeleton',
    bond_mode='skeleton_only',
    output='folded_counts',
    fp_size=2048,
)

This pattern is especially useful when you want role-aware fingerprints: donors and cage atoms carry chemically meaningful information, while substituents are treated more coarsely (or vice versa, depending on your task).

Advice

Always include atom_type for your use case (ligand roles + complexes).
Add charge when you care about anionic donors vs neutral donors.
Add aromatic if aromaticity is meaningful in your dataset (e.g., pyridine vs amine donors).

Important implementation note

The property list is ordered and the hash depends on that order. So keep it fixed for reproducibility.
Changing order is not “wrong”, it’s just a different invariant definition.
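
The order dependence is easy to reproduce with any hash over an ordered tuple. The sketch below uses hashlib for a hash that is stable across runs; the invariant helper and the node dictionary are illustrative assumptions, not cTopo's actual hashing scheme.

```python
import hashlib


def invariant(node, properties):
    """Hash the selected node attributes, in the given order, to an integer."""
    values = tuple(node[p] for p in properties)
    digest = hashlib.sha256(repr(values).encode()).digest()
    return int.from_bytes(digest[:8], "big")


node = {"atom_type": 1, "Z": 7, "degree": 3, "num_hs": 1}

a = invariant(node, ("atom_type", "Z", "degree", "num_hs"))
b = invariant(node, ("Z", "atom_type", "degree", "num_hs"))

print(a == b)  # False: same attributes, different order, different invariant
```

Fingerprints built with differently ordered property lists are therefore not comparable with each other, even though each is internally consistent.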

3.5 Bond handling: `bond_mode` + Morgan’s `use_bond_types`¶

bond_mode writes a dedicated edge attribute used by Morgan as “bond code”:

"all": keep all bond types
"all_single": treat all bonds as single
"skeleton_only": keep bond types only when both atoms are not substituents
"substituent_only": keep bond types only when at least one atom is a substituent

This mainly matters if:

you use Morgan, and
MorganSpec.use_bond_types=True (default)

Advice

Want the cage chemistry but don’t want substituent bond-detail noise? Use bond_mode="skeleton_only"
Want pure topology-ish local environments (ignore bond orders everywhere)? Use bond_mode="all_single"
AtomPairs doesn’t use bond types (it uses distances), so bond_mode is mostly irrelevant there.
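
The four modes can be summarized as a small decision function. This is a sketch under a loud assumption: wherever a mode does not "keep" the bond type, the bond is coded as single. The source does not state what code is used in that case, and bond_code, SINGLE, and the string role labels are hypothetical names.

```python
SUBSTITUENT = "SUBSTITUENT"
SINGLE = 1.0  # assumption: bonds whose type is not kept are coded as single


def bond_code(order, role_u, role_v, bond_mode):
    """Return the bond code a Morgan-style algorithm would see for one bond."""
    has_substituent = SUBSTITUENT in (role_u, role_v)
    if bond_mode == "all":
        return order
    if bond_mode == "all_single":
        return SINGLE
    if bond_mode == "skeleton_only":
        # Keep the type only when neither atom is a substituent.
        return SINGLE if has_substituent else order
    if bond_mode == "substituent_only":
        # Keep the type only when at least one atom is a substituent.
        return order if has_substituent else SINGLE
    raise ValueError(f"unknown bond_mode: {bond_mode!r}")


# A double bond between a skeleton atom and a substituent atom
print(bond_code(2.0, "SKELETON", "SUBSTITUENT", "skeleton_only"))     # 1.0
print(bond_code(2.0, "SKELETON", "SUBSTITUENT", "substituent_only"))  # 2.0
```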

3.6 Output formats: `sparse_counts` vs `folded_counts` vs `bits`¶

cTopo’s canonical internal representation is sparse counts: {feature_id: count}.

output="sparse_counts": best for analysis, debugging, and if you want to do your own vectorization.
output="folded_counts": hash-folds features to a fixed size (fp_size) and keeps counts. Two formats are available:
- folded_format="dict": bit -> count dictionary
- folded_format="numpy": dense array
output="bits": hash-folds to fp_size and keeps only presence/absence. Three formats are available:
- bits_format="set": set of on-bit indices (fast for Tanimoto)
- bits_format="dict": bit -> presence/absence dictionary
- bits_format="numpy": dense 0/1 array

Advice

Similarity search / clustering: use bits (and Tanimoto)
Linear ML models that like counts: use folded_counts (often numpy)
Inspectability / research prototyping: use sparse_counts
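
The relationship between the three outputs can be sketched in a few lines. The folding rule used here (feature_id modulo fp_size) is an assumption chosen for illustration; cTopo's actual folding hash may differ, and the helper names are hypothetical.

```python
from collections import Counter


def fold_counts(sparse, fp_size):
    """Fold {feature_id: count} into {bit: count}; colliding features add up."""
    folded = Counter()
    for feature_id, count in sparse.items():
        folded[feature_id % fp_size] += count  # assumed folding rule
    return dict(folded)


def to_bits(sparse, fp_size):
    """Fold to a set of on-bit indices, discarding counts."""
    return {feature_id % fp_size for feature_id in sparse}


sparse = {5: 2, 1029: 1, 70000: 3}

folded = fold_counts(sparse, 1024)
bits = to_bits(sparse, 1024)
dense = [folded.get(i, 0) for i in range(1024)]  # dense count vector

print(folded)  # {5: 3, 368: 3}: features 5 and 1029 collide on bit 5
print(bits)
```

Folding loses information whenever two features land on the same bit, which is one reason sparse_counts is the better choice for inspection.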

3.7 What else matters (often overlooked)¶

keep_metals (complexes): if you want metal identity to participate in the fingerprint, use graph_view="original" with keep_metals=True, and either set emit_from="center" or leave emission unrestricted. If you want ligand-only features, set keep_metals=False.
Morgan-specific knobs (MorganSpec):
- radius: has the biggest effect after the invariants
- use_chirality: usually irrelevant for topology- or skeleton-centric work, but useful for chiral ligands
- include_redundant_environments: normally keep the default (False)
bit_info (Morgan provenance): you can pass a dict to collect which atoms and radius produced each bit. This is very useful for debugging, but the dict can become large.

4. Where this becomes useful¶

For dataset analysis and ML you can build feature sets like:

skeleton-only fingerprint (geometry / cage tendency)
substituent-only fingerprint (tuning solubility, sterics, electronics)
concatenation of both
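
For the concatenation option, one simple approach is to offset the second fingerprint's bit indices by the size of the first so the two blocks cannot collide. The concat_bits helper below is a hypothetical illustration in plain Python, not a cTopo function.

```python
def concat_bits(skel_bits, sub_bits, skel_size):
    """Concatenate two bit sets by shifting the second block past the first."""
    return set(skel_bits) | {skel_size + b for b in sub_bits}


skel_bits = {1, 5}  # on-bits of a 1024-bit skeleton fingerprint
sub_bits = {1, 7}   # on-bits of a 1024-bit substituent fingerprint

combined = concat_bits(skel_bits, sub_bits, skel_size=1024)
print(sorted(combined))  # [1, 5, 1025, 1031]
```

Keeping the blocks separate means a model can weight skeleton and substituent features independently.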

Takeaways¶

cTopo can fingerprint different structural roles separately.
Skeleton-focused features are natural for coordination chemistry.

Next: Tutorial 05 covers complexes (dative bond convention) and extracting ligands from complex datasets.
