
Neo4j property graph

analysis.json is one self-contained file: to query it you load the whole thing into memory and walk it. That works for a single project and falls over across a portfolio. --emit neo4j projects the same in-memory PyApplication — same symbol table, same call graph — into a labeled property graph, so many applications live in one database and you query across all of them with Cypher instead of parsing giant JSON blobs.

Figure: schema diagram. Node labels and properties: :PyApplication (name, schema_version); :PyModule (module_name, content_hash); :PyClass :PySymbol (name, base_classes); :PyAttribute (name, type); :PyCallable :PySymbol (signature, cyclomatic_complexity); :PyDecorator (name); :PyCallSite (method_name, receiver_type); :PyExternal (name, module). Relationship types: PY_HAS_MODULE, PY_DECLARES, PY_HAS_ATTRIBUTE, PY_HAS_METHOD, PY_DECORATED_BY, PY_HAS_CALLSITE, PY_RESOLVES_TO, PY_CALLS.
The analysis is a Neo4j property graph: every node carries a label (its color) and properties; every relationship carries a type. The dashed ring marks an entrypoint; the PY_CALLS edge is the resolved call graph.

This guide covers both ways to populate that graph (a self-contained snapshot and a live incremental push), how --app-name keeps many applications safely in one database, the version-stamped schema contract, and how the CLDK Python SDK reads the graph back without re-analyzing anything. For the full node-and-relationship topology, see the output schema reference.

--emit neo4j is an alternative to the default analysis.json, selected by the --emit enum: canpy builds one analysis in memory and then projects it. The mapping is faithful — it is the same model, not a new analysis:

  • Labels are namespaced. Every node label is Py-prefixed and every relationship type is PY_-prefixed — :PyModule, :PyClass, :PyCallable, PY_CALLS, PY_DECLARES — so the Java, TypeScript, and Python analyzers can share one database without label or relationship-type collisions.
  • Declarations are keyed by signature. :PyClass, :PyCallable, and :PyExternal are all MERGEd under a shared :PySymbol label keyed by signature — the very identity used in the symbol table and call graph. That is what lets call edges, inheritance, and declaration containment reference a symbol without duplicating it.
  • Ghost nodes become :PyExternal. Third-party and RPC endpoints that the in-memory model keeps as ghost nodes are materialized authoritatively as :PyExternal nodes, carrying name and module. A PyCallSite resolves via PY_RESOLVES_TO to either a real :PyCallable or an external, and PY_CALLS edges to externals survive the projection.
  • One application, one anchor. Everything hangs off a single :PyApplication node whose name is your --app-name. That node also carries schema_version so a consumer can check the contract it is reading against.
flowchart TB
    APP["PyApplication (in memory)"]
    APP -->|"--emit json"| J["analysis.json / .msgpack"]
    APP -.->|"--emit neo4j"| PG["Labeled property graph"]
    PG --> SNAP["graph.cypher snapshot<br/>(no --neo4j-uri)"]
    PG --> BOLT["live Bolt push<br/>(--neo4j-uri, incremental)"]
    APP -->|"--emit schema"| SCH["schema.json contract"]

The graph splits analysis into two independent halves. The producer is a canpy --emit neo4j run — the heavy step that walks source, resolves with Jedi (and optionally CodeQL), and writes the graph. It runs out of band: a CI step, or a Kubernetes Job or CronJob on each commit, pushing app-scoped subgraphs over Bolt into a shared Neo4j.

The consumers — agents, dashboards, and the CLDK Python SDK — are lightweight read-only clients. They never run the analyzer; they only query the graph. Because the push is incremental and app-scoped, many producer jobs write into one cluster while many consumers fan out from it, and the two scale independently.

flowchart LR
    subgraph Producers["Producers (out of band)"]
      P1["canpy --emit neo4j<br/>service-a"]
      P2["canpy --emit neo4j<br/>service-b"]
    end
    DB[("Neo4j<br/>(shared cluster)")]
    subgraph Consumers["Consumers (read-only)"]
      C1["CLDK SDK"]
      C2["agents"]
      C3["dashboards"]
    end
    P1 -->|"Bolt, incremental"| DB
    P2 -->|"Bolt, incremental"| DB
    DB --> C1
    DB --> C2
    DB --> C3

--app-name sets the name of the single :PyApplication root node for this graph. It is the merge key (uniqueness-constrained), and everything else hangs off it via PY_HAS_MODULE. When omitted it defaults to the basename of the resolved --input directory:

canpy --input ./my-service --emit neo4j --app-name my-service
# the :PyApplication anchor is named "my-service"
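The default described above can be sketched in a few lines of Python. This is only an illustration of the "basename of the resolved --input directory" rule; the helper name `default_app_name` is hypothetical and is not canpy's actual implementation.

```python
from pathlib import Path
from typing import Optional


def default_app_name(input_dir: str, app_name: Optional[str] = None) -> str:
    """Return the explicit name if given, else the basename of the resolved input path."""
    if app_name:
        return app_name
    # resolve() normalizes "." segments and trailing slashes before taking the basename
    return Path(input_dir).resolve().name
```

Passing an explicit name always wins, which is the safer choice when the checkout directory name varies between CI runners.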

The anchor name also scopes every graph mutation, which is what makes one shared database multi-tenant by construction — applications never clobber each other:

  • The graph.cypher snapshot wipes only (:PyApplication {name: <app>}) and its module subtree before reloading.
  • The Bolt orphan prune on a full run is scoped to (:PyApplication {name: $app})-[:PY_HAS_MODULE]->(:PyModule), so pushing app B never deletes app A’s modules.
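To make the scoping concrete, here is a minimal in-memory sketch of an app-scoped orphan prune. The dictionary stands in for the database, and `prune_orphans` is a hypothetical helper, not the emitter's code; it only models the rule that a prune looks at modules under one anchor.

```python
from typing import Dict, Set

# Hypothetical stand-in for the database: app name -> module names under that anchor.
Graph = Dict[str, Set[str]]


def prune_orphans(graph: Graph, app: str, live_modules: Set[str]) -> Set[str]:
    """Remove modules of `app` that are absent from `live_modules`; return what was removed."""
    current = graph.get(app, set())
    removed = current - live_modules
    graph[app] = current & live_modules  # only this app's entry is rewritten
    return removed


graph: Graph = {
    "service-a": {"a.routes", "a.models"},
    "service-b": {"b.api", "b.legacy"},
}
# A full run of service-b that no longer contains b.legacy
removed = prune_orphans(graph, "service-b", {"b.api"})
```

Because the function only ever touches `graph[app]`, the modules of `service-a` are untouched no matter what `service-b` pushes.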

--app-name is also the value the CLDK Python SDK matches via application_name to read back exactly this app’s subgraph. Keep --app-name (CLI) and application_name (SDK) identical.

--emit neo4j has two sub-modes, decided solely by whether --neo4j-uri is set.

Without --neo4j-uri, canpy writes a self-contained graph.cypher file: the constraints and indexes, a scoped DETACH DELETE of this app’s prior subgraph, then batched UNWIND … MERGE statements for every node and edge. It needs no extra dependencies and always contains the complete analysis; it is not incremental. With --output the file lands in that directory; otherwise it is written to the current directory.

canpy --input ./my-service --emit neo4j --app-name my-service --output ./out
# -> ./out/graph.cypher

Load it into Neo4j with cypher-shell:

cypher-shell -u neo4j -p "$NEO4J_PASSWORD" < ./out/graph.cypher

This path is ideal for committing a reproducible snapshot as a CI artifact, seeding a local database, or loading a graph offline with no driver installed.
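The batched UNWIND … MERGE shape used by the snapshot can be illustrated with a short sketch. It is written in parameterized form (a query plus a `rows` parameter) rather than the inlined text of graph.cypher, and the function names, the batch size, and the exact SET clauses are assumptions; only the signature-keyed MERGE on :PySymbol comes from the description above.

```python
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]

# Assumed query shape: merge on the shared :PySymbol key, then add the specific label.
MERGE_CALLABLES = (
    "UNWIND $rows AS row "
    "MERGE (s:PySymbol {signature: row.signature}) "
    "SET s:PyCallable "
    "SET s.cyclomatic_complexity = row.cyclomatic_complexity"
)


def batches(rows: List[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def callable_statements(rows: List[Row], size: int = 1000) -> List[Tuple[str, Dict[str, List[Row]]]]:
    """Pair the MERGE query with one parameter map per batch."""
    return [(MERGE_CALLABLES, {"rows": chunk}) for chunk in batches(rows, size)]
```

Batching keeps each statement bounded in size, and MERGE makes a reload idempotent: running the same batch twice leaves one node per signature.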

On a Bolt push, adding --file-name makes the run targeted rather than a full run. A targeted run rewrites only that file’s module and skips orphan pruning — modules for deleted files are not removed. A full run (no --file-name) enables pruning of vanished modules.

# Targeted: re-push one changed file, leave everything else (no pruning)
canpy --input ./my-service --emit neo4j --app-name my-service \
  --neo4j-uri bolt://localhost:7687 --file-name src/app/routes.py

# Full run: re-analyze the whole project and prune modules whose files are gone
canpy --input ./my-service --emit neo4j --app-name my-service \
  --neo4j-uri bolt://localhost:7687

A natural pattern is a targeted push per changed file in a fast pre-merge hook, and a scheduled full run that reconciles deletions.
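As a rough sketch of that pattern, the helper below builds one targeted command per changed Python file and a single full-run command for the scheduled job. It uses only the flags shown above; the function names and the idea of feeding it a list of changed paths (for example from your VCS) are assumptions, and the commands are returned rather than executed.

```python
from typing import List, Optional


def canpy_push_cmd(input_dir: str, app: str, uri: str,
                   file_name: Optional[str] = None) -> List[str]:
    """Build a Bolt push command; adding file_name makes it a targeted run."""
    cmd = ["canpy", "--input", input_dir, "--emit", "neo4j",
           "--app-name", app, "--neo4j-uri", uri]
    if file_name is not None:
        cmd += ["--file-name", file_name]
    return cmd


def pre_merge_cmds(input_dir: str, app: str, uri: str,
                   changed_files: List[str]) -> List[List[str]]:
    """One targeted push per changed .py file; other files are skipped."""
    return [canpy_push_cmd(input_dir, app, uri, path)
            for path in changed_files if path.endswith(".py")]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`. Deleted files are deliberately not handled here, since only a full run prunes their modules.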

Because the analyzer is the producer half and writes app-scoped subgraphs, it fits a Job (one-shot reconciliation) or a CronJob (periodic re-analysis) that pushes over Bolt into a managed or clustered Neo4j. Supply the connection through the standard environment variables — read the password from a Secret so it never lands on the command line:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: analyze-my-service
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: canpy
              image: ghcr.io/codellm-devkit/codeanalyzer-python:latest
              args:
                - --input=/src
                - --emit=neo4j
                - --app-name=my-service
                - --no-venv
                - -v
              env:
                - name: NEO4J_URI
                  value: bolt://neo4j.data:7687
                - name: NEO4J_USERNAME
                  value: neo4j
                - name: NEO4J_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: neo4j-auth
                      key: password
              volumeMounts:
                - name: source
                  mountPath: /src
          volumes:
            - name: source
              # a checkout of the project under analysis
              emptyDir: {}

--no-venv resolves imports against the ambient interpreter instead of building a per-project virtualenv, which is the right default in a container where dependencies are already installed. Each app’s CronJob writes only its own anchored subgraph, so dozens of services can target one Neo4j cluster; give the analyzer write credentials and your consumers read-only ones.

Every graph carries a schema_version stamped on its :PyApplication node — currently 1.1.0 — so a consumer can check the contract before it reads. The machine-readable contract itself is project-independent, so you can publish it without running any analysis:

# Print the schema contract to stdout...
canpy --emit schema
# ...or write it to a directory as schema.json
canpy --emit schema --output ./out
# -> ./out/schema.json

schema.json enumerates every node label, relationship type, and property the emitter can produce. It is checked into the repository as schema.neo4j.json and shipped as a GitHub Release asset, so a downstream tool can pin the version it was built against. See the output schema reference for the data model behind it.
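A consumer-side version check might look like the sketch below. The compatibility policy (same major version, and a graph at least as new as the version the tool was built against) is an assumption in the style of semantic versioning; the guide does not state the project's actual policy, and both function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(graph_version: str, built_against: str) -> bool:
    """Assumed policy: majors must match and the graph must not be older than the pin."""
    graph = parse_version(graph_version)
    pinned = parse_version(built_against)
    return graph[0] == pinned[0] and graph[1:] >= pinned[1:]
```

The `graph_version` argument would come from the schema_version property on the :PyApplication node, and `built_against` from the schema.json the tool pinned.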

Once the graph is populated, the CLDK Python SDK reads it back without re-analyzing: no JDK, no native binary, and no project source on the consumer. The graph is produced once, out of band, by the canpy --emit neo4j job above; the SDK is a read-only client that needs only the Bolt URI and read-only credentials. This is the enterprise unlock: one central analysis run serves every consumer cheaply.

  1. Install the SDK with its driver extra:

    pip install 'cldk[neo4j]'
  2. Pass a Neo4jConnectionConfig as the backend. Its application_name must match the --app-name the graph was loaded with:

    from cldk import CLDK
    from cldk.analysis.commons.backend_config import Neo4jConnectionConfig

    analysis = CLDK.python(
        backend=Neo4jConnectionConfig(
            uri="bolt://localhost:7687",
            username="neo4j",
            password="neo4j",  # read-only credentials suffice
            database=None,  # None => server default
            application_name="my-service",  # matches canpy --app-name
        ),
    )
    classes = analysis.get_classes()  # Dict[str, PyClass]
    cg = analysis.get_call_graph()  # networkx.DiGraph keyed by callable signatures
    for sig, cls in classes.items():
        print(sig, list(cls.methods))

The type of the backend= config is the whole switch: a Neo4jConnectionConfig swaps the facade onto the read-only Neo4j backend, while the default config runs the in-process analyzer. The Neo4j backend bulk-fetches nodes and relationships in a handful of Cypher queries and rebuilds the same PyApplication (the PyModule symbol table plus the PyCallEdge call graph) and the same networkx DiGraph the in-process analyzer produces. As a result, get_symbol_table(), get_call_graph(), get_modules(), get_classes(), get_class(), get_methods(), get_callers(), get_callees(), and get_imports() all return the identical typed model objects.

Because the graph is external, project_path is optional for the Neo4j backend. The backend is a context manager — use with, or call .close() to release the driver:

with CLDK.python(backend=Neo4jConnectionConfig(
        uri="bolt://localhost:7687",
        application_name="my-service")) as analysis:
    callers = analysis.get_callers("my_pkg.parser.Parser.parse")
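Since the call graph comes back as a directed graph keyed by signatures, transitive questions are a short traversal away. The sketch below works on plain (caller, callee) pairs, such as those you might collect from the DiGraph's edges; the edge direction, the helper name, and the sample signatures are assumptions for illustration. With networkx available, `networkx.ancestors` answers the same question.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def transitive_callers(edges: Iterable[Tuple[str, str]], target: str) -> Set[str]:
    """Return every signature that can reach `target` through one or more calls."""
    callers_of: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in edges:
        callers_of[callee].add(caller)

    seen: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers_of.get(node, ()):
            if caller not in seen:  # also guards against call cycles
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical edges standing in for the (caller, callee) pairs of the call graph
edges = [
    ("my_pkg.cli.main", "my_pkg.parser.Parser.parse"),
    ("my_pkg.app.run", "my_pkg.cli.main"),
    ("my_pkg.app.run", "my_pkg.util.log"),
]
impacted = transitive_callers(edges, "my_pkg.parser.Parser.parse")
```

This kind of impact query runs entirely on the consumer, against data the producer job already wrote.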

You do not have to go through the SDK — the graph is plain Cypher. A few examples against the schema:

// All applications in this database and their schema version
MATCH (a:PyApplication)
RETURN a.name AS app, a.schema_version AS schema;

// The most complex callables across the whole portfolio
MATCH (a:PyApplication)-[:PY_HAS_MODULE]->(:PyModule)
      -[:PY_DECLARES*]->(c:PyCallable)
RETURN a.name AS app, c.signature, c.cyclomatic_complexity AS cc
ORDER BY cc DESC LIMIT 20;

// Which applications call into a given external symbol
MATCH (a:PyApplication)-[:PY_HAS_MODULE]->(:PyModule)
      -[:PY_DECLARES*]->(:PyCallable)-[:PY_CALLS]->(e:PyExternal)
WHERE e.module = 'requests'
RETURN DISTINCT a.name AS app;

Because every label is Py-prefixed and the Java and TypeScript analyzers use their own prefixes, these queries are unambiguous even when all three languages share one database.