2.1. The Neo4j Graph Database

A graph database stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way.

For terminology around graph databases, see Terminology.

Here’s an example graph which we will approach step by step in the following sections:

Nodes

A graph records data in nodes and relationships. Both can have properties. This is sometimes referred to as the Property Graph Model.

The fundamental units that form a graph are nodes and relationships. In Neo4j, both nodes and relationships can contain properties.

Nodes are often used to represent entities, but depending on the domain, relationships may be used for that purpose as well.

Apart from properties and relationships, nodes can also be labeled with zero or more labels.

The simplest possible graph is a single node. A node can have zero or more named values referred to as properties. Let’s start out with one node that has a single property named title:

The next step is to have multiple nodes. Let’s add two more nodes and one more property on the node in the previous example:
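The two steps above can be sketched in plain Python, with each node as a dictionary of properties. This is only an illustrative in-memory model, not a Neo4j API. The `released` and `name` property names, and the director's node, are assumptions chosen to fit the movie example.

```python
# Illustrative in-memory model only; this is not how Neo4j stores nodes.

# Step 1: the simplest possible graph, one node with a single property.
forrest_gump = {"title": "Forrest Gump"}
graph = [forrest_gump]

# Step 2a: one more property on the existing node
# (the property name `released` is an assumption).
forrest_gump["released"] = 1994

# Step 2b: two more nodes (the `name` property is an assumption).
tom_hanks = {"name": "Tom Hanks"}
director = {"name": "Robert Zemeckis"}
graph.extend([tom_hanks, director])

for node in graph:
    print(node)
```

Each node carries only the properties it needs; there is no requirement that all nodes share the same property names.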

Relationships

Relationships organize the nodes by connecting them. A relationship connects two nodes — a start node and an end node. Just like nodes, relationships can have properties.

Relationships between nodes are a key part of a graph database. They allow for finding related data.

A relationship connects two nodes, and is guaranteed to have valid start and end nodes.

Relationships organize nodes into arbitrary structures, allowing a graph to resemble a list, a tree, a map, or a compound entity — any of which can be combined into yet more complex, richly inter-connected structures.

Our example graph will make a lot more sense once we add relationships to it:

Our example uses ACTED_IN and DIRECTED as relationship types. The roles property on the ACTED_IN relationship has an array value with a single item in it.

Below is an ACTED_IN relationship, with the Tom Hanks node as start node and Forrest Gump as end node.

You could also say that the Tom Hanks node has an outgoing relationship, while the Forrest Gump node has an incoming relationship.
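A relationship as described here (start node, end node, type, properties) can be sketched as a small Python record. The class and function names are hypothetical, and the `roles` value is an assumption; the text only says it is an array with a single item.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node          # the node the relationship leaves
    end: Node            # the node the relationship arrives at
    rel_type: str
    properties: dict = field(default_factory=dict)


tom_hanks = Node({"name": "Tom Hanks"})
forrest_gump = Node({"title": "Forrest Gump"})

# ACTED_IN with Tom Hanks as start node and Forrest Gump as end node.
acted_in = Relationship(tom_hanks, forrest_gump, "ACTED_IN", {"roles": ["Forrest"]})
relationships = [acted_in]


def outgoing(node, rels):
    """Relationships that have `node` as their start node."""
    return [r for r in rels if r.start is node]


def incoming(node, rels):
    """Relationships that have `node` as their end node."""
    return [r for r in rels if r.end is node]


print(len(outgoing(tom_hanks, relationships)))     # outgoing from Tom Hanks
print(len(incoming(forrest_gump, relationships)))  # incoming to Forrest Gump
```

The same relationship object shows up as outgoing on one node and incoming on the other, which is why a second relationship in the opposite direction is not needed.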

Note: Relationships are equally well traversed in either direction.

This means that there is no need to add duplicate relationships in the opposite direction (with regard to traversal or performance).

While relationships always have a direction, you can ignore the direction where it is not useful in your application.

Note that a node can have relationships to itself as well:

The example above would mean that Tom Hanks KNOWS himself.

To further enhance graph traversal, all relationships have a relationship type.

Let’s have a look at what can be found by simply following the relationships of a node in our example graph:

Using relationship direction and type

What we want to know     Start from    Relationship type   Direction
get actors in movie      movie node    ACTED_IN            incoming
get movies with actor    person node   ACTED_IN            outgoing
get directors of movie   movie node    DIRECTED            incoming
get movies directed by   person node   DIRECTED            outgoing
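The four lookups in the table can be sketched as one small function that filters relationships by type and direction. This is a plain Python model, not a Neo4j API: the `follow` function is hypothetical, node names stand in for nodes, and the director's name is an assumption.

```python
# Relationships as (start, type, end) triples; names stand in for nodes.
RELATIONSHIPS = [
    ("Tom Hanks", "ACTED_IN", "Forrest Gump"),
    ("Robert Zemeckis", "DIRECTED", "Forrest Gump"),
]


def follow(node, rel_type, direction):
    """Return the nodes reached from `node` over `rel_type` in `direction`."""
    if direction == "outgoing":
        return [end for start, t, end in RELATIONSHIPS
                if start == node and t == rel_type]
    if direction == "incoming":
        return [start for start, t, end in RELATIONSHIPS
                if end == node and t == rel_type]
    raise ValueError(f"unknown direction: {direction!r}")


actors = follow("Forrest Gump", "ACTED_IN", "incoming")    # get actors in movie
movies = follow("Tom Hanks", "ACTED_IN", "outgoing")       # get movies with actor
directors = follow("Forrest Gump", "DIRECTED", "incoming") # get directors of movie
directed = follow("Robert Zemeckis", "DIRECTED", "outgoing")  # get movies directed by
print(actors, movies, directors, directed)
```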

Properties

Both nodes and relationships can have properties.

Properties are named values where the name is a string. The supported property values are:

  • Numeric values,
  • String values,
  • Boolean values,
  • Lists of any other type of value.
Note: NULL is not a valid property value.

NULLs can instead be modeled by the absence of a key.
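The rules above can be sketched as a small check in Python. This is a simplification under stated assumptions: Python's `int` and `float` stand in for numeric values, `None` stands in for NULL, and both function names are hypothetical. The full rules are in the property values section referenced below.

```python
SCALAR_TYPES = (bool, int, float, str)


def is_valid_property_value(value):
    """Numeric, string and boolean values, or a list of such values."""
    if isinstance(value, SCALAR_TYPES):
        return True
    if isinstance(value, list):
        return all(isinstance(item, SCALAR_TYPES) for item in value)
    return False  # None (NULL) and everything else is rejected


def set_property(properties, key, value):
    """Set a property; a NULL value is modeled by removing the key."""
    if not isinstance(key, str):
        raise TypeError("property names must be strings")
    if value is None:
        properties.pop(key, None)
        return
    if not is_valid_property_value(value):
        raise TypeError(f"unsupported property value: {value!r}")
    properties[key] = value


props = {}
set_property(props, "name", "Tom Hanks")
set_property(props, "roles", ["Forrest"])
set_property(props, "name", None)  # removes the key instead of storing NULL
print(props)
```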

For further details on supported property values, see Section 32.3, “Property values”.

Labels

Labels assign roles or types to nodes.

A label is a named graph construct that is used to group nodes into sets; all nodes labeled with the same label belong to the same set. Many database queries can work with these sets instead of the whole graph, making queries easier to write and more efficient to execute. A node may be labeled with any number of labels, including none, making labels an optional addition to the graph.

Labels are used when defining constraints and adding indexes for properties (see the section called “Schema”).

An example would be a label named User that you label all your nodes representing users with. With that in place, you can ask Neo4j to perform operations only on your user nodes, such as finding all users with a given name.

However, you can use labels for much more. For instance, since labels can be added and removed during runtime, they can be used to mark temporary states for your nodes. You might create an Offline label for phones that are offline, a Happy label for happy pets, and so on.

In our example, we’ll add Person and Movie labels to our graph:

A node can have multiple labels. Let’s add an Actor label to the Tom Hanks node.
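Labels as sets of nodes can be sketched in Python by giving each node a set of label names. This is an illustrative model only; the `nodes_with_label` helper and the director's node are assumptions.

```python
# Each node name maps to its set of labels (a node may have none).
nodes = {
    "Tom Hanks": {"Person"},
    "Robert Zemeckis": {"Person"},
    "Forrest Gump": {"Movie"},
}


def nodes_with_label(label):
    """All nodes in the set defined by `label`, sorted for stable output."""
    return sorted(name for name, labels in nodes.items() if label in labels)


# Labels can be added at runtime; Tom Hanks is now both Person and Actor.
nodes["Tom Hanks"].add("Actor")

print(nodes_with_label("Person"))
print(nodes_with_label("Actor"))
```

A query that only concerns actors can start from the Actor set rather than scanning every node.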

Label names

Any non-empty Unicode string can be used as a label name. In Cypher, you may need to use the backtick (`) syntax to avoid clashes with Cypher identifier rules or to allow non-alphanumeric characters in a label. By convention, labels are written with CamelCase notation, with the first letter in upper case. For instance, User or CarOwner.

Labels have an id space of an int, meaning the maximum number of labels the database can contain is roughly 2 billion.
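The "roughly 2 billion" figure follows from the size of an int. A quick check, assuming a 32-bit signed integer (as a Java `int` is):

```python
INT_BITS = 32  # assumption: a 32-bit signed int

# Largest non-negative value a signed 32-bit int can hold.
max_int = 2 ** (INT_BITS - 1) - 1
print(max_int)
```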

Traversal

A traversal navigates through a graph to find paths.

A traversal is how you query a graph, navigating from starting nodes to related nodes, finding answers to questions like “what music do my friends like that I don’t yet own,” or “if this power supply goes down, what web services are affected?”

Traversing a graph means visiting its nodes, following relationships according to some rules. In most cases only a subgraph is visited, as you already know where in the graph the interesting nodes and relationships are found.

Cypher provides a declarative way to query the graph powered by traversals and other techniques. See Cypher Query Language for more information.

When writing server plugins or using Neo4j embedded, Neo4j provides a callback-based traversal API which lets you specify the traversal rules. At a basic level there’s a choice between traversing breadth-first or depth-first.
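The difference between the two orderings can be sketched on a small graph in Python. This is not the Neo4j traversal API; the graph, its node names, and the `traverse` function are hypothetical.

```python
from collections import deque

# Hypothetical graph: each node maps to the nodes its relationships lead to.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
}


def traverse(start, breadth_first):
    """Visit every node reachable from `start`, returning the visit order."""
    frontier = deque([start])
    visited = []
    while frontier:
        # Breadth-first takes the oldest entry (queue),
        # depth-first takes the newest entry (stack).
        node = frontier.popleft() if breadth_first else frontier.pop()
        if node in visited:
            continue
        visited.append(node)
        neighbours = GRAPH.get(node, [])
        # Reverse for depth-first so neighbours are visited in listed order.
        frontier.extend(neighbours if breadth_first else reversed(neighbours))
    return visited


print(traverse("a", breadth_first=True))
print(traverse("a", breadth_first=False))
```

Breadth-first visits all nodes one relationship away before going further out, while depth-first follows one branch to its end before backing up.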

If we want to find out which movies Tom Hanks acted in according to our tiny example database, the traversal would start from the Tom Hanks node, follow any ACTED_IN relationships connected to the node, and end up with Forrest Gump as the result (see the dashed lines):

Paths

A path is one or more nodes with connecting relationships, typically retrieved as a query or traversal result.

In the previous example, the traversal result could be returned as a path:

The path above has length one.

The shortest possible path has length zero (that is, it contains only a single node and no relationships) and can look like this:

This path has length one:
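Path length counts relationships, not nodes. A minimal Python sketch, representing a path as an alternating list of node, relationship type, node; the representation and the `path_length` helper are assumptions for illustration. A node with a relationship to itself, as in the KNOWS example earlier, is one case of a path of length one.

```python
def path_length(path):
    """A path alternates node, relationship, node, ...; count relationships."""
    return len(path) // 2


single_node = ["Tom Hanks"]                             # no relationships
acted_in = ["Tom Hanks", "ACTED_IN", "Forrest Gump"]    # one relationship
self_loop = ["Tom Hanks", "KNOWS", "Tom Hanks"]         # one relationship

for path in (single_node, acted_in, self_loop):
    print(path_length(path), path)
```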

Schema

Neo4j is a schema-optional graph database.

You can use Neo4j without any schema. Optionally you can introduce it in order to gain performance or modeling benefits. This allows a way of working where the schema does not get in your way until you are at a stage where you want to reap the benefits of having one.

Note

Schema commands can only be applied on the master machine in a Neo4j cluster (see Chapter 25, High Availability). If you apply them on a slave you will receive a Neo.ClientError.Transaction.InvalidType error code (see Section 20.2, “Neo4j Status Codes”).

Indexes

Performance is gained by creating indexes, which improve the speed of looking up nodes in the database.

Note

This feature was introduced in Neo4j 2.0, and is not the same as the legacy indexes (see Chapter 34, Legacy Indexing).

Once you’ve specified which properties to index, Neo4j will make sure your indexes are kept up to date as your graph evolves. Any operation that looks up nodes by the newly indexed properties will see a significant performance boost.

Indexes in Neo4j are eventually available. When you first create an index, the operation returns immediately while the index is populated in the background, so it is not immediately available for querying. Once the index has been fully populated, it comes online and is ready to be used in queries.

If something should go wrong with the index, it can end up in a failed state. When it is failed, it will not be used to speed up queries. To rebuild it, you can drop and recreate the index. Look at logs for clues about the failure.
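The index lifecycle described above (populating, then online, or failed) can be sketched as a toy index in Python. This is not a Neo4j API: the class and method names are hypothetical, population runs synchronously here rather than in the background, and the failure is triggered artificially by a value the toy index cannot store.

```python
from enum import Enum


class IndexState(Enum):
    POPULATING = "populating"
    ONLINE = "online"
    FAILED = "failed"


class PropertyIndex:
    """Toy index on one property key for nodes carrying one label."""

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.entries = {}                    # property value -> list of nodes
        self.state = IndexState.POPULATING   # creation returns before population

    def populate(self, nodes):
        """Fill the index; synchronous here, a background job in the text."""
        try:
            for node in nodes:
                if self.label in node["labels"] and self.key in node["properties"]:
                    value = node["properties"][self.key]
                    self.entries.setdefault(value, []).append(node)
        except Exception:
            self.state = IndexState.FAILED   # a failed index is not used
        else:
            self.state = IndexState.ONLINE

    def lookup(self, value):
        """Return matching nodes, or None when the index cannot be used."""
        if self.state is not IndexState.ONLINE:
            return None
        return self.entries.get(value, [])


nodes = [
    {"labels": {"Person"}, "properties": {"name": "Tom Hanks"}},
    {"labels": {"Movie"}, "properties": {"title": "Forrest Gump"}},
]

index = PropertyIndex("Person", "name")
before = index.lookup("Tom Hanks")   # index is still populating
index.populate(nodes)
after = index.lookup("Tom Hanks")    # index is online
print(index.state, before, after)
```

A caller that gets no answer from the index would have to fall back to scanning the nodes, which is the slower path an online index avoids.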

You can track the status of your index by asking for the index state through the API you are using. Note, however, that this is not yet possible through Cypher.

How to use indexes through the different APIs:

Constraints

Note

This feature was introduced in Neo4j 2.0.

Neo4j can help you keep your data clean. It does so using constraints, which let you specify rules for what your data should look like. Any change that breaks these rules is denied.

In this version, unique constraints are the only available constraint type.
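The idea of a unique constraint, denying a change that would duplicate a property value among nodes with a given label, can be sketched in Python. This is an illustrative model, not how Neo4j enforces constraints; the class names, the `User` label, and the `email` property are hypothetical.

```python
class UniqueConstraintViolation(Exception):
    """Raised when a change would break a unique constraint."""


class UniqueConstraint:
    """Toy rule: `key` must be unique among nodes labeled `label`."""

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.seen = set()  # values already taken

    def check_and_register(self, labels, properties):
        # Nodes without the label or the property are not covered by the rule.
        if self.label not in labels or self.key not in properties:
            return
        value = properties[self.key]
        if value in self.seen:
            raise UniqueConstraintViolation(
                f"{self.label}.{self.key} = {value!r} already exists")
        self.seen.add(value)


constraint = UniqueConstraint("User", "email")
constraint.check_and_register({"User"}, {"email": "a@example.com"})
constraint.check_and_register({"Movie"}, {"email": "a@example.com"})  # other label

try:
    constraint.check_and_register({"User"}, {"email": "a@example.com"})
    denied = False
except UniqueConstraintViolation:
    denied = True  # the second User with the same email is rejected
print(denied)
```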

How to use constraints through the different APIs: