Passing MongoDB data modeling
One of the most advertised features of MongoDB is its flexibility. Flexibility, however, is a double-edged sword. More flexibility means more choices when modeling data (this reminds me of Zen Python: ” There must be one — and preferably only one — obvious way to do it “). However, I like the flexibility that MongoDB provides, it’s just important to consider some best practices before choosing a data model.
THIS IS A PROBLEM
In this post, we will look at how we modeled the mailing lists and the people who belong to them. Here are the requirements:
- Each person can have one or more e-mail addresses.
- Every person can belong to any number of mailing lists.
- Everyone who belongs to a mailing list can choose which name he wants to use for the mailing list.
These requirements are obviously somewhat simplified, but they are enough to express the mechanics of the kernel on which Fiesta is based.
Let’s take a look at what our data model looks like if we never embed anything — we’ll call it a 0 investment strategy.
We have People who have a name and a password:
name: "Mike Dirolf"
pw: "Some Hashed Password"
We have a separate collection of addresses, where each address contains a link to one person:
We have groups, each of which is basically just an identifier (the IRL has some more group-specific metadata that will also be here, but we are going to ignore them to focus on the relationship):
Finally, we have a membership that links a person to the group. Each membership includes the name of the list that the person uses for the group and a link to the address where they want to receive mail for that group:
This data model is easy to design, easy to think about, and easy to maintain. However, we mainly model data as in the DBMS; we do not use the document-based MongoDB approach. For example, let’s see how we can get the addresses of other members of a group, considering one incoming address and the group name (this is a very common query for Fiesta):
- Request a collection of addresses to obtain an ID for the person concerned.
- Request membership in a collection with a person’s identifier from step 1 and the group name to obtain a group identifier.
- Request a Memberships collection again to obtain all memberships with a group identifier from step 2.
- Request an address collection to obtain the address that will be used for each of the participants in step 3.
The strategy that many novices use when modeling their data, we will call the strategy of implementing everything. To use this strategy for Fiesta, we would take all group memberships and implement them directly in the group document. We would also embed address and person metadata directly into each membership:
name: "Mike Dirolf",
pw: "Some Hashed Password",
person_addresses = ["firstname.lastname@example.org", "email@example.com", ...],
The theory behind the Embed All strategy is that by storing all related data in one place, we can simplify common requests. With this strategy, the query we have done above is trivial (remember that the query is “given address and group name, what are the other addresses of the group members”):
- Query the Groups collection for a group containing a membership with the same address as the person_addresses and group_name.
- Go through the resulting document to get the other addresses of the members.
This is as easy as it can be. But what if we want to change a person’s name or password? We would have to change it in every built-in membership.
The same goes for adding a new person_address or removing an existing one. This underscores the characteristics of the embed everything model: it can be useful to perform one specific request (because we mostly do pre-join), but it can be a nightmare for long-term maintainability. I highly recommend this approach in general.
INSERT TRIVIAL THINGS
The approach we used in Fiesta and the approach I most often recommend is to start by thinking about the 0-embed model. Once you’ve figured out this model, you can choose simple cases where embedding makes sense. In most cases, these cases tend to be too many relationships.
For example, each of our addresses belongs to one user (and is also referred to by members). Addresses are also unlikely to change very often. Let’s build them into an array to save some queries and synchronize our data model with our human mental model.
Each membership is linked to one person and one group, so we could imagine that they are built into either the person model or the group. In such cases, it is important to consider both the data access patterns and the size of the relationships.
We expect people to have no more than 1000 memberships in groups and groups to have no more than 1000 memberships, so the size does not tell us much. However, our access template — when we display the Fiesta control panel, we need access to all memberships of a person.
To simplify this request, let’s build membership into the Person model. It also has the advantage that all person addresses are stored in the Person model (because they are referenced both at the highest level and within the membership). If an address needs to be deleted or changed, we can do it all in one place.
This is what it looks like now (this is the Person model — the only other model is Group, which is identical to the 0-embed case):
name: "Mike Dirolf",
pw: "Some Hashed Password",
addresses: ["firstname.lastname@example.org", "email@example.com", ...],
The request that we discussed now looks like this:
- Request for a user with the appropriate address and built-in membership with the correct group name.
- Use the identifier of the group in the built-in membership, starting with step 1, to request other people who have membership in that group to get the addresses directly from their built-in membership.
This is still almost as easy as inline, but our data model is much cleaner and easier to maintain. We hope this step by step guide was useful — if you have any questions, let us know!