GraphQL, an emerging standard for querying multiple types of data systems, is shifting the debate about how we access and transform data, and may very well have as much impact as SQL did forty years ago, or more.
In the fifty-odd years that databases have been around, a very consistent pattern has emerged. A new company develops an indexing strategy that sits at the heart of a database product. If that company succeeds in establishing itself, a second company, then a third, will introduce their own competing strategies, each of which in turn also creates a query language or API to access the data.
Innovate, Dominate, Standardize
Eventually, progress in that space stalls as every vendor has its own language for querying and updating content. At that point, either a standards group or a powerful non-market player steps in with a solution robust enough that vendors begin to incorporate it as a "secondary" query language. In time, this becomes the de facto query language, though depending upon the language a certain degree of customization may still occur, creating potential vendor lock-in.
The SQL language for querying relational data was perhaps the ur-example of this, and even several decades and many millions of databases later, SQL remains one of the most universal query languages in use. The language evolved in stages, with the query portion emerging first, then the data definition layer for creating new content.
XML followed this pattern in the late 1990s and early 2000s, with the definition of the language itself, a path language for identifying resources (XPath), and a transformation language (XSLT), followed by the XML Schema Definition Language (XSD) and finally XQuery for performing queries.
The Semantic Web space saw a foundational language (RDF), inferential languages (RDFS and OWL), and a query language (SPARQL) in its original wave, and just within the last few years a revision of SPARQL, an update language (SPARQL UPDATE), and a second schematic language (SHACL) have made interacting with RDF remarkably standardized.
In all three cases, the benefits of standardization became obvious quickly. Query languages for databases can take as much as three years to fully integrate into products and open source projects, and when a large number of query languages are scattered across multiple databases, developers have to decide whether it is worth committing to learning any one of them. A low developer adoption rate has, in many cases, spelled the difference between a vendor having a successful product and one that goes nowhere, and few developers will adopt a query language (a process that can take months or even years) if they don't feel it will lead to potential work.
The Emergence of JSON Databases
The JSON database community in particular has been plagued by this problem for some time now. XML was one of the first general data languages, based upon its origins in the Standard Generalized Markup Language (SGML), whose roots go back to markup work of the late 1960s. Ironically, even as XML managed to largely replace SGML as a way of representing document formats, its "angle-bracket" syntax and complex rule set were frequently derided by web developers who felt it was too heavy for the kind of applications they were writing.
XML databases first appeared a few years after XML itself was standardized, though they never achieved much penetration outside of publishers and government agencies, with the biggest vendors being MarkLogic, eXistDB, and BaseX. MarkLogic still overwhelmingly dominates this market, though it has long since expanded to incorporate JSON as well.
Similar patterns hold for JSON: MongoDB was first released in 2009, and Couchbase (originally Membase) in 2010. For a while, the term NoSQL described just the JSON databases, though it later expanded to cover most non-traditional databases, while XML and JSON databases collectively became known as document databases. The term arose not because these systems necessarily store narrative content, but because the content they contain is hierarchical rather than tabular, which meant, among other things, that the complex primary key/foreign key negotiations of SQL were largely eliminated.
This also meant that at any given point you could describe one or more nodes of information through a path starting from a single root, in much the same way you could describe a branch point or leaf in a tree by its path back to the trunk. This path notation has always been significant: in order to work with a resource in a hierarchical structure you first have to find it, and the path in turn specifies an address.
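The idea of addressing a node by its path can be sketched in a few lines of Python. The sample document, path syntax, and helper name here are illustrative assumptions, not the notation of any particular database:

```python
# A hierarchical (document-style) record: every node is reachable
# by a single path from the root, with no foreign-key joins needed.
doc = {
    "book": {
        "title": "Graph Thinking",
        "chapters": [
            {"title": "Trees", "pages": 20},
            {"title": "Graphs", "pages": 35},
        ],
    }
}

def get_by_path(node, path):
    """Walk a slash-separated path like 'book/chapters/1/title' from the root."""
    for step in path.split("/"):
        # Numeric steps index into arrays; named steps index into objects.
        node = node[int(step)] if isinstance(node, list) else node[step]
    return node

print(get_by_path(doc, "book/chapters/1/title"))  # -> Graphs
```

XPath (for XML) and the various JSON path dialects are far richer than this sketch, but the core idea is the same: the path is the address of the node.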
As first out of the chute, either Couchbase or MongoDB could potentially have specified a consistent query language, but neither was able to capture enough market share to decisively set a standard, and a developer culture that tended to be militantly anti-XML rejected path-like solutions, even though paths had turned out to be one of the foundational elements that made XML initially so powerful. The general thinking was that developers would far rather use SQL-like solutions for specifying content, though in practice SQL, with its primary/foreign key architecture, was actually fairly marginal as a query language for document-oriented content.
Facebook Introduces GraphQL
In 2015, architects at Facebook made several astute observations. The first was that their infrastructure of users, posts, media and so forth was too complex for most relational data systems, and more closely resembled that of the emerging NoSQL databases. However, much of the information that was critical to the success of Facebook came about due to relationships between documents, something that most document-centric databases had, at best, very weak tools to handle.
Instead, what they had seemed best describable as linked data. However, because so much of their infrastructure was built on top of JSON, the cost and complexity of moving to an RDF solution made that path infeasible. So they released a new language called GraphQL. Graphs are connection maps, and they have one key advantage over both relational and document stores: you can express both relational and hierarchical information as graphs.
In essence, a graph is a collection of assertions. Each assertion has a subject, which represents the thing being described; a predicate, which gives the relationship; and an object, which is either a scalar value (representable as a string with some additional metadata) or a link to another thing. There's a whole branch of mathematics devoted to network graph theory (often shortened to just graph theory), and it is a topic that most computer science courses at least touch on, if not necessarily to the depth it deserves.
The Facebook knowledge graph made use of global identifiers for every resource. If you knew that identifier, you could retrieve all of the data for that resource. You could also query that data to find out what pointed to that resource, and from that could find those resources that “knew about” that resource. For instance, if a property existed that identified those users who saw a given user as being a friend, then looking for all assertions of the form ?user has-friend Jane would retrieve the network around Jane consisting of Jane’s friends.
What is most significant about this is that in a graph, none of those friends needs to retain anything about Jane beyond her identifier. This is the notion of data by reference (by-ref), and it's significant because you need only store the information pertinent specifically to Jane once. If she changes her name from "Jane Doe" to "Jane Jones", then because this information is stored by-ref, any time a program shows the friend information for John, one of Jane's friends, it will automatically reflect the update without John's data having to be changed.
GraphQL works by defining a schema that describes how a given resource is put together.
Once this is known, the GraphQL API accepts a template based upon that resource which describes the output you want to generate. The output takes into account both single-instance and multiple-instance (array) situations, which is one of the biggest problems that SQL runs into.
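As a sketch of what this looks like, a schema might declare a type with both scalar and array fields. The type and field names below are invented for illustration, not drawn from any real schema:

```graphql
# Illustrative schema: describes how a User resource is put together
type User {
  id: ID!
  name: String
  friends: [User]      # a multiple-instance (array) field
}

type Query {
  user(id: ID!): User  # a single-instance entry point
}
```

A query is then submitted as a separate document, a template shaped like the JSON you want back; the engine knows from the schema which fields return one value and which return arrays:

```graphql
query {
  user(id: "u1") {
    name
    friends {
      name
    }
  }
}
```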
GraphQL and Graph Databases
When I went to Graphorum 2019 in Chicago, GraphQL was one of the hottest topics of discussion. From the perspective of JSON databases, GraphQL offers a way to perform consistent queries across multiple potential JSON stores. However, GraphQL offers a much more powerful set of tools, while simplifying one of the biggest problems that triple stores have.
SPARQL provides two different ways to query RDF data. One type of query returns a form similar to that of SQL: tabular content based upon joins. The second type returns a set of assertions that are either the results of the various joins or calculated. While both forms can be converted to JSON, the mapping to do so is neither obvious nor free of a steep learning curve.
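The two forms correspond to SPARQL's SELECT and CONSTRUCT queries. The `ex:` namespace and property names below are illustrative, and each would be submitted as a separate query:

```sparql
PREFIX ex: <http://example.org/>

# Form 1: SELECT returns tabular rows, much like SQL
SELECT ?user ?name WHERE {
  ?user ex:hasFriend ex:Jane ;
        ex:name      ?name .
}
```

```sparql
PREFIX ex: <http://example.org/>

# Form 2: CONSTRUCT returns a new set of assertions (triples)
CONSTRUCT { ex:Jane ex:knownBy ?user }
WHERE     { ?user ex:hasFriend ex:Jane }
```

SELECT results serialize naturally to rows (or the SPARQL JSON results format), while CONSTRUCT results are themselves a graph; neither maps directly onto the nested JSON shape a web application usually wants, which is the gap GraphQL fills.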
Already, a number of graph store vendors have implemented GraphQL as an alternative to their native query languages, including Cambridge Semantics' Anzo database, the Stardog triple store, Neo4j, Ontotext's GraphDB, and TopQuadrant's TopBraid store. Amazon's Neptune can be configured to work with GraphQL through AWS AppSync.
Beyond the graph database space, GraphQL has been adapted to work with a number of SQL databases as well, in both its query and mutation (update) forms. This can dramatically cut down on the need for specialized REST API calls: a single endpoint can replace potentially thousands of microservices, dramatically reducing the complexity of networked applications in the process.
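To illustrate the consolidation (the endpoint paths and field names here are hypothetical), a client that previously issued several REST calls can fetch the same shape of data in one round trip to a single GraphQL endpoint:

```graphql
# Replaces, e.g., GET /users/u1, GET /users/u1/posts, and a comments
# call per post with one request to a single /graphql endpoint.
query {
  user(id: "u1") {
    name
    posts {
      title
      comments {
        body
      }
    }
  }
}
```

Because the client names exactly the fields it needs, the server team can stop maintaining a bespoke REST route for every screen's data requirements.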
Longer term, GraphQL may very well end up being a major component in the move toward a data-driven enterprise architecture. It will make it much easier to build applications against enterprise data systems, as data sources become more loosely coupled from client applications. This in turn should simplify integration, reduce redundancy and duplication in enterprise systems, and cut overall IT costs significantly.