For software developers the world over, it has almost become idiomatic to separate data from code by putting the data in a config file. I question that wisdom. I think there are cases for shunting data into config files, but also good reasons to leave data in code.
In a recent codebase I worked in, we had config files to map territories to countries. For example, the config file had a JSON structure that basically said Puerto Rico, Guam and other territories are part of the United States; Bouvet Island, Jan Mayen and Svalbard are part of Norway; and so on. Another config file captured the fact that France, Germany, Sweden, etc. are part of the European Economic Area (EEA). Even more configs: the languages primarily spoken in each country, the currency in each country, and so on. You can imagine parallels like these in other projects: colour names for RGB values, car models for each brand, vegetables classified as tubers, macOS version names, and so on.
These config files capture something about the outside world, not operating characteristics of the platform such as hostname, port and database connections. These facts change infrequently, as opposed to, say, user interface messages. Moreover, these facts are not subject to “opinions” about how your platform operates, unlike, say, thresholds at which alerts fire. Putting these facts in config files that are deployed along with your code, as conventional wisdom dictates, actually complicates your application and makes it more brittle, not less.
Consider a simple application. Say I want to retrieve the currency for a given country. Let’s get the basics out of the way. We will assume the country will be specified by an ISO 3166-1 alpha-2 code, e.g., US for the United States, GB for the United Kingdom, IN for India, and so on. Likewise, we will assume the currency will be specified by an ISO 4217 code, e.g., USD for the US dollar, EUR for the euro, INR for the Indian rupee, and so on.
The naïve me from a quarter century ago would have created a map data structure in code, mapping a String (for the country code) to a String (for the currency code). Add in some boilerplate code for getters, prints, debug, etc. and you’re done. Writing the code would take me about half an hour. Populating the data structure in the code with the first dozen or so entries would take a few minutes. Compiling the code would take seconds. Maybe write a few tests, another hour. And… done!
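A minimal sketch of that in-code version might look like the following. Python is used purely for brevity; the names `CURRENCY_BY_COUNTRY` and `currency_for` are hypothetical, and the handful of entries is illustrative rather than a complete table.

```python
# Hypothetical in-code table: ISO 3166-1 alpha-2 country code -> ISO 4217 currency code.
# Only a few illustrative entries are listed here.
CURRENCY_BY_COUNTRY: dict[str, str] = {
    "US": "USD",
    "GB": "GBP",
    "IN": "INR",
    "FR": "EUR",
    "DE": "EUR",
    "JP": "JPY",
}


def currency_for(country_code: str) -> str:
    """Return the currency code recorded for the given country code."""
    try:
        return CURRENCY_BY_COUNTRY[country_code.upper()]
    except KeyError:
        raise LookupError(f"no currency recorded for country {country_code!r}") from None
```

Adding a country, or changing its currency, is a one-line edit to the table.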
Done? Not quite. What about maintenance? Ah, fair. I do need to add more countries. And countries do change their currencies. Sometimes countries may drop out of existence as well. Each one of those changes would require me to go back to the code and change it. Which would require recompiling the project containing this code. That’s a pain.
A more modern approach is to extract all of that data out of the code and put it in a config file. In times past, this file may have been a tab-delimited file, then perhaps XML, but today, it’s likely JSON. Or, we could put all of this information in a database. But writing database code seems overkill for this application. Besides, I want to talk about config files for now, so JSON it is.
My code is much simpler now. I read the file, parse it, stuff the contents into a map data structure, again mapping a String to a String. Add in the same boilerplate as before, and you’re done. Writing this code would take a bit longer, maybe an hour or slightly more, because I now have to deal with file exceptions, parsing, etc. Compiling and testing take roughly the same amount of time as before.
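For comparison, here is a rough sketch of the config-file version, assuming the file holds a single flat JSON object such as `{"US": "USD", "IN": "INR"}`. The names `ConfigError` and `load_currency_map`, and the file layout itself, are assumptions made for illustration.

```python
import json
from pathlib import Path


class ConfigError(Exception):
    """Raised when the currency config file cannot be used."""


def load_currency_map(path: Path) -> dict[str, str]:
    """Read a flat JSON object of country code -> currency code from disk."""
    try:
        text = path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Missing file, wrong permissions, or bytes that are not valid UTF-8.
        raise ConfigError(f"cannot read {path}: {exc}") from exc
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"{path} is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ConfigError(f"{path} must contain a JSON object at the top level")
    for country, currency in data.items():
        if not isinstance(currency, str):
            raise ConfigError(f"currency for {country!r} must be a string")
    return data
```

Most of the function is there to cope with things that cannot go wrong when the table lives in the code.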
Ostensibly, maintenance now is simpler, because I can – or I can ask my users to – edit the JSON files and add, update or remove entries as they see fit. No recompiling necessary. No need to touch code, therefore fewer chances of introducing errors. So… we’re better off, right?
I don’t think so, conventional wisdom notwithstanding.
Let’s start with the maintenance bit. Editing JSON and editing a map data structure in code are about equally easy or difficult. If you or your users can edit one, they can edit the other. If your users can’t be trusted with editing a map, then I argue JSON is even more difficult to edit, and now you need to build a user interface for them, which is far more code. JSON is more difficult to edit because you do not get instant feedback on your edits if you use a typical editor. If you edit code in an IDE, any syntax errors are caught instantaneously. It’s the ultimate “upstreaming” or “shift left” idea – your errors are caught not at run-time, not during testing, not during compiling, but during editing!
Sure, you could use a JSON editor, but the best you will get out of it is fixing the silliest syntax errors, like the presence/absence of colons, commas, quotes, braces and brackets. Contrast that with a halfway decent IDE, which can catch not only silly syntax errors, but also type and structure errors. Your IDE can tell you about spurious and missing elements in your data structure even if your syntax is correct. If you’re willing to put more effort into your data structures, you could change the map to be from an enumeration (for country code) to another enumeration (for currency code), and instantly start getting a modicum of semantic checking under the guise of type checking. In other words, with that extra effort, you can ensure that your map always has a legitimate country mapping to a legitimate currency, and that a country is not repeated, all at edit time. Heck, if you forget to add a country to the map, your IDE can probably also warn you about it.
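The enumeration-to-enumeration idea can be sketched as below. The enum members and the `missing_countries` helper are hypothetical and cover only a few countries. One caveat: in a statically compiled language these checks would happen at edit or compile time, whereas in Python the nearest equivalents are a type checker (which flags a nonexistent member such as `Country.XX`) and the import-time check at the bottom of the sketch.

```python
from enum import Enum


class Country(Enum):
    US = "US"
    GB = "GB"
    IN = "IN"
    FR = "FR"


class Currency(Enum):
    USD = "USD"
    GBP = "GBP"
    INR = "INR"
    EUR = "EUR"


# Keys and values can only be members that actually exist in the enums above.
CURRENCY_BY_COUNTRY: dict[Country, Currency] = {
    Country.US: Currency.USD,
    Country.GB: Currency.GBP,
    Country.IN: Currency.INR,
    Country.FR: Currency.EUR,
}


def missing_countries(mapping: dict[Country, Currency]) -> set[Country]:
    """Return every Country member that has no entry in the mapping."""
    return set(Country) - set(mapping)


# Fail as soon as the module is imported if someone added a country
# to the enum but forgot to give it a currency.
_missing = missing_countries(CURRENCY_BY_COUNTRY)
if _missing:
    names = sorted(country.name for country in _missing)
    raise RuntimeError(f"countries without a currency: {names}")
```

A misspelt code such as `Country.UX` never makes it into the table, because it fails the moment the module is loaded, if not earlier in the editor.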
What about the recompiling bit? I will postulate that the project that contains this currency retrieval code is either worked upon often, or it is not. If it is not, the entire project is like a library that can be compiled infrequently and linked into the rest of your codebase. Yes, this library will change when you add, update or remove countries, but we’re in the not-worked-upon-often clause of my argument. Contrariwise, if this project is worked upon often, other unrelated changes will initiate recompilations anyway. The few extra milliseconds to compile this code don’t move the needle either way.
Despite these arguments, if we generously concede that the code approach is about the same effort as the config file approach, let’s take a second look at that config file approach, and consider the hidden complexity.
If we decouple the data from the code in this example, we now have to carry around two files – one for the code, one for the config – and make sure both are “linked” together meaningfully. “Linked” how? For starters, both have to be present in your deployment for this tiny application to work. Sure, you can tighten your deployments, think of config as code, and then deploy. Well, now, every time your config changes, you have to deploy again, exactly as would have been the case had you embedded your data in code. If you deploy your config separately from your code, you have to worry about previously unforeseen error cases. What happens if the file is absent? What happens if the file doesn’t have the right permissions? What happens if the file gets corrupted? What happens if there are race conditions between deployments of new versions of the code and the file?
It gets worse. If you decouple the people – human beings – who edit the code and who edit the config file (a favourite trope of software developers), how do you ensure that these two sets of people have the same mental model of what goes in the config file? How do these people agree about what to do about missing countries? Or about duplicate entries? Or countries accidentally mapped to more than one currency? Or entries carrying extra JSON snippets? Ah, documentation to the rescue. So now, we create a third file that explains – in human-readable language – what mental models we need to adopt to understand how the code and data interplay. And… who will keep that documentation up-to-date? Who volunteers to read it every single time either code or data changes?
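To make the duplicate-entry question concrete: Python’s standard `json` module, by default, silently keeps the last value when a key is repeated, so the two groups of people can disagree without anyone noticing. The sketch below shows one way the code could enforce its own mental model instead, using the `object_pairs_hook` parameter of `json.loads`. The function name `parse_strict` and the flat file layout are assumptions.

```python
import json


def _reject_duplicates(pairs: list[tuple[str, object]]) -> dict[str, object]:
    """Build a dict from key/value pairs, refusing repeated keys."""
    seen: dict[str, object] = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate entry for {key!r}")
        seen[key] = value
    return seen


def parse_strict(text: str) -> dict[str, str]:
    """Parse a flat country -> currency JSON object, rejecting surprises."""
    data = json.loads(text, object_pairs_hook=_reject_duplicates)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for country, currency in data.items():
        # Anything other than a plain string counts as an "extra JSON snippet".
        if not isinstance(currency, str):
            raise ValueError(f"unexpected value for {country!r}: {currency!r}")
    return data
```

Every rule in this function is a decision that somebody has to make, write down, and keep in sync with whoever edits the file.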
What if we wanted to make legitimate changes to this application? Say, we wanted to add the year the country adopted this currency. I won’t belabour the benefits of type-checking the year here – make it an integer and reap the benefits in the code-based approach. I will say though, that a code-based approach makes it trivial to write a unit test to perform a bounds check on that year – is it a legitimate year that has occurred in human history? A config-based approach postpones such a check to run-time, and even that check will be sacrificed at the altar of “fast startup time”.
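A bounds check of that kind might be sketched as follows. The `CurrencyInfo` type, the test name, and the choice of bounds (year 1 up to the current year) are assumptions. The years shown are illustrative, and what counts as “adopted” depends on the definition you pick.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CurrencyInfo:
    code: str          # ISO 4217 currency code
    adopted_year: int  # year the country adopted this currency


# Hypothetical table with a few illustrative entries.
CURRENCY_BY_COUNTRY: dict[str, CurrencyInfo] = {
    "US": CurrencyInfo("USD", 1792),
    "JP": CurrencyInfo("JPY", 1871),
    "FR": CurrencyInfo("EUR", 1999),
    "DE": CurrencyInfo("EUR", 1999),
}


def test_adoption_years_are_plausible() -> None:
    """Every adoption year must be an integer no later than the current year."""
    this_year = date.today().year
    for country, info in CURRENCY_BY_COUNTRY.items():
        assert isinstance(info.adopted_year, int), country
        assert 1 <= info.adopted_year <= this_year, country
```

Because the data is in the code, this runs with the rest of the test suite, long before anything is deployed.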
How will the changes be rolled out? In a code-based approach, change the data and code in one shot, submit one pull request, and you’re done. But in a config-based approach, either you synchronise the code and config changes in one deployment (in which case, you didn’t gain anything by separating the two), or you tolerate asynchronous code and config deployments. That means you now have to deal with backward compatibility. Your config file cannot worry about compatibility because, by definition, it is data alone and cannot worry about anything. But your code has to worry about parsing the new format and the old format of the config file. That leads to code bloat, and the inevitable hand-wringing about technical debt. Erasing that technical debt means deprecating the old config format so we can trim the code to deal with just the new format. Deprecate too soon, and you now have a problem if your config, for whatever reason, rolls back. Now, your code expects the new config format, but you have an old config, and… boom. Treat the new elements as optional? Sure… back to code bloat, because now you need test cases for when the year is present and when it’s not present, you need sensible defaults, and your entire codebase has to understand optionality.
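Here is a small sketch of what tolerating both formats could look like, assuming the old format stores a bare string per country (`"USD"`) and the new one stores an object (`{"currency": "USD", "adopted_year": 1792}`). The field names, `CurrencyInfo` and `parse_entry` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CurrencyInfo:
    code: str
    adopted_year: Optional[int]  # None when the config predates the new field


def parse_entry(raw: object) -> CurrencyInfo:
    """Accept one config entry in either the old or the new format."""
    if isinstance(raw, str):
        # Old format: just the currency code, no year available.
        return CurrencyInfo(raw, None)
    if isinstance(raw, dict):
        # New format: an object in which the year is treated as optional.
        code = raw.get("currency")
        year = raw.get("adopted_year")
        if not isinstance(code, str):
            raise ValueError(f"missing or invalid currency in {raw!r}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
            raise ValueError(f"invalid adopted_year in {raw!r}")
        return CurrencyInfo(code, year)
    raise ValueError(f"unrecognised config entry: {raw!r}")
```

Every caller of `parse_entry` now has to decide what to do when `adopted_year` is `None`, which is exactly the optionality that spreads through the rest of the codebase.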
So, should we do away with config files altogether? Embrace the atavism of data embedded in code? I don’t think so. There are plenty of cases where you need config files. For example, if you have different service configurations for “production” and for “testing”, it makes sense to group those in different config files. My cautionary note is about cases where the config files are capturing some truths about our world, truths that change relatively infrequently, and truths that are, um, true, regardless of your service configurations. Truths about countries, currencies, continents, months and days of the week, browsers, operating systems, automobile models, colours, genres, and many, many other things are truths like that – they are about our world, they change slowly, and they are true no matter in what mode your application is running. We should seriously consider moving this data back into code.