It’s time to kill the file

The file is a core concept of any operating system, and it makes perfect sense for the operating system and for developers to interact with files. However, the users of operating systems are forced into using files to interact with their content, and that’s just broken. It’s time to kill the file.

Search for data with commercial value is a solved problem

Here is a quote from Steve Jobs at the 2005 AllThingsD conference about the file system:

In every user interface study we’ve ever done […], [we found] it’s pretty easy to learn how to use these things ’til you hit the file system and then the learning curve goes vertical. So you ask yourself, why is the file system the face of the OS? Wouldn’t it be better if there was a better way to find stuff?
— Steve Jobs

The concept of a file is powerful because it is simple: a file represents a specific piece of data, and you can give it a name. From there, people have gotten used to organizing their content in trees of sub-folders, because this is the only solution operating systems give them. However, file system hierarchies only make sense for data that can be organized along a single dimension, and the problem is that most content has multiple dimensions. The search feature of today’s operating systems regularly fails to find the files you are looking for, so you end up trying to remember where you put them. This means that you have to use some of your brain space to store what is basically your file system index.

Take music, for example. Back in the mid-2000s, when MP3 was king, the most organized users would store their MP3 files in a series of sub-folders sorted by genre, then by artist. Any other metadata you wanted to search on had to be included either in the folder names or in the file names, which could mean hours of work renaming and moving files around. For movies, the problem is even more complex: it would be impractical to have all of the metadata related to a movie, like the names of all the actors and producers, as part of the file names.

Luckily for us, Spotify has solved the problem for music, and Netflix/PopcornTime have solved it for movies. Not only have they solved the search problem, they have also removed the concept of the file from the equation entirely. This makes perfect sense: all I care about is being able to search for and play a song just by typing the title or the artist’s name. The computer may represent that song as a file or as something else; I really don’t care, I just want to listen to it. Applications like Spotify are doing a great job because:

  1. They deal with data that has commercial value: they can afford to spend money tagging and classifying that data, because they can profit from it.
  2. They care about only one type of specialized data: there is a finite number of known dimensions they need to handle, for example title, artist, year, producer, etc.
  3. The work is cost-effective: they only need to classify the data once, and they can then offer it to any number of customers.

Search for personal data is an unsolved problem

Phil Libin, the CEO of Evernote, explained in an interview his vision for the future of personal data search:

The idea of files and the idea of documents is receding, it is becoming less important, and to be replaced with […] insights and knowledge. There is already too many files to deal with, too many documents, too many folders. […] helping you find files isn’t particularly useful. What you want is to know things, you want to be smarter. […] But you shouldn’t have to think [of files] as discrete entities, you should think about what you are trying to find out, and then you should know it.
— Phil Libin

The problem of search for personal data is hard and still unsolved. Let me explain. I have hundreds of files on my computer that were generated by various specialized applications: graphic design, accounting, notes, recordings, etc., all in different file formats. How do I make sense of all that? Sure, I could create a hierarchy of sub-folders, but first, it would take me hours to classify everything, and second, that hierarchy would represent only one dimension of the data, so there would always be some metadata I could not use when searching. Recent versions of Mac OS X have a feature named Tags, which lets you create tags and assign them to files. Those tags can later be used when searching, but even then I would have to do the work manually, so it’s not really solving the problem. Brett Geoghegan gathered his thoughts on the matter in a blog article, which made for a good discussion on Hacker News.

In addition to that, I have a bunch of contracts, bills, receipts, bank statements, and various other documents in PDF format. All of the knowledge is in them: company names, dates, customer numbers, etc. A computer should be able to just scan over those files, identify exactly what they are about, and index them appropriately so that I can search inside them later, along any dimension needed. There should be no need for me to do anything manually in that process. And again, I do not care that this data is represented by the operating system as a collection of files: if I search for “water bills 2014”, I just want to see the list of my water bills for 2014, and be given the option to click on them to display their content.
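To make the idea of searching along any dimension a bit more concrete, here is a minimal sketch in Python. It assumes the hard part (extracting the type, year and company from each PDF) has already been done, and only shows the indexing side. Every name in it (`Document`, `FacetIndex`, the document ids, the company names) is made up for the illustration and does not refer to any real product or library.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document plus the metadata ("dimensions") extracted from it."""
    doc_id: str
    text: str
    facets: dict[str, str] = field(default_factory=dict)


class FacetIndex:
    """Tiny inverted index over both the text and the extracted metadata."""

    def __init__(self) -> None:
        self._by_term: dict[str, set[str]] = {}

    def add(self, doc: Document) -> None:
        # Index every word of the text and of every facet value,
        # so that no single dimension is privileged over the others.
        terms = set(doc.text.lower().split())
        for value in doc.facets.values():
            terms.update(value.lower().split())
        for term in terms:
            self._by_term.setdefault(term, set()).add(doc.doc_id)

    def search(self, query: str) -> list[str]:
        # A document matches only if it contains every query term.
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(
            *(self._by_term.get(term, set()) for term in terms)
        )
        return sorted(matches)


# Hypothetical sample data: ids, companies and amounts are invented.
index = FacetIndex()
index.add(Document("water-bill-2014-03", "Invoice 1001 amount due 31.50",
                   {"type": "water bills", "year": "2014", "company": "Acme Water"}))
index.add(Document("water-bill-2014-09", "Invoice 1002 amount due 28.10",
                   {"type": "water bills", "year": "2014", "company": "Acme Water"}))
index.add(Document("water-bill-2013-09", "Invoice 0950 amount due 27.00",
                   {"type": "water bills", "year": "2013", "company": "Acme Water"}))
index.add(Document("power-bill-2014-03", "Invoice 7001 amount due 60.00",
                   {"type": "electricity bills", "year": "2014", "company": "Volt Power"}))

print(index.search("water bills 2014"))
```

The query never mentions a folder or a file name. The same index answers “acme” or “electricity 2014” equally well, which is exactly what a single sub-folder hierarchy cannot do.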

Towards a unified search

At home, I have folders filled with paper documents, which are again contracts, bills, receipts, bank statements, etc. How do I search through them? Ideally, there would be an application that lets me scan them in batches, runs OCR to extract the text, and then indexes their content for me.

You can look even further. What about emails, Facebook, and other means of communication? Let’s take a hypothetical example: imagine that in 2014 I had issues with my water bills, so I emailed the water company and my account was regularized. Later that week, the water company made a bank transfer to send me the money they owed me. Thus, if I search for “water bills 2014”, I want to see not only the water bills for 2014, but also all the email communication that happened, and the bank transfers from the water company extracted from my bank statements. What I really need is a data provider that can take all my data and documents, build insights and meaning between them, and unify them under a single search box. This could then be coupled with a digital personal assistant like Siri, which you could simply ask deep-knowledge questions about your data and get direct answers.
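As a rough sketch of what unified could mean here, the Python snippet below puts documents, emails and bank transfers into one common record shape and answers a single query across all of them. It assumes some earlier step has already recognized which company each item is about. The `Record` type, the `unified_search` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One item of personal data, whatever application it came from."""
    source: str               # "document", "email" or "bank"
    title: str
    year: int
    entities: frozenset[str]  # who or what the item is about


def unified_search(records: list[Record], entity: str, year: int) -> dict[str, list[str]]:
    """Return matching titles grouped by source, across all sources at once."""
    results: dict[str, list[str]] = {}
    for record in records:
        if entity in record.entities and record.year == year:
            results.setdefault(record.source, []).append(record.title)
    return results


# Hypothetical sample data following the water company example.
water = frozenset({"water company"})
records = [
    Record("document", "Water bill March 2014", 2014, water),
    Record("email", "Re: overcharge on my account", 2014, water),
    Record("bank", "Transfer from water company: refund", 2014, water),
    Record("document", "Water bill March 2013", 2013, water),
    Record("email", "Gym membership renewal", 2014, frozenset({"gym"})),
]

for source, titles in unified_search(records, "water company", 2014).items():
    print(source, titles)
```

The bill, the email thread and the refund come back together because they share an entity and a year, not because they live in the same folder or the same application.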

Once we assume that we have such a data provider, we can start talking about APIs and connectivity. For example, I recently moved house, so I had to let my local city hall know, which is fine with me. But I also had to contact around 10 utility and service companies that sometimes send me paper mail. How did we come to accept dealing with so much bureaucracy and paperwork all the time? I shouldn’t have to think about contacting those 10 companies. All I should be required to do is contact the city hall, which would then send an API request to my data provider to log the event, and my data provider would in turn broadcast the change to all the utility and service companies that may want to contact me, with no additional action needed from me. Of course, there are privacy and security issues with all that, but these should be manageable.
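The flow described above is essentially publish/subscribe. Here is a minimal in-memory sketch of it in Python, with no network, authentication or consent handling, all of which a real system would need. The `DataProvider` class, its method names and the company names are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AddressChange:
    person: str
    new_address: str


class DataProvider:
    """Logs events about a person and broadcasts them to subscribed companies."""

    def __init__(self) -> None:
        self.event_log: list[AddressChange] = []
        self._subscribers: dict[str, Callable[[AddressChange], None]] = {}

    def subscribe(self, company: str, callback: Callable[[AddressChange], None]) -> None:
        # A company registers once to be told about future changes.
        self._subscribers[company] = callback

    def report_address_change(self, event: AddressChange) -> list[str]:
        # Step 1: log the event. Step 2: broadcast it to every subscriber.
        self.event_log.append(event)
        for callback in self._subscribers.values():
            callback(event)
        return sorted(self._subscribers)


# Hypothetical usage: the city hall makes one call, every company is notified.
provider = DataProvider()
received: dict[str, str] = {}

for company in ("Acme Water", "Volt Power", "City Telecom"):
    # Bind the company name now so each callback records under its own name.
    provider.subscribe(company, lambda e, c=company: received.update({c: e.new_address}))

notified = provider.report_address_change(AddressChange("me", "12 New Street"))
print(notified)
print(received)
```

The point of the design is that the person who moves makes one declaration, and the number of companies that need to know no longer matters.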

The next big paradigm shift

I believe that there is a great opportunity in solving the personal file management and personal data search problem, and I’m not alone. At the time I am writing this article, there are 154 companies listed as Personal Data Startups on AngelList, but none of them is widely known or successful at a large scale. The formerly famous Greplin/Cue has shut down, and so have the physical mail digitizing companies Outbox and Zumbox. Evernote has more than 100M users now, but in my opinion, they are on the wrong path: instead of breaking away from the concept of the file and the giant sub-folder trees from the start, they have adapted to it, to the point that they are now crippled by feature creep.

Getting rid of the file and making people deal only with knowledge is going to be difficult, because it is a sharp shift in the way they are used to interacting with their data, and probably also in how they want to own their data. But when it happens, and it will happen, it is going to be one of the most significant advancements in personal computing.

Published in Business and Start-ups