Filenames for publications

There is an unfortunate tendency for people to make papers available on their websites and give them highly descriptive names like types03.pdf or even the gem tr.pdf. I have a directory or two full of files with names like that and despite the promises of Beagle and Spotlight it is very hard to find certain documents.

I know this is something Joshua rightfully complains about, and I'm admittedly guilty of it to some degree myself. So I put it on my to-do list a while back and finally decided to take a few minutes today to sort out the filenames I've assigned to my papers. However, when getting started I was faced with the dilemma of just what the naming convention should be. I did some poking around with Google, but couldn't really find any information on choices that other individuals or organizations have made.

Clearly the year and some approximation of the authors' last names should be used, but there are a lot of trade-offs.

One system that I do know about is the one used by the Church Project at Boston University. If I remember correctly, the filenames and pages are generated from the BibTeX entries. For example, consider the paper »Type inference, principal typings, and let-polymorphism for first-class mixin modules« by Henning Makholm and J. B. Wells. This paper gets a web page named http://types.bu.edu/reports/Mak+Wel:ICFP-2005.html and a PDF file with a very long name. So long that I will instead just describe the format: the last names of the authors separated by plus symbols, a colon, the title of the paper, a colon, the conference venue, a dash, and the year. While very descriptive, this scheme has a few problems. Firstly, the colon is a reserved character in at least a few operating systems. Secondly, the name is just really long. I can easily imagine that, for a paper with several authors, the name would exceed the 255-character filename limit common to many filesystems.
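
To make those two objections concrete, here is a minimal Python sketch. It rebuilds a name from the format as described above (a reconstruction from that description, not the Church Project's actual generator) and checks it for reserved characters and length. The function names are made up for illustration, the reserved-character set is the one Windows rejects, and the 255-byte limit is the per-name limit of filesystems such as ext4; treat all three as assumptions.

```python
# Characters rejected in filenames on Windows (assumed here as the
# "reserved" set; other systems reserve fewer characters).
RESERVED = set('<>:"/\\|?*')
MAX_NAME_BYTES = 255  # common per-name limit, e.g. on ext4


def church_style_name(last_names, title, venue, year):
    """Build 'Author+Author:Title:Venue-Year.pdf' (hypothetical helper)."""
    return f"{'+'.join(last_names)}:{title}:{venue}-{year}.pdf"


def filename_problems(name):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    bad = sorted(set(name) & RESERVED)
    if bad:
        problems.append("reserved characters: " + " ".join(bad))
    size = len(name.encode("utf-8"))
    if size > MAX_NAME_BYTES:
        problems.append(f"{size} bytes exceeds {MAX_NAME_BYTES}")
    return problems


name = church_style_name(
    ["Makholm", "Wells"],
    "Type inference, principal typings, and let-polymorphism "
    "for first-class mixin modules",
    "ICFP",
    2005,
)
print(name)
print(filename_problems(name))
```

For this two-author paper only the colon is flagged; the length check would start to matter for papers with long titles and many authors.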

So if anyone has suggestions for developing a consistent naming convention, let me know. While more descriptive filenames would be useful, perhaps what is really needed is the metadata equivalent of »ls«. Then we would be in the position to simply complain about the fact that people don't fill in the metadata fields of their documents.

7 Comments »

  1. kitby said,

    April 18, 2007 @ 1:48 pm

    I fail to see how the fact that you have a directory full of files named in a meaningless way (to you) is the fault of the folks who posted those files. Renaming files is easy to do, and the world can’t really anticipate your particular storage scheme (or lack thereof).

    In any event, I archive papers, naming the files based on their full names, and post papers with filenames which correspond to my BibTeX key for them. Works well enough for me, though both naming schemes are, unfortunately, prone to collisions (usually when a tech report and conference version of a paper have the same title, author list, and publication year). In the grand scheme of things, it doesn’t really matter (to me) what I name my files. I’m going to search for papers via my BibTeX file anyways (cf. the lack of a “metadata ls”), so I only need a reliable way to go from BibTeX entry to file on disk.

  2. Chris said,

    April 18, 2007 @ 2:31 pm

    I agree that using the BibTeX key is a good idea, and my favorite convention for BibTeX keys comes from citeseer: it’s just first author’s last name, 2-digit year, and first (or most) significant word from the title. No spaces, colons, dashes, or other funky characters. The numbers of the year are enough to separate the other two words. Examples: league03metaocaml, fisher02links, guillemette07statically, etc.

    I don’t yet keep a nicely organized folder of downloaded, indexed PDFs (if I ever go all-Mac, I’m going to seriously consider DevonThink for such things) but I’ve been fantasizing about it for a while now. I do have a filing cabinet for marked-up dead-tree copies of papers — the index for it is my BibTeX file — but there’s always about an 8-month backlog pile of papers sitting around my office.

  3. Joshua Dunfield said,

    April 18, 2007 @ 8:32 pm

    kitby:
    “I fail to see how the fact that you have a directory full of files named in a meaningless way (to you) is the fault of the folks who posted those files.”

    A file named “icfp04.pdf” is almost completely meaningless to everyone, including (some of) the authors if any of them were involved in more than one submission to ICFP that year. And “icfp04.pdf” isn’t the worst out there; I recently saw a “full.pdf” on a very distinguished PL researcher’s homepage.

    I see it as a matter of courtesy to give files more specific names, but even leaving that aside, self-interest should dictate trying to make it as easy as possible for even the most lazy or frazzled colleague to find your paper after they download it. While CS people have a relatively easy time of downloading the same paper repeatedly since most of us put our papers on our homepages (plus Citeseer, etc.), it’s more effort than just opening a descriptively-named file in a local papers directory. If I can find it quickly I’m more likely to cite it, and that’s what makes our little world go ’round.

    To address Geoff’s post itself, I think we shouldn’t worry too much about the specific details of naming, because stubborn people (like me) will find something to disagree with, and will end up renaming many files anyway. Nonetheless, using reasonably descriptive names, whatever the specifics, is a “first line of defense” and at least eliminates name clashes in my downloads directory.

    (FWIW, I usually do something very similar to what Chris does, e.g. Andreoli92_focusing.pdf. I sometimes use a word or phrase that’s not literally in the title, if it seems more descriptive: Kuncak04_SetsDP.pdf for “On Decision Procedures for Set-Valued Fields”. I also call theses simply “Foster_thesis.pdf”, etc.; it’s not like they’re going to write fifteen more, and for some reason I never have any trouble remembering what those are about.)

  4. kitby said,

    April 19, 2007 @ 12:04 am

    From the post:
    perhaps what is really needed is the metadata equivalent of »ls«.

    Depending on how much metadata we’re talking about here, that may be the only thing that I’ll find workable in the long run.

    These days, I find it easier to use something like BibDesk and run a search over all the available data that BibTeX entries have. Given that I’m fairly inconsistent in how I locate a given paper, I can’t think of any file naming scheme that, when applied consistently to every single file, would make searching by filename work any better.

    Though, maybe this only works because I tend to archive or delete papers fairly quickly.

  5. washburn said,

    April 19, 2007 @ 10:08 pm

    I have to agree with Joshua on well-named files being useful to even the authors: I couldn’t tell you which paper of mine is called MS-CIS-05-07.pdf without looking it up.

  6. lists said,

    May 8, 2007 @ 6:35 pm

    I use a simple c/j_firstAuthorName_lastAuthorName_Conf/JournalName_Topic/Idea.pdf

    The c/j stands for conference or journal. I suppose using dashes instead of underscores would make things easier to type, so I would make that one change.

  7. Rafa said,

    August 16, 2007 @ 6:52 pm

    First: if you have very few papers to read and keep, name them with the full title and first author.

    But if you have a lot, or many people share them in a common repository, think about this simple scheme:

    MainAuthorYEARnumber

    where YEAR is the four digit year and number is an optional self-incrementing variable that is a unique identifier of papers belonging to the same author and in the same year.

    That should be enough, as the best idea to organize your bibliography is to use BibTeX as the SOURCE of cataloging, and a nice interface that suits your needs, such as JabRef or Aigaion to search and manage this bibliography. The point is that bibliographies should work in the same way we use a multimedia player to manage a collection of multimedia files and use the metadata associated with them to find, sort, copy, etc., instead of coping directly with filenames, formats, etc.
    Put the whole bunch of pdfs in one directory and let a nicer program cope with linking BibTeX entries to files within a filesystem. Train the computer, then let it do the job for you!
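
For readers who want to try the key conventions suggested in the comments above, here is a rough Python sketch of two of them: the CiteSeer-style key Chris describes (first author's last name, 2-digit year, first significant title word) and Rafa's MainAuthorYEARnumber. The function names and the stop-word list are invented for illustration, and starting the optional counter at 2 is an assumption, since the comment does not say how the number is assigned.

```python
import re

# Small, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "on", "of", "for", "in", "and", "to", "with"}


def citeseer_style_key(first_author_last_name, year, title):
    """Last name + 2-digit year + first significant title word, lowercase."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    significant = next((w for w in words if w not in STOP_WORDS), "")
    # Keeps only ASCII letters, so accented characters are dropped.
    author = re.sub(r"[^a-z]", "", first_author_last_name.lower())
    return f"{author}{year % 100:02d}{significant}"


def author_year_key(main_author, year, existing_keys):
    """MainAuthorYEARnumber; the number is added only to avoid a clash."""
    base = f"{main_author}{year}"
    if base not in existing_keys:
        return base
    n = 2  # assumption: the first duplicate gets the suffix 2
    while f"{base}{n}" in existing_keys:
        n += 1
    return f"{base}{n}"


print(citeseer_style_key(
    "Kuncak", 2004, "On Decision Procedures for Set-Valued Fields"))
print(author_year_key("Makholm", 2005, {"Makholm2005"}))
```

Neither key contains spaces, colons, or other characters that cause trouble in filenames, so either can be used directly as the PDF's name and as the BibTeX key.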
