Preshing on Programming

View Your Filesystem History Using Python

Sometimes, it’s useful to look back on your filesystem history.

For example, after installing some new software, you might want to know which files have changed on your hard drive. Or, if you’re a programmer getting started on a new project, you may need to follow a complex and unfamiliar build process. A list of recently modified files can reveal a lot about how that build process works.

Here’s a short Python script to create such a list. It lists the contents of a folder recursively, sorted by modification time.

As a simple example, I ran it after setting up a fresh copy of my random number sequence project. Here’s the output (with some lines deleted to save space):

2013-01-14 21:44:29       5564 .\build\Testing\Temporary\LastTest.log
2013-01-14 21:44:29         29 .\build\Testing\Temporary\CTestCostData.txt
------------------------------
2013-01-14 21:28:38         91 .\build\Win32\Release\ALL_BUILD\ALL_BUILD.lastbuildstate
2013-01-14 21:28:38       1560 .\build\Win32\Release\ALL_BUILD\custombuild.command.1.tlog
2013-01-14 21:28:38       6386 .\build\Win32\Release\ALL_BUILD\custombuild.read.1.tlog
2013-01-14 21:28:38        674 .\build\Win32\Release\ALL_BUILD\custombuild.write.1.tlog
2013-01-14 21:28:38         51 .\build\CMakeFiles\generate.stamp
2013-01-14 21:28:37         91 .\build\RandomSequence.dir\Release\RandomSequence.lastbuildstate
2013-01-14 21:28:37        678 .\build\RandomSequence.dir\Release\mt.command.1.tlog
2013-01-14 21:28:37        818 .\build\RandomSequence.dir\Release\mt.read.1.tlog
2013-01-14 21:28:37        446 .\build\RandomSequence.dir\Release\mt.write.1.tlog
2013-01-14 21:28:37       7680 .\build\Release\RandomSequence.exe
...
------------------------------
2013-01-14 21:28:21         86 .\build\CMakeFiles\cmake.check_cache
2013-01-14 21:28:21      12856 .\build\CMakeCache.txt
2013-01-14 21:28:21       3712 .\build\RandomSequence.sln
2013-01-14 21:28:21        270 .\build\CMakeFiles\TargetDirectories.txt
2013-01-14 21:28:21        391 .\build\CTestTestfile.cmake
2013-01-14 21:28:21       1586 .\build\cmake_install.cmake
2013-01-14 21:28:21       4204 .\build\CMakeFiles\generate.stamp.depend
2013-01-14 21:28:21      25207 .\build\ZERO_CHECK.vcxproj
2013-01-14 21:28:21        832 .\build\ZERO_CHECK.vcxproj.filters
...
------------------------------
2013-01-14 21:27:40        959 .\randomsequence.h
2013-01-14 21:27:40        416 .\.git\index
2013-01-14 21:27:40       1255 .\main.cpp
2013-01-14 21:27:40        714 .\README.md
2013-01-14 21:27:40        246 .\CMakeLists.txt
2013-01-14 21:27:40         12 .\.gitignore
2013-01-14 21:27:40        336 .\.git\config
2013-01-14 21:27:40        201 .\.git\logs\refs\heads\master
2013-01-14 21:27:40        201 .\.git\logs\HEAD
...

The horizontal dashes separate modifications that are 10 seconds or more apart, which helps organize the files visually into groups. Reading from the bottom up, you can see the groups of files created by git clone, project files generated by cmake, the build output from cmake --build, and a couple of files written by ctest.

I’ve used this kind of script to help make sense of the filesystem on Ubuntu, and to figure out where files were written on MacOS X using the App Store.

Command-Line Options

Running with no options or with --help displays the following help message:

Usage: list_modifications.py [options] path [path2 ...]

Options:
  -h, --help    show this help message and exit
  -g SECS       set threshold for grouping files
  -f EXC_FILES  exclude files matching a wildcard pattern
  -d EXC_DIRS   exclude directories matching a wildcard pattern

You can filter the output using -f and -d. For example:

list_modifications.py -d obj* -f *.log -f *.bin -g 30 .git build\CMakeFiles

The above command lists the contents of the .git and build\CMakeFiles folders, excluding the objects subfolder and any files ending in .log or .bin. It also groups files modified within 30 seconds of each other, instead of the default 10.

A Quick Look at the Code

This script is a pretty good example of the kind of problem Python can solve quickly using very little code. Here’s a quick run-through.

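import optparse

# Describe the command-line interface; --help is generated automatically.
parser = optparse.OptionParser(usage='Usage: %prog [options] path [path2 ...]')
# -g takes a whole number of seconds.
parser.add_option('-g', action='store', type='int', dest='secs', default=10,
                  help='set threshold for grouping files')
# -f and -d may be repeated; each occurrence is appended to a list.
parser.add_option('-f', action='append', type='string', dest='exc_files', default=[],
                  help='exclude files matching a wildcard pattern')
parser.add_option('-d', action='append', type='string', dest='exc_dirs', default=[],
                  help='exclude directories matching a wildcard pattern')
# Leftover positional arguments (the paths to scan) end up in roots.
options, roots = parser.parse_args()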

This block of code takes care of all command-line option parsing using the built-in optparse module. optparse was deprecated in Python 2.7 in favor of argparse, but it’s handy and has been available since Python 2.3. The --help option is handled automatically.
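For readers who would rather use argparse, here is a rough sketch of what an equivalent parser might look like. This is not part of the original script: the build_parser helper, the description text, and the name of the positional argument are assumptions made for illustration, and the argument list is passed in explicitly so the snippet runs on its own.

```python
import argparse

def build_parser():
    """Hypothetical argparse counterpart of the optparse options above."""
    parser = argparse.ArgumentParser(
        description='List files recursively, sorted by modification time.')
    parser.add_argument('-g', type=int, dest='secs', default=10,
                        help='set threshold for grouping files')
    parser.add_argument('-f', action='append', dest='exc_files', default=[],
                        help='exclude files matching a wildcard pattern')
    parser.add_argument('-d', action='append', dest='exc_dirs', default=[],
                        help='exclude directories matching a wildcard pattern')
    # With argparse, positional arguments are declared like any other argument.
    parser.add_argument('roots', nargs='+', metavar='path',
                        help='one or more folders to scan')
    return parser

# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(['-g', '30', '-f', '*.log', '.git'])
print(args.secs, args.exc_files, args.exc_dirs, args.roots)
```

The main visible difference is that argparse returns a single namespace, so the paths live in args.roots rather than in a separate list.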

The -f option uses the 'append' action with a default of [], which means the user can specify -f multiple times, creating a list. In the previous example, we end up with options.exc_files set to ['*.log', '*.bin']. Any leftover positional arguments are assigned to roots as another list; in the previous example, roots becomes ['.git', 'build\\CMakeFiles'].
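To see the 'append' behavior without touching a real command line, you can hand parse_args an explicit list of arguments. The sketch below rebuilds the parser in a shortened form (help strings omitted, and -g declared with type='int') and feeds it the arguments from the earlier example; the argv variable is just an illustrative name.

```python
import optparse

parser = optparse.OptionParser(usage='Usage: %prog [options] path [path2 ...]')
parser.add_option('-g', action='store', type='int', dest='secs', default=10)
parser.add_option('-f', action='append', type='string', dest='exc_files', default=[])
parser.add_option('-d', action='append', type='string', dest='exc_dirs', default=[])

# The same arguments as the example command, already split the way a shell would.
argv = ['-d', 'obj*', '-f', '*.log', '-f', '*.bin', '-g', '30',
        '.git', 'build\\CMakeFiles']
options, roots = parser.parse_args(argv)

print(options.exc_dirs)   # patterns collected from -d
print(options.exc_files)  # patterns collected from each -f, in order
print(options.secs)       # converted to an integer by optparse
print(roots)              # leftover positional arguments
```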

import fnmatch
import os

def iterFiles(options, roots):
    """ A generator to enumerate the contents of directories recursively. """
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Match exclusion patterns against the last path component only.
            name = os.path.split(dirpath)[1]
            if any(fnmatch.fnmatch(name, w) for w in options.exc_dirs):
                del dirnames[:]  # Don't recurse here
                continue
            for fn in filenames:
                if any(fnmatch.fnmatch(fn, w) for w in options.exc_files):
                    continue
                path = os.path.join(dirpath, fn)
                # lstat does not follow symbolic links.
                stat = os.lstat(path)
                # Take the later of the content-modification time and st_ctime
                # (metadata change on Unix, creation time on Windows).
                mtime = max(stat.st_mtime, stat.st_ctime)
                yield mtime, stat.st_size, path

iterFiles looks like a function definition, but the presence of the yield statement in the body means it actually defines a generator. As such, calling iterFiles() does not run any of the code in its body. It returns an iterator, and the body only runs, one yield at a time, as that iterator is consumed, for example in a for loop, as we’ll see later.
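A toy generator makes this lazy behavior easy to observe. The count_up function and the events list below are made up for this illustration and have nothing to do with the script itself; they only record when the generator’s body actually runs.

```python
events = []  # records when the generator body actually executes

def count_up(limit):
    """A toy generator that logs when its body starts and finishes."""
    events.append('started')
    for i in range(limit):
        yield i
    events.append('finished')

gen = count_up(3)                 # creates the iterator; the body has not run
events_after_call = list(events)  # snapshot taken before any iteration

values = list(gen)                # iterating drives the body to completion

print(events_after_call)
print(values)
print(events)
```

The first snapshot is empty because nothing in the body executes until the iterator is asked for its first value.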

iterFiles uses the os.walk generator, which lets us modify the contents of dirnames in-place during iteration. In particular, we clear the contents of the list using del dirnames[:] to avoid descending into certain subdirectories.
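Here is a small, self-contained sketch of pruning with os.walk, run against a throwaway directory tree. It uses a slight variant of the technique: instead of clearing dirnames once we are already inside an excluded folder, it filters the unwanted names out of dirnames before os.walk descends into them. The walk_pruned helper and the folder layout are assumptions made up for the demonstration.

```python
import os
import tempfile

def walk_pruned(root, skip_name):
    """Return sorted relative file paths under root, skipping dirs named skip_name."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):  # top-down by default
        # Slice assignment edits the very list os.walk uses to decide where
        # to recurse; rebinding the name dirnames would have no effect.
        dirnames[:] = [d for d in dirnames if d != skip_name]
        for fn in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, fn), root))
    return sorted(found)

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: src/main.cpp, src/objects/b.bin, objects/a.bin
    for sub in ('src', os.path.join('src', 'objects'), 'objects'):
        os.makedirs(os.path.join(root, sub))
    for rel in (os.path.join('src', 'main.cpp'),
                os.path.join('src', 'objects', 'b.bin'),
                os.path.join('objects', 'a.bin')):
        open(os.path.join(root, rel), 'w').close()
    kept = walk_pruned(root, 'objects')

print(kept)  # only the file outside any "objects" folder remains
```

In-place pruning like this only works in top-down mode, which is the default for os.walk.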

In the above code, the expression any(fnmatch.fnmatch(name, w) for w in options.exc_dirs) contains a generator expression. It’s a lot like a list comprehension, but it produces its values one at a time instead of building a list, and we’re allowed to omit its own parentheses since it is the only argument passed to the function. In this case, the any function will return True as soon as fnmatch.fnmatch(name, w) returns True for some pattern w in options.exc_dirs, and False if no pattern matches.
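The same pattern can be pulled out into a small helper to try it in isolation. The is_excluded name and the sample patterns below are illustrative only; the second call shows the list-comprehension spelling, which gives the same answer but builds the whole list before any looks at it.

```python
import fnmatch

def is_excluded(name, patterns):
    """True if name matches at least one shell-style wildcard pattern."""
    # Generator expression: any() stops at the first pattern that matches.
    return any(fnmatch.fnmatch(name, p) for p in patterns)

patterns = ['*.log', '*.bin']

print(is_excluded('lasttest.log', patterns))
print(is_excluded('main.cpp', patterns))
print(is_excluded('objects', ['obj*']))

# Equivalent result using a list comprehension inside any().
print(any([fnmatch.fnmatch('main.cpp', p) for p in patterns]))
```

Note that fnmatch.fnmatch follows the case rules of the operating system, so mixed-case names may match differently on Windows than on Linux.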

import time

ptime = 0  # modification time of the previously printed file
for mtime, size, path in sorted(iterFiles(options, roots), reverse=True):
    # Draw a separator when this file is at least options.secs older.
    if ptime - mtime >= options.secs:
        print('-' * 30)
    timeStr = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime))
    print('%s %10d %s' % (timeStr, size, path))
    ptime = mtime

Here, we feed the iterFiles generator to sorted, resulting in a sorted list of 3-tuples. Tuples compare element by element, so the list is sorted by the first item in the tuple – the modification time – and reverse=True puts the newest files first, which is exactly what we want. We loop through, writing one line of formatted output for each tuple, plus a separator line whenever the gap to the previous file is at least options.secs seconds. Since Python lets us multiply a string by an integer, '-' * 30 is used as a shortcut for drawing horizontal lines.
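To check the sorting and grouping logic without touching the filesystem, the loop can be recast as a function that returns groups instead of printing them. This is a sketch under that assumption, not the script’s own code: group_by_gap and the sample tuples are invented for the example, and the timestamps are small integers rather than real modification times.

```python
def group_by_gap(entries, secs=10):
    """Sort (mtime, size, path) tuples newest first, then start a new group
    wherever two consecutive mtimes are at least secs apart."""
    groups = []
    ptime = None
    for entry in sorted(entries, reverse=True):
        mtime = entry[0]
        if ptime is None or ptime - mtime >= secs:
            groups.append([])  # corresponds to printing a separator line
        groups[-1].append(entry)
        ptime = mtime
    return groups

# Hypothetical entries: (mtime, size, path)
sample = [(100, 1, 'a'), (131, 2, 'b'), (105, 3, 'c'), (130, 4, 'd')]

for group in group_by_gap(sample):
    print([path for _, _, path in group])
```

With the default threshold of 10, 'b' and 'd' land in one group and 'c' and 'a' in another, since the only gap of 10 or more is between 130 and 105.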

That’s all there is to it! Hopefully, some readers have managed to pick up a few nuggets of Pythonic goodness along the way.