Andrew Channels Dexter Pinion

Wherein I write some stuff that you may like to read. Or not, it's up to you really.

September 05, 2005

Checking my file copies

I spent a fair part of the weekend rearranging the digital photo files on my hard drive. I moved from a single folder to a nice nested structure which organises the pictures by date. Before I deleted the files from the original folder I wanted to ensure that I'd copied every one. So I wrote some code, which I now present with a small request: how could I have done this better?

By which I mean more elegantly, more efficiently or more readably, with the only constraint being that I'm not going to change the language it's written in. Answers in the comments, please.

import os
from sets import Set  # Python 2.3; on 2.4 and later the built-in set does the same job

def check(directoryA, directoryB):
    "Check that every file in directoryA is present somewhere under directoryB"
    # Collect the name of every file anywhere under directoryB
    filelist = Set()
    for dirpath, dirnames, filenames in os.walk(directoryB):
        for file in filenames:
            filelist.add(file)
    # Report each non-hidden file in directoryA whose name was not found
    missing = []
    for file in os.listdir(directoryA):
        if not file.startswith('.') and file not in filelist:
            # print "%s is missing" % file
            missing.append(file)
    return missing

Posted by Andy Todd at September 05, 2005 07:17 PM

Comments

Sort of more elegant and sort of not: this is a bit untested but it should work.

def check(dirA,dirB):
  nested_files = [files for dp,dn,files in os.walk(dirB)
  files_in_B = [x for subseq in nested_files for x in subseq] # flatten list
  missing = [file for file in os.listdir(dirA) if file not in files_in_B]
  return missing

Posted by: Stuart Langridge on September 5, 2005 08:40 PM

(ah, oops, put the ] on the end of the "nested_files =" line)

Posted by: Stuart Langridge on September 5, 2005 08:42 PM

Might it be worth checking that the files have the correct content? It'll take a *lot* longer, naturally, but might it be worth comparing MD5 hashes?

Posted by: Simon Brunning on September 5, 2005 09:09 PM
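
For what it's worth, the MD5 idea might look something like the sketch below. It is only a sketch: md5_of and check_contents are names made up for illustration, and it assumes a Python recent enough to have the hashlib module, which is newer than the code in the post. It matches on content alone, so a copy that was renamed along the way still counts as present.

```python
import hashlib
import os


def md5_of(path, chunk_size=65536):
    "Return the hex MD5 digest of the file at path, read a chunk at a time"
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # iter() with a sentinel keeps calling read() until it returns b''
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def check_contents(directoryA, directoryB):
    "Return names in directoryA with no byte-identical file under directoryB"
    # Hash every file under directoryB once and remember the digests
    copies = set()
    for dirpath, dirnames, filenames in os.walk(directoryB):
        for name in filenames:
            copies.add(md5_of(os.path.join(dirpath, name)))
    missing = []
    for name in sorted(os.listdir(directoryA)):
        path = os.path.join(directoryA, name)
        # Skip hidden files (as the original does) and subdirectories
        if name.startswith('.') or not os.path.isfile(path):
            continue
        if md5_of(path) not in copies:
            missing.append(name)
    return missing
```

Every file on both sides gets read in full, which is where the "a *lot* longer" comes from.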

OK Stuart, can you explain what on earth

[x for subseq in nested_files for x in subseq]

does? Because it makes me screw up my eyes and squint at it.

Posted by: Andy Todd on September 5, 2005 11:00 PM

Oh hang on, reading the manual [1] helps.

It's the equivalent of:

for subseq in nested_files:
    for x in subseq:
        .. do stuff ..

[1] http://www.python.org/peps/pep-0202.html

Posted by: Andy Todd on September 5, 2005 11:06 PM
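
To make the equivalence concrete, here is a tiny self-contained comparison. The file names are invented purely for the example; the point is that the "for" clauses in the comprehension read left to right in the same order as the nested loops.

```python
# Stand-in for the per-directory file lists that os.walk hands back
nested_files = [['a.jpg', 'b.jpg'], [], ['c.jpg']]

# The comprehension: outermost loop first, innermost loop last
flat = [x for subseq in nested_files for x in subseq]

# The same thing spelled out as ordinary loops
spelled_out = []
for subseq in nested_files:
    for x in subseq:
        spelled_out.append(x)

assert flat == spelled_out == ['a.jpg', 'b.jpg', 'c.jpg']
```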

Andy: yeah, it's a nested list comprehension. I only found out about them about three days ago when I googled for "python flatten list" :)

Posted by: Stuart Langridge on September 6, 2005 05:48 PM

Haven't played with it but this could take the strain: http://docs.python.org/lib/module-filecmp.html

Posted by: AndyB on September 7, 2005 03:05 AM
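
One way filecmp could be pressed into service, offered as a rough sketch rather than anything tested on real photos: check_with_filecmp is a made-up name, and the approach assumes a copy keeps the same file name as its original. It finds every same-named file under directoryB and asks filecmp.cmp whether the contents match.

```python
import filecmp
import os


def check_with_filecmp(directoryA, directoryB):
    "Return names in directoryA lacking an identical same-named file under directoryB"
    # Map each file name under directoryB to every path it appears at
    locations = {}
    for dirpath, dirnames, filenames in os.walk(directoryB):
        for name in filenames:
            locations.setdefault(name, []).append(os.path.join(dirpath, name))
    missing = []
    for name in sorted(os.listdir(directoryA)):
        original = os.path.join(directoryA, name)
        # Skip hidden files (as the original does) and subdirectories
        if name.startswith('.') or not os.path.isfile(original):
            continue
        # shallow=False compares contents instead of trusting matching os.stat() data
        for copy in locations.get(name, []):
            if filecmp.cmp(original, copy, shallow=False):
                break
        else:
            # No same-named file under directoryB had identical contents
            missing.append(name)
    return missing
```

Because only same-named files are compared, this reads far fewer bytes than hashing everything, at the cost of reporting a renamed copy as missing.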

Why not really use sets:

# for Python 2.4
# from sets import Set as set # uncomment for 2.3
import os

def check(directoryA, directoryB, verbose=True):
    "Check that every file in directoryA is present somewhere under directoryB"
    originals = set(os.listdir(directoryA))
    copies = set()
    for dirpath, dirname, filenames in os.walk(directoryB):
        copies |= set(filenames)
    missing = originals - copies
    if verbose:
        for name in missing:
            if not name.startswith('.'):
                print "%s is missing" % name
    return missing

Posted by: Scott David Daniels on September 10, 2005 07:57 AM