[Tutorial] How to compare files in Python

Note: this is not the best or only way to do this task; it’s merely how I managed to do it.

Assume you have several repositories (central locations in which data is stored and managed), and each repository should contain three files that could be identical. The task is to see whether they are identical, or how close they are, when compared to a “control” repository that’s chosen by the user.

Now, one of the files is located in one folder in each repository; let’s call it .ab. The other two files are located in a second folder in each repository; let’s call it .bc.

How can we compare every repository’s files to the “control”?

==========

LIBRARIES

==========

The first thing we need to know is which libraries to import:

# Used to get location of files and test if files exist
import os

# Used to compare the two files
import difflib

The os library gives us all kinds of great abilities, like building filepaths and checking whether files exist (which is super important).

The difflib library gives us the ability to compare two lists of lines and return the differences, which is the most important part.

==========

VARIABLES

==========

We’ll need some variables:

control_repo = "apps-authentication"
folder_list = [".ab", ".bc", ".bc"]
text_files = ["config.yml", "01_packages.config", "02_python.config"]
output = open("output.txt", "w")
list_of_repos = ["apps-authentication", "apps-credentials", "apps-workflow", "apps-testFolder"]

control_repo holds the name of the control repo, which contains our “baseline” files.

Then I made a list of the folders I would be checking. You’ll note that I’ve included ‘.bc’ twice. That’s because two of the files are located in that folder, and folder_list and text_files are paired up by index, so each file needs its own folder entry.

text_files is the variable containing the files within the folders I want to check.

output is a text file I opened to write the results to, since that seemed like the best way to display the output data. The “w” is for “write”, and “output.txt” creates a text file by that name in the current working directory (my desktop, in my case).

list_of_repos is a list of names of repos (as strings). “apps-testFolder” is an empty folder I intentionally created for testing purposes.
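
Because folder_list and text_files are matched up by position, one way to make that pairing explicit is zip(). This is only a small sketch written for Python 3; the name targets is my own and does not appear in the original script.

```python
folder_list = [".ab", ".bc", ".bc"]
text_files = ["config.yml", "01_packages.config", "02_python.config"]

# Pair each folder with the file that lives in it, position by position.
# "targets" is a hypothetical name used only for this illustration.
targets = list(zip(folder_list, text_files))

for folder, filename in targets:
    print("%s/%s" % (folder, filename))
```

With the pairs stored together, it is harder for the two lists to drift out of step when a file is added or removed.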

==========

MORE ON LIBRARIES

==========

There were some other variables and tools I needed that were incredibly useful:

my_dir = os.path.abspath(os.path.dirname(__file__))

These are two os.path functions nested inside each other:

  1. os.path.dirname(__file__) returns the folder your .py file is in. I made sure to put that file in the same folder as the repos (repositories).
  2. os.path.abspath() returns the absolute (full) filepath of whatever you pass to it.

What is a filepath? It’s like something’s address:

c://users/Desktop/Projects/

It tells you where something is located on the computer. So my_dir is basically my saved address for where I’m working, which is handy because everything I want to check is here and we’ll need to use that address a lot.
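
To see the two calls separately, here is a minimal Python 3 sketch. It uses a made-up relative path in place of __file__ (which only exists when a script is run from a file), so script_path and its value are assumptions for illustration.

```python
import os

# Hypothetical stand-in for __file__: a script called compare.py
# sitting inside a folder called Projects.
script_path = os.path.join("Projects", "compare.py")

# Step 1: strip the file name, leaving the folder that contains it.
script_dir = os.path.dirname(script_path)

# Step 2: expand that folder to a full path, based on the current
# working directory.
my_dir = os.path.abspath(script_dir)

print(script_dir)
print(my_dir)
```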

Then there’s:

os.path.join(my_dir, list_of_repos[x], folder_list[y], text_files[y])

This returns a path by joining the pieces together. It takes my_dir and adds each following piece onto it, with the right separator in between, to form a new extended address. So for our example it will look something like:

c://users/Desktop/Projects/list_of_repos[x]/folder_list[y]/text_files[y]

In the real path, each list lookup is replaced by the string stored at that index (x for the repo, y for the folder and file).

This allows us to construct the exact address of where we want to be.
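
Here is a short Python 3 sketch of that join with concrete indexes. The value of my_dir is a placeholder I picked for the example; in the real script it comes from os.path.abspath().

```python
import os

list_of_repos = ["apps-authentication", "apps-credentials", "apps-workflow", "apps-testFolder"]
folder_list = [".ab", ".bc", ".bc"]
text_files = ["config.yml", "01_packages.config", "02_python.config"]

my_dir = "Projects"  # placeholder; the real script uses an absolute path
x, y = 1, 0          # second repo, first folder/file pair

# os.path.join inserts the correct separator for the operating system.
path = os.path.join(my_dir, list_of_repos[x], folder_list[y], text_files[y])
print(path)
```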

Then there’s:

not os.path.isfile(os.path.join(my_dir, list_of_repos[x], folder_list[y], text_files[y]))

This is super useful because it checks whether the address actually leads to a file that exists (the expression above is True when it does not). In this situation I really wanted to make sure that a missing file was handled instead of crashing the program.
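
As a quick illustration of what os.path.isfile() reports, here is a Python 3 sketch that works in a temporary folder. The file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    # Create one real file so there is something to find.
    existing = os.path.join(base_dir, "config.yml")
    with open(existing, "w") as handle:
        handle.write("key: value\n")

    missing = os.path.join(base_dir, "no_such_file.yml")

    existing_found = os.path.isfile(existing)   # a real file
    missing_found = os.path.isfile(missing)     # nothing at this path
    folder_is_file = os.path.isfile(base_dir)   # exists, but is a folder

print(existing_found, missing_found, folder_is_file)
```

Note that a folder does not count as a file, so an empty repo folder such as “apps-testFolder” fails the check for every file inside it.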

Those are most of the os tools. Then there’s difflib, from which we only use one function:

diff = difflib.unified_diff(control_file.readlines(), test_file.readlines())

This compares the control_file and the test_file and returns the differences.

That reminds me that I need to explain control_file and test_file. They’re file objects returned by open(), and they have a method called .readlines() which reads every line of the file into a list.

So .readlines() turns each file into a list of lines, and difflib.unified_diff() compares the two lists and produces the differences.
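
To show what comes back, here is a small Python 3 sketch that feeds two hand-written lists of lines to difflib.unified_diff(). The line contents and the fromfile/tofile labels are invented for the example; the original script does not pass labels.

```python
import difflib

# Each string ends with "\n", just like the lines readlines() returns.
control_lines = ["alpha\n", "beta\n", "gamma\n"]
test_lines = ["alpha\n", "delta\n", "gamma\n"]

# unified_diff returns a generator; list() collects its lines.
diff_lines = list(difflib.unified_diff(
    control_lines, test_lines, fromfile="control", tofile="test"))

print("".join(diff_lines))
```

The result starts with two header lines and a range line, followed by the lines themselves, each marked with a one-character prefix.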

==========

CODE

==========

Now this is where I use nested for-loops.

The first loop goes from 0 up to (but not including) the length of folder_list, which is len(folder_list). So at y = 0 it selects the folder at folder_list[0] and the text file at text_files[0].

Then it enters the second for-loop, which goes from 0 up to the length of list_of_repos, len(list_of_repos). Two loops are required because every folder/file pair has to be checked in every repo, and the two lists are independent of each other (they are not necessarily the same length).

# Two for loops: one to go through all of the folders and files, the other
# to go through all the repos

# y goes through the folders & files being compared
# x goes through the repos
for y in range(len(folder_list)):
    for x in range(len(list_of_repos)):

Now my if-statements begin. I want to catch a missing file before I go forward with any of my necessary checks.

        # Test to see if the file exists, if it does not, return error message
        if not os.path.isfile(os.path.join(my_dir, list_of_repos[x], folder_list[y], text_files[y])):
            print >> output, "\n=============="
            print >> output, "%s/%s/%s not found." % (list_of_repos[x], folder_list[y], text_files[y])
            print >> output, "=============="

So first if the file doesn’t exist (os.path.isfile() checks if this address is legitimate) then just say, “Hey, this can’t be found.”

But then I was like, “Wait, what if the file is missing from the control repo?” So I check for that too. It’s basically the same as above, except that it only checks the control repo:

        # Test to see if the control_repo exists, if not: return error message
        elif not os.path.isfile(os.path.join(my_dir, control_repo, folder_list[y], text_files[y])):
            print >> output, "\n=============="
            print >> output, "control_repo:\n %s/%s/%s not found." % (control_repo, folder_list[y], text_files[y])
            print >> output, "=============="

elif (short for “else if”) makes the program run this check only when the previous condition was not met.

Now, what if the control repo is checking itself? I tried to address this here:

        # Test to see if the selected repo IS the control_repo, if so: just return message
        elif (list_of_repos[x] == control_repo):
            print >> output, "\n=============="
            print >> output, "%s/%s/%s is the same" % (control_repo, folder_list[y], text_files[y])
            print >> output, "=============="

It’s easier than the previous two checks; it just tests whether the string in list_of_repos matches the string in control_repo. Easy.
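
The order of these checks matters, so here is a sketch in Python 3 that isolates the decision as a small function. The function name classify and its return strings are my own invention for illustration; the original script does this inline with if/elif.

```python
def classify(repo, control_repo, test_exists, control_exists):
    """Return which branch of the if/elif chain applies (hypothetical helper)."""
    if not test_exists:
        return "test file not found"
    elif not control_exists:
        return "control file not found"
    elif repo == control_repo:
        return "is the control"
    else:
        return "compare"


# A missing test file is reported first, even if the control is missing too.
print(classify("apps-testFolder", "apps-authentication", False, False))
# The control repo is never diffed against itself.
print(classify("apps-authentication", "apps-authentication", True, True))
# Everything else goes on to the real comparison.
print(classify("apps-workflow", "apps-authentication", True, True))
```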

OK, we made it this far, that means the files we are checking exist, so we get to the “else” statement (sort of the catchall):

        # Open control_file and test_file then compare them and print off differences (use difflib)
        else:
            control_file = open(os.path.join(my_dir, control_repo, folder_list[y], text_files[y]))
            test_file = open(os.path.join(my_dir, list_of_repos[x], folder_list[y], text_files[y]))

            print >> output, "\n=============="
            print >> output, "%s\n%s vs %s" % (text_files[y], list_of_repos[x], control_repo)
            print >> output, "=============="

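            # Compare the two lists of lines; unified_diff yields the
            # differences one line at a time
            diff = difflib.unified_diff(control_file.readlines(), test_file.readlines())

            # Write the differences to the output file. Each row already ends
            # with a newline, so write it as-is (print would add a second one)
            for row in diff:
                output.write(row)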

            # Close the files opened above
            test_file.close()
            control_file.close()

# Close the output file so that everything written is flushed to disk.
output.close()

I’ve been using os.path.join() all this time to give me the address to direct the program to. Here I use it with the open() function to open the specified file.

Now I have two files (test_file, control_file), which I turn into lists with .readlines() and then feed into difflib.unified_diff().

Now this code prints it to the text file “output.txt”:

print >> output,

It does that with whatever I put after the comma. This is Python 2 syntax; in Python 3 the equivalent is print(text, file=output). The variable diff produces the differences one line at a time, so the loop goes through each “row” and sends it to the text file.

After that, we use the .close() method to close our files. Closing the output file makes sure everything is actually written to disk, so it’s a good habit.
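
Since closing files by hand is easy to forget, here is a sketch of the same loop for Python 3 that uses with blocks, which close files automatically. This is not the original script: the function names, the demo repo names, and the use of io.StringIO in place of “output.txt” are all assumptions made for the example.

```python
import difflib
import io
import os
import tempfile


def write_banner(out, message):
    """Write a message framed by separator lines."""
    out.write("\n==============\n%s\n==============\n" % message)


def compare_repos(base_dir, control_repo, repos, targets, out):
    """Compare each (folder, filename) target in every repo to the control."""
    for folder, filename in targets:
        control_path = os.path.join(base_dir, control_repo, folder, filename)
        for repo in repos:
            test_path = os.path.join(base_dir, repo, folder, filename)
            if not os.path.isfile(test_path):
                write_banner(out, "%s/%s/%s not found." % (repo, folder, filename))
            elif not os.path.isfile(control_path):
                write_banner(out, "control_repo:\n %s/%s/%s not found."
                             % (control_repo, folder, filename))
            elif repo == control_repo:
                write_banner(out, "%s/%s/%s is the same"
                             % (control_repo, folder, filename))
            else:
                write_banner(out, "%s\n%s vs %s" % (filename, repo, control_repo))
                # The with block closes both files, even if an error occurs.
                with open(control_path) as control_file, open(test_path) as test_file:
                    diff = difflib.unified_diff(
                        control_file.readlines(), test_file.readlines(),
                        fromfile=control_repo, tofile=repo)
                    # Diff lines already end in "\n", so write them unchanged.
                    out.writelines(diff)


# Demo on throwaway files (hypothetical repo names and contents).
with tempfile.TemporaryDirectory() as base_dir:
    for repo, text in [("control", "a\nb\n"), ("other", "a\nc\n")]:
        os.makedirs(os.path.join(base_dir, repo, ".ab"))
        with open(os.path.join(base_dir, repo, ".ab", "config.yml"), "w") as handle:
            handle.write(text)

    out = io.StringIO()  # stands in for open("output.txt", "w")
    compare_repos(base_dir, "control", ["control", "other", "missing"],
                  [(".ab", "config.yml")], out)
    report = out.getvalue()

print(report)
```

Passing the output stream in as a parameter means the same function can write to a real file, to the screen, or to an in-memory buffer for testing.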

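After running the script you should be able to open “output.txt” and see your results. It might be hard to parse at first (since it’s not immediately clear). Here’s what the markers at the start of each line mean in a unified diff:

Code    Meaning
'---'   header line for the first file (the control)
'+++'   header line for the second file (the one being tested)
'@@'    range line showing which line numbers the next block covers
'-'     line found only in the first file
'+'     line found only in the second file
' '     line common to both files (shown for context)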
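
The diff shows what changed, but not “how close” two files are as a single number. One option, which the script above does not use, is difflib.SequenceMatcher, whose ratio() method returns a similarity score between 0 and 1. A minimal Python 3 sketch with invented line contents:

```python
import difflib

control_lines = ["alpha\n", "beta\n", "gamma\n"]
test_lines = ["alpha\n", "delta\n", "gamma\n"]

# The first argument (None) means no elements are treated as "junk".
matcher = difflib.SequenceMatcher(None, control_lines, test_lines)

# ratio() is 2 * matches / total elements: here 2 * 2 / 6.
similarity = matcher.ratio()
print("%.2f" % similarity)
```

A score of 1.0 means the two lists of lines are identical, so this could also serve as a quick yes/no check before printing a full diff.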

That should be it. I will attempt this again, only next time I’ll use dictionaries, which ought to be really fun. If you got this far, I hope it was helpful.
