Thursday, March 14, 2013

Monitoring a web page for changes using bash

There's this conference that I'd like to attend and I've heard that it's a hard-to-get-into type conference.  When I go to their site it doesn't have any new info.

Rather than checking the site every day, I'd like to have it monitored and be alerted when something new DOES appear on it.

Now I know there are services like ChangeDetection.com that can monitor it for me, but I wanted to cobble something together with the tools I already have.  I'd also like to have the ability to customize what I consider "a change" when/if I need it.

To that end, I threw together the following bash script.  It monitors a URL and if it detects a change, it sends an email to my gmail account letting me know.

Hope you find it useful.  BTW, I'm using a program called sendEmail to send the email notification.  It's in apt if you're using a debian/ubuntu-like distribution.
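If it's not already installed, something along these lines should pull it in on a Debian/Ubuntu box (note that the package name in apt is all lowercase):

sudo apt-get install sendemail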

#!/bin/bash

# monitor.sh - Monitors a web page for changes
# sends an email notification if the page changes

USERNAME="me@gmail.com"
PASSWORD="itzasecret"
URL="http://thepage.com/that/I/want/to/monitor"

for (( ; ; )); do
    mv new.html old.html 2> /dev/null
    curl "$URL" -L --compressed -s > new.html
    DIFF_OUTPUT="$(diff new.html old.html)"
    if [ -n "$DIFF_OUTPUT" ]; then
        sendEmail -f "$USERNAME" -s smtp.gmail.com:587 \
            -xu "$USERNAME" -xp "$PASSWORD" -t "$USERNAME" \
            -o tls=yes -u "Web page changed" \
            -m "Visit it at $URL"
    fi
    sleep 10   # wait between checks so the loop doesn't hammer the site
done

Then from a bash prompt I run it with the following command:

nohup ./monitor.sh &

Using nohup and throwing it in the background allows me to log out and have the script continue to run.

Monday, February 4, 2013

Fun with Python's multiprocessing module

At my current client, they asked me to write a Python script to aggregate data from a RESTful web service.  Essentially, they have a web service end-point that takes a single customer ID as an argument and it returns some customer profile records as JSON.

They wanted my script to call the web service for each and every customer listed in a text file.  The script will be run from a scheduled job (cron, Windows scheduled task, etc.).  Once productionalized, the text file will contain a large number of customer IDs (a million or so).  All the results need to be stored in a single text file for the run.

My first thought was to use Python's threading features, but I came up with 2 problems with this approach:

  • Python's threading isn't "real" parallelism.  Because of the Global Interpreter Lock, a single Python process only executes one thread at a time, so the server's other cores go unused no matter how many threads I start.
  • Python's urllib and urllib2 packages aren't even thread safe.

Now I know there's a third-party thread-safe urllib3 package, but that doesn't solve the multi-core issue and I wasn't looking to install anything beyond Python 2.7's standard library on the server.  After doing a little research, I came up with the idea of using Python's multiprocessing module.
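Before walking through the real script, here's the multiprocessing idea in miniature.  This is just a toy sketch (not the client code): a pool of worker processes, each one a separate OS process, mapping a plain function over a list of inputs.

#!/usr/bin/env python
"""Toy multiprocessing.Pool example (not the client script)."""

import multiprocessing

def work(item):
    """Stand-in for the real per-customer work."""
    return item * item

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)         # 4 separate OS processes
    results = pool.map(work, range(10))    # the list is split across the processes
    pool.close()
    pool.join()
    print results                          # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]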

The rest of this blog posting is a walkthrough of what I ended up building.

First things first - configuration

Since this is a script that will be run often, and certain configuration settings will need to be updated from time to time by support staff, I decided to extract the configuration out to a separate file.  Initially I had the settings stored in a .ini file, but then I thought: why not use Python for the settings too?  To that end, I have a settings.py file that holds a class containing the settings:

#!/usr/bin/env python
"""
Settings file.  See job.py for more info.
"""

# -----------------------------------------------------------
# Edit the values in this Settings class to control settings.
# -----------------------------------------------------------

class Settings:
    """Settings as a class.  Why?  Because I'm lazy."""

    source_file = 'TestData.csv'

    profile_data_file = 'results.txt'

    log_file = 'log.txt'

    failed_customer_file = 'failed.txt'

    process_count = 10

    url = 'http://division-1.internal-server.com/rest-service/endpoint/{0}'

There is one input file (source_file).  This is the file that contains the list of customer IDs.  Then there are 3 output files:  profile_data_file, log_file and failed_customer_file.  The profile_data_file file is where the script will put the customer detail records that it receives from the web service.  The log_file is where the script will write logging messages for debugging problems after the fact.  Finally, failed_customer_file is where the script will write the IDs of customers that the script fails to retrieve from the web service.
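I can't show the client's real input file, but given how the script reads it (skip a header row, then take one customer ID per line), TestData.csv looks something like this (the header text and IDs below are made up):

customer_id
10001
10002
10003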

process_count is a setting that will allow you to specify how many parallel processes the script should spawn.

url is the URL of the web service end-point; the {0} is a placeholder that the worker function will replace with a customer ID.
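For example, with a made-up customer ID of 12345, the substitution that the worker function performs later looks like this:

>>> Settings.url.replace('{0}', '12345')
'http://division-1.internal-server.com/rest-service/endpoint/12345'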

The main job script is called job.py.  It imports the settings.py module and then creates a new class that inherits from the Settings class.

class Config(settings.Settings):
    """Configuration class"""

    _runtime = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
   
    @staticmethod
    def timestamp(filename):
        """Stamps a filename with the timestamp for this run"""
        parts = os.path.splitext(filename)
        return "%s-%s%s" % (parts[0], Config._runtime, parts[1])

This class has a static method called timestamp.  In the main job, I will call the timestamp method to add the date/time of the run to the output filenames.  So you might see something like:

outfile = open(Config.timestamp(Config.log_file), 'w')

This will open a file named 'log-20130204-112530.txt' for writing.

The Job class

The Job class is the main class that executes the job and spawns the worker processes.  It has the following methods:
  • __init__ - constructor; sets up a multiprocessing Pool and 3 Queues
  • run - Kicks off the job
  • get_customers - Reads the customer file
  • get_requests - Transforms the customers list into a list of process requests
  • process_customer_queue - Deals with responses from the web service
  • process_log_queue - Deals with log messages from the worker processes
  • process_exceptions_queue - Deals with exceptions that occur in the worker processes
  • log - Writes log messages to the log file
Here's a copy of the Job class:

class Job:
    """Job that performs the run."""

    def __init__(self):
        self._manager = multiprocessing.Manager()
        self._pool = multiprocessing.Pool(Config.process_count)
        self._customer_queue = self._manager.Queue()
        self._log_queue = self._manager.Queue()
        self._exception_queue = self._manager.Queue()

    def run(self):
        """Do it."""
        start_time = datetime.datetime.now()
        self.log('[Run started]')
        customers = self.get_customers()
        requests = self.get_requests(customers)
        result = self._pool.map_async(get_customer, requests)
        while not result.ready():
            self.process_customer_queue()
            self.process_log_queue()
            self.process_exceptions_queue()
        self._pool.close()
        self._pool.join()
        self.process_customer_queue()
        self.process_log_queue()
        self.process_exceptions_queue()
        self.log('[Run finished]')
        self.log('[Total runtime: %s]' % (datetime.datetime.now() - start_time))

    def get_customers(self):
        """Read the source file."""
        buf = [line.strip() for line in open(Config.source_file).readlines()[1:]]
        return buf

    def get_requests(self, customers):
        """Generate requests from customers."""
        requests = [{
            'customer': customer,
            'data': self._customer_queue,
            'log': self._log_queue,
            'exceptions': self._exception_queue,
            'url': Config.url,
        } for customer in customers]
        return requests

    def process_customer_queue(self):
        """Pull messages off the data queue."""
        try:
            message = self._customer_queue.get_nowait()
        except Queue.Empty:
            return
        customer = message['customer']
        details = message['server_response']['details']
        outfile = open(Config.timestamp(Config.profile_data_file), 'a')
        for record in details:
            buf = "%s, %s, %s, %s\n" % (customer, record['id'], \
                record['relevanceScore'], record['relevanceRank'])
            outfile.write(buf)
        outfile.close()
            
    def process_log_queue(self):
        """Pull messages off the log queue."""
        try:
            message = self._log_queue.get_nowait()
        except Queue.Empty:
            return
        self.log(message)

    def process_exceptions_queue(self):
        """Pull messages off the exceptions queue."""
        try:
            message = self._exception_queue.get_nowait()
        except Queue.Empty:
            return
        customer = message['customer']
        exception = message['exception']
        self.log("EXCEPTION GETTING %s! - %s" % \
            (customer, str(exception)))
        failed_file = open(Config.timestamp(Config.failed_customer_file), 'a')
        buf = "%s\n" % customer
        failed_file.write(buf)
        failed_file.close()

    def log(self, message):
        """Write message to the log file."""
        logfile = open(Config.timestamp(Config.log_file), 'a')
        timestamp = datetime.datetime.now().strftime('%Y/%m/%d %H:%M:%S')
        logfile.write("%s - %s\n" % (timestamp, message))
        logfile.close()

Some important things to highlight.  The constructor creates a Pool of worker processes and three queues:  customer_queue, log_queue and exception_queue.  These queues are used for interprocess communication; each worker process communicates back to the parent process through them.  The customer_queue holds the results of the web service calls, the log_queue holds log messages, and the exception_queue holds exceptions that occur when worker processes have problems talking to the web service.

In the run method, the Job class calls the Pool class' map_async method.  The map_async method splits the requests list into chunks and hands each chunk to a worker process for processing.  Then the run method goes into a loop, waiting for all of the worker processes to complete.  While it's waiting, it continually checks the queues to see if any messages have been received.
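Stripped of the application specifics, that run loop boils down to something like this.  It's a simplified sketch (not the production code), but it shows map_async handing work to the pool while the parent drains a shared Manager queue until the result is ready.

import multiprocessing

def worker(request):
    """Runs in a child process; reports back through the shared queue."""
    request['log'].put('processed %s' % request['item'])

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    log_queue = manager.Queue()
    pool = multiprocessing.Pool(4)
    requests = [{'item': i, 'log': log_queue} for i in range(20)]
    result = pool.map_async(worker, requests)
    while not result.ready():            # poll until every worker is done...
        while not log_queue.empty():     # ...draining messages as they arrive
            print log_queue.get()
    pool.close()
    pool.join()
    while not log_queue.empty():         # catch anything queued after ready()
        print log_queue.get()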

The stand-alone function

The map_async method's first argument is the function that each worker process will call for each item in the requests list.  In this case, that function is get_customer, and it looks like this:

def get_customer(request):
    """Retrieve the customer details from the given request."""
    customer = request['customer']
    url = request['url']
    data = request['data']
    log = request['log']
    exceptions = request['exceptions']
    try:
        log.put("Requesting details for customer %s" % customer)
        req_url = url.replace('{0}', customer)
        request = urllib2.Request(req_url)
        response = urllib2.urlopen(request)
        buf = response.read()
        data_records = json.loads(buf)
        data.put({'customer': customer, 'server_response': data_records})
        log.put("Successfully retrieved %d detail(s) for %s" % \
            (len(data_records), customer))
    except Exception, exc:
        exceptions.put({'customer': customer, 'exception': exc})

This function is called once per request inside the worker processes.  It pulls the queues out of the request dictionary and uses each queue's put method to pass information back to the parent process.

Pulling it all together

The last thing in my script file is this:

if __name__ == '__main__':
    # Let's do this!
    Job().run()

It's important to embed the Job().run() call inside the 'if __main__' branch.  This is because of how the multiprocessing internals work.  When the Pool.map_async method is executed, Python creates a number of sub-processes.  On Windows, each of those sub-processes starts up as a fresh copy of the Python interpreter; when it comes up, it imports the job module and then starts calling the get_customer function over and over with the requests.  If Job().run() were not protected by the 'if __main__' branch, every worker process would kick off its own Job on import.  Nothing good would come of that!
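A tiny way to see why that matters (a toy example, unrelated to the job script): any module-level statement runs again in every child process on platforms like Windows that start workers by launching a fresh interpreter.

"""spawn_demo.py - toy illustration of the 'if __main__' guard."""
import multiprocessing

print 'module loaded'                  # prints again in each child on Windows-style spawning

def work(item):
    return item + 1

if __name__ == '__main__':             # only the original process gets past this line
    pool = multiprocessing.Pool(2)
    print pool.map(work, [1, 2, 3])    # [2, 3, 4]
    pool.close()
    pool.join()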

Hope you find this helpful.

Tuesday, January 29, 2013

Create and track svn branches using git-svn

My most recent client is using svn for their source control needs.  While I used to be a big svn guy, I've more recently been using git and have found the transition back to svn a little awkward.

Enter: git-svn.  I love git-svn because I can exist in a svn environment without giving up all the things I love about git.  One thing that continually escapes my memory is how to create a svn branch from the git CLI as well as how to associate a git branch with a svn branch.

This blog entry is my own cheat-sheet.

Say, I want to create a svn branch off of the trunk.  Here's what I would do:

$ git checkout master
$ git svn rebase
$ git svn branch -m "Creating a feature branch" feature_1001
$ git svn dcommit
$ git checkout -b my_feature_1001 feature_1001
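For context, the commands above assume git-svn was set up against a standard trunk/branches/tags layout (e.g. with git svn clone -s), which leaves a branch glob along these lines in .git/config (the url here is just an example):

[svn-remote "svn"]
    url = http://svn.example.com/repo
    fetch = trunk:refs/remotes/trunk
    branches = branches/*:refs/remotes/*
    tags = tags/*:refs/remotes/tags/*

That branches mapping is why the newly created svn branch shows up locally as the remote ref feature_1001, which the final checkout uses as the starting point for the local git branch.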

Another thing that isn't so intuitive is merging your changes back to trunk when you're done.  Here's what I would do:


$ git checkout feature_1001
$ git rebase master
$ git checkout master
$ git merge feature_1001
$ git svn dcommit
$ git branch -D feature_1001

Then I have to use a svn client to actually delete the svn branch.

Also, if a svn branch is removed on the server side, you need to remove the corresponding remote-tracking branch from git by hand with

$ git branch -rd feature_1001



Friday, October 19, 2012

Reformatting for 80 characters in vim

After reading The Pragmatic Programmer and Clean Code, I've pretty much settled on using text files (specifically in markdown format) for all my personal notes and documentation and editing them with vim.  That being said, I like to keep my files limited to 80 columns wide.  That way I can easily review multiple documents side by side in horizontal and vertical splits.

In my .vimrc I have the following setting to get the auto wrapping working for me:

set textwidth=80

That way as I type along, vim is keeping track of the file's column width and when I get to the 80th column it automatically moves me to the next line.  Pretty handy.

Here's my problem though: if I change some text that I've already entered and reword it, my paragraph loses its nicely formatted column width.  Here's an example, but with the textwidth setting set at 40:

Before my edit change:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit.  Proin neque sapien,
facilisis eget tincidunt ut, porttitor
laoreet lacus. Class aptent taciti
sociosqu ad litora torquent per conubia
nostra, per inceptos himenaeos.  Etiam
semper elementum congue.

After my edit change:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit.  Proin neque sapien,
facilisis eget tincidunt ut, porttitor
laoreet lacus. Class aptent taciti
sociosqu ad litora torquent per conubia
nostra, per inceptos himenaeos.  I
forgot some text. Etiam
semper elementum congue.

Notice the problem: after editing the paragraph, the second to last line doesn't go to the 40th column before wrapping.  This is a little annoying to me because I feel the need to reformat the paragraph to get it back to looking like the "before".  It's not too big of a deal when I only have to reformat a line or two, but if I've changed something high up in a fairly long paragraph, reformatting dozens of lines can get tedious.

Here's a little tip that I stumbled across that can make that reformatting go a bit quicker:

1.  Select the whole paragraph with V and then h, j, k and/or l.
2.  Then strip the new-line characters with J.
3.  Finally reformat it with gq.

Hope you find this helpful!

Friday, July 6, 2012

CGI Script to display clone urls


After some discussion, I was able to convince my client to convert from gitorious to gitolite.  One nice feature that people like about gitorious is that its web interface provides an easy way to look up the URLs for cloning repositories.

In my mind, that's a pretty legitimate need.  To that end, I threw together the bash script below, which acts as a CGI, and dropped it in the cgi-bin directory of the server that's running gitolite.  Hope you find this helpful.

#!/bin/bash

REPOSITORY_DIR="/home/git/repositories/"
URL_PREFIX="git clone git@internal-build-server:"

cat <<DONE
Content-type: text/html

<html>
<head>
<title>Repository URLs</title>
<link rel="stylesheet" type="text/css" href="/index.css" />
</head>
<body>
<div id="page_container">
<h1>
Repository URLs
</h1>
<p>
Repository URLs on this server follow a specific pattern.  The pattern is
as follows:
</p>
<center>
git@internal-build-server:<i><font color="darkblue">{category}</font></i>/<i><font color="darkblue">{project}</font></i>
</center>
<p>
These URLs are both pull and push URLs.  You do not need separate URLs
for pulling and pushing.  Access control will be handled by a
server git update hook that is provided by gitolite.
</p>
<p>
In an effort to make life a little easier in locating your URLs, this script
enumerates URLs for the repositories located on this machine below.
</p>
DONE

CATEGORIES=$(find $REPOSITORY_DIR -maxdepth 1 -mindepth 1 -type d -not \
    -iname '*.git' | sed -e "s|$REPOSITORY_DIR||g")

for CATEGORY in $CATEGORIES; do
    echo "<h2>Category: $CATEGORY</h2>"
    CAT_REPOSITORIES=$(find $REPOSITORY_DIR$CATEGORY -type d -iname '*.git' \
        | sed -e "s|$REPOSITORY_DIR||g" -e 's/\.git$//g')
    for REPOSITORY in $CAT_REPOSITORIES; do
        echo "$URL_PREFIX$REPOSITORY<br />"
    done
done

ROOT_REPOSITORIES=$(find $REPOSITORY_DIR -maxdepth 1 -mindepth 1 -type d \
    -iname '*.git' | sed -e "s|$REPOSITORY_DIR||g" -e 's/\.git$//g')
echo "<h2>Uncategorized Repositories</h2>"
for REPOSITORY in $ROOT_REPOSITORIES; do
    echo "$URL_PREFIX$REPOSITORY<br />"
done

cat <<DONE
<br />
</div>
</body>
</html>
DONE

Monday, June 25, 2012

Bash script for pulling/fetching multiple git clones


In my current assignment, I'm acting as the main build guy for a number of projects that use git for source control.  As such, I find it very useful to keep all my git clones up to date whether I'm actively developing in them or not.  Additionally, I need to review the changes other developers are committing, so I'd like to get a summary of recent git activities.

Over time, I've put this little bash script together to help me with that.  I've included the script in this posting so I can remember later what I did and why.  Disclaimer: I wrote this script and run it in bash on Linux (not via git-bash in Windows).  Also, I'm using Zenity for a nicer UI look and feel.

#!/bin/bash

pushd ~/dev/repos > /dev/null

# The log file
PULL_LOG="$(mktemp)"

# Get a list of all the clones in this directory.
CLONES=$(find -maxdepth 2 -mindepth 2 -type d -name ".git" | sed -e 's|\./||' -e 's|/\.git||')

# Get a list of all the branches in clone/branch format
ALL_BRANCHES=$(for clone in $CLONES; do cd $clone; for branch in $(git branch -l | sed 's/\s\|\*//g'); do echo $clone/$branch; done; cd ..; done)

# Count the branches
BRANCH_COUNT=$(echo $ALL_BRANCHES | sed 's/ /\n/g' | wc -l)

# Start the log file
echo "Pull log for $(date)" >> $PULL_LOG
echo "--------------------------------------------------------------------------------" >> $PULL_LOG

# Function for piping output to the zenity progress dialog
function pull_clones() {
    clone_counter=0
    for clone in $CLONES; do
        echo "Pulling branches for clone $clone" >> $PULL_LOG
        echo "--------------------------------------------------------------------------------" >> $PULL_LOG
        cd $clone
        echo "# Fetching changes for clone $clone"
        git fetch origin 2>> $PULL_LOG
        for branch in $(git branch -l | sed 's/\s\|\*//g'); do
            echo "# Merging branch $clone/$branch"
            echo "Merging branch $branch" >> $PULL_LOG
            git checkout $branch 2> /dev/null
            git merge origin/$branch >> $PULL_LOG
            echo | awk '{print count / total * 100}' count=$clone_counter total=$BRANCH_COUNT
            let clone_counter=clone_counter+1
        done
        cd ..
        echo >> $PULL_LOG
    done
}

# Do it
pull_clones | zenity --progress --title='Pulling development clones' --width=512
zenity --text-info --filename=$PULL_LOG --title="Pull log" --width=500 --height=450

# Clean up
rm $PULL_LOG

popd > /dev/null



Wednesday, June 6, 2012

Patching tip using mocks in python unit tests

I use the mock library by Michael Foord in my Python unit tests, and one problem has always plagued me.  Here's the problem and the solution.

Sometimes when I import a package/module in my code I use this pattern (let's call it pattern A):

"""file_module_pattern_a.py"""
import os

def get_files(path):
    """Return list of files"""
    return os.listdir(path)


Other times, I use this pattern (let's call it pattern B):

"""file_module_pattern_b.py"""
from os import listdir

def get_files(path):
    """Return list of files"""
    return listdir(path)

Note the difference.  In pattern A, I import the whole os package, while in pattern B, I only import the listdir function.  Now in my unit tests, here's what I use for pattern A:

"""Unit tests for module file_module_pattern_a"""

from file_module_pattern_a import get_files
from unittest import TestCase
from mock import patch, sentinel

class StandaloneTests(TestCase):
    """Test the standalone functions"""
    
    @patch('os.listdir')
    def test_get_files(self, mock_listdir):
        """Test the get_files function"""
        test_result = get_files(sentinel.PATH)
        mock_listdir.assert_called_once_with(sentinel.PATH)
        self.assertEqual(test_result, mock_listdir.return_value)

This works great.  The only problem is... if I use pattern B with this unit test, the mock_listdir never gets called.  The unit test tries to use the REAL os.listdir function.

Here's the issue at hand.  When I use pattern B, the from-import binds the listdir name into my own module's namespace, and that's where my code looks it up at call time, not in the os module.  As a result, the patch target needs to reference my module, not os.  Here's the correct unit test patch syntax:

"""Unit tests for module file_module_pattern_b"""

from file_module_pattern_b import get_files
from unittest import TestCase
from mock import patch, sentinel

class StandaloneTests(TestCase):
    """Test the standalone functions"""
    
    @patch('file_module_pattern_b.listdir')
    def test_get_files(self, mock_listdir):
        """Test the get_files function"""
        test_result = get_files(sentinel.PATH)
        mock_listdir.assert_called_once_with(sentinel.PATH)
        self.assertEqual(test_result, mock_listdir.return_value)