Posted: November 30th, 2011 | Author: Giv | Filed under: Django, MongoDB, Python, Tutorials | No Comments »
Traditional relational databases (MySQL, PostgreSQL, etc.) and NoSQL systems are not mutually exclusive. I have several Django applications that are happily using MySQL. If your site is not scaling because of your database, you are doing it wrong! NoSQL will not help you until you start caching some of those expensive queries using something like Memcached.
I use MongoDB alongside MySQL for all the dirty work like storing stats for later processing. There’s no point in polluting MySQL with this sort of data, especially when you’re dealing with millions of entries.
This post is intended for absolute beginners who use Django traditionally and are curious about how they can integrate a secondary storage service into their apps. I’m assuming you have already installed MongoDB on your dev environment. You will also need to install the MongoEngine library for Python.
Let’s start.
You already know how to create data models in Django, but let’s say we want to store an activity feed for your users every time they do something on your site. We begin by creating a data model with MongoEngine, much like you would with Django’s ORM. The difference here is that you don’t need to run “syncdb” to create your tables. Mongo’s collections (similar to SQL tables) are schemaless, so these models can be changed without having to worry about running migration scripts.
Let’s create a simple collection for storing user activities. Create a file where you normally keep your Django models and call it mongomodel.py:
    from mongoengine import *

    # connect to a db (no need to create this - it will be created automagically)
    connect('useractivity')

    class Author(Document):
        pk = IntField()
        name = StringField(max_length=200, required=True)

    class Activity(Document):
        message = StringField(max_length=200)
        author = ReferenceField(Author, reverse_delete_rule=CASCADE)
“What’s this??!! Django already has a User model, why do I need another in Mongo?” Well, you don’t, but say you want your activity to say something like: “Joe uploaded a photo” and you want Joe’s name to be linked to his profile page. We keep a reference to his MySQL id in case we need to look up other info or construct a URL.
You’ll also notice in the Activity model we are referencing the Author model. This is like a foreign key that will allow us to create relationships, similar to SQL. The CASCADE option will make sure if the user is deleted, all activities are also cleared out.
Ok, let’s start using this puppy! Using the example above we want to create an activity for Joe next time he uploads a photo. First, import mongomodel.py whenever you’re planning to interact with Mongo. In my photo upload view function I will create an activity like so:
    # After photo upload is complete
    from main.mongomodel import *

    # first create a user object - you can grab data from request object
    the_author = Author(pk=request.user.id, name=request.user.first_name)
    the_author.save()

    # now create the activity
    activity = Activity(message='uploaded a new photo', author=the_author)
    activity.save()
That’s it. If you decide later you also want to add the name of the file uploaded you can simply add a new field to your Activity model and it will just work, plus it will be backwards compatible, i.e. older records without this field will not complain. Lovely.
Displaying the activity is just as simple. In your view function pull out the record and push down to your template:
    from main.mongomodel import *

    # get all activities
    activities = Activity.objects

    # push down to template
    return render_to_response('activities.html', {'activities': activities})
Now in your template loop and output like any other model:
    <ul>
    {% for a in activities %}
        <li><a href="{% url main.views.profile a.author.pk %}">{{ a.author.name }}</a> {{ a.message }}</li>
    {% endfor %}
    </ul>
I’ve used the user’s MySQL primary key to construct his profile URL.
This is a very basic example but hopefully you can see the advantage of offloading some of the data storage to Mongo. You may ask “but what if the user changes his name? won’t the data in the activity remain out of sync?”. Yes, it will, but you can very easily add a simple method in your Django user model to update Mongo records whenever the user’s details are updated.
Good luck.
Posted: August 1st, 2011 | Author: Giv | Filed under: Python | No Comments »
This is a short post. I spent too long working this out so hopefully this post will help a future Google search.
If you’re using the Boto Python wrapper for the Amazon S3 service, you can quickly generate temporary URLs for your private files.
    from boto.s3.connection import S3Connection

    s3 = S3Connection('YOUR_KEY', 'YOUR_SECRET', is_secure=False)
    url = s3.generate_url(60, 'GET', bucket='YOUR_BUCKET', key='YOUR_FILE_KEY', force_http=True)
This will give you a URL to your private file on S3 that will only work for 60 seconds. It will look something like this:
http://mycoolbucket.s3.amazonaws.com/myfile.jpg?
Signature=ABC123DEF456&
Expires=1312216031&
AWSAccessKeyId=ABCDEFGHIJKLMNOP
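If you’re wondering where those three parameters come from, here is a rough sketch of how such a URL can be built by hand. It assumes the older AWS “signature version 2” query-string scheme, where the signature is a base64-encoded HMAC-SHA1 of a short string to sign. The helper name `sign_s3_url` and the `now` parameter are my own inventions for illustration; in real code you should keep using Boto’s `generate_url`.

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote


def sign_s3_url(access_key, secret_key, bucket, key, expires_in, now=None):
    """Build a signature-version-2 style query-string URL for a GET request.

    Hypothetical helper for illustration only, not part of Boto.
    """
    # Expires is an absolute Unix timestamp, not a duration.
    expires = int(time.time() if now is None else now) + expires_in
    # String to sign: verb, Content-MD5, Content-Type, expiry, resource path.
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, quote(key))
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    # The base64 signature may contain '+', '/' and '=', so URL-encode it.
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return ("http://%s.s3.amazonaws.com/%s?Signature=%s&Expires=%d&AWSAccessKeyId=%s"
            % (bucket, quote(key), signature, expires, access_key))


# `now` is pinned here so the expiry matches the sample URL above.
url = sign_s3_url("YOUR_KEY", "YOUR_SECRET", "YOUR_BUCKET", "YOUR_FILE_KEY",
                  expires_in=60, now=1312215971)
print(url)
```

Because the expiry time is part of the signed string, nobody can extend the lifetime of the link by editing the `Expires` value in the URL.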
Posted: June 13th, 2011 | Author: Giv | Filed under: Python | 3 Comments »
The project I’m currently working on requires cropping of hundreds of portraits from the First World War archives at the Imperial War Museum.
Running a batch script on a directory of images is straightforward, except my script is pretty dumb and tries to do a centre crop to create a square image. Unfortunately some of these images are not suitable for centre cropping:

Some of these portraits are quite long in height so a centre crop often results in the decapitation of the subject!
The logical thing to do here is to have your script first detect where the face is and then make a more intelligent crop to ensure the face remains in the new image. But surely face recognition requires super computers and several PhDs? Yes, it does. But we don’t really care who the subject is, we just need to know where the face is (or at least something that looks like a face). What we need is face detection, not recognition.
I was surprised to come across this little beauty: OpenCV, an open library for vision processing and luckily there’s a nice Python binding for it.
I tried out a sample from Robert Martin McGuire’s blog and was amazed at how simple and effective it was.
Robert’s script spits out two coordinates, the top-left and bottom-right corners of a rectangle around the face. If your image has more than one person in it (or things that look like faces – more on that later) it will return one pair of coordinates for each face.
Here’s the same image after running it through our face detection script:

Perfect! Now we can adjust our cropping script to ensure that the face is within the bounds.
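As a rough idea of what that adjustment could look like, here is a minimal sketch in plain Python. It assumes the face comes in as the `(x1, y1, x2, y2)` rectangle printed by the detection script; the function name `square_crop_box` is made up for this example and is not part of Robert’s script or of OpenCV.

```python
def square_crop_box(img_w, img_h, face):
    """Return (left, top, right, bottom) of the largest square crop,
    centred on the face as far as the image edges allow."""
    x1, y1, x2, y2 = face
    side = min(img_w, img_h)
    # Centre of the detected face rectangle.
    cx = (x1 + x2) // 2
    cy = (y1 + y2) // 2
    # Centre the square on the face, then clamp it inside the image.
    left = min(max(cx - side // 2, 0), img_w - side)
    top = min(max(cy - side // 2, 0), img_h - side)
    return (left, top, left + side, top + side)


# A tall 400x1000 portrait with the face near the top of the frame.
box = square_crop_box(400, 1000, (120, 80, 280, 240))
print(box)  # (0, 0, 400, 400)
```

A plain centre crop of the same image would be `(0, 300, 400, 700)` and miss the face completely. The tuple is in the same left, upper, right, lower order that PIL’s `Image.crop` expects. If the face is bigger than the square itself, no crop can contain it, so you would need to scale the image down first.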
I tried this using really high resolution images and the script detected several faces in the image where there was only one. The problem is that if you have a lot of detail in your image like background artifacts and smudges there is likely to be some pattern that matches those of a face. For best results you may want to work with smaller images.
You can get this script from Robert’s site but here it is for all you lazy people. Make sure you’ve installed all necessary libraries. On Debian/Ubuntu you should be able to use this:
$ sudo apt-get install python-opencv libcv-dev python-imaging
Test out the script like this:
$ python thescript.py original.jpg output.jpg
If you get errors, chances are it’s not finding the XML files. I had to copy these manually to get it to work. Note: this script doesn’t do any cropping, it just shows you where the face is and you will need to do the cropping yourself with some trial and error.
    import os
    import sys
    from opencv.cv import *
    from opencv.highgui import *
    import Image, ImageDraw

    def print_rectangle(x1, y1, x2, y2):  # function to modify the img
        im = Image.open(sys.argv[1])
        draw = ImageDraw.Draw(im)
        draw.rectangle([x1, y1, x2, y2])
        im.save(sys.argv[2])

    def detectObjects(image):
        """Converts an image to grayscale and prints the locations of any
        faces found"""
        grayscale = cvCreateImage(cvSize(image.width, image.height), 8, 1)
        cvCvtColor(image, grayscale, CV_BGR2GRAY)

        storage = cvCreateMemStorage(0)
        cvClearMemStorage(storage)
        cvEqualizeHist(grayscale, grayscale)
        cascade = cvLoadHaarClassifierCascade(
            '/usr/share/opencv/haarcascade/haarcascade_frontalface_alt.xml',
            cvSize(1, 1))
        faces = cvHaarDetectObjects(grayscale, cascade, storage, 1.2, 2,
                                    CV_HAAR_DO_CANNY_PRUNING, cvSize(50, 50))

        if faces.total > 0:
            for f in faces:
                x1, y1, x2, y2 = f.x, f.y, f.x + f.width, f.y + f.height
                print("[(%d,%d) -> (%d,%d)]" % (f.x, f.y, f.x + f.width, f.y + f.height))
                print_rectangle(x1, y1, x2, y2)  # call to a python pil

    def main():
        image = cvLoadImage(sys.argv[1])
        detectObjects(image)

    if __name__ == "__main__":
        main()
Posted: July 22nd, 2010 | Author: Giv | Filed under: Python, Tutorials | 2 Comments »
There seem to be a lot of developers who like the idea of Test-Driven Development (TDD) and can clearly see the benefit of having tests written for their code but can’t seem to get their head around the process. How do you start writing unit tests before writing the actual code?
Let’s start with an example. You want to write a method that takes a URL as an argument and have it tell you through a boolean return if it’s the correct domain or not. It seems simple enough. Just write your method, pass the URL through some regular expression and you’re done.
But you already know which domains are allowed and which are not, so before writing the actual code you can run some tests in your head. E.g. I only want www.bbc.co.uk and its sub-domains on http and https. Nothing else should be allowed. So https://beta.bbc.co.uk/iplayer should return TRUE and http://www.bbcbb.com should return FALSE etc.
The process behind TDD is that you first write a failing test. Then you write the actual code and adjust until the test passes.
So let’s write some tests for our domain checker. I’m using Python and its unittest module here:
    import unittest

    # this is what we're going to be testing
    class Utils():
        def is_bbc(self, val):
            pass  # placeholder

    # this is the actual test
    class TestUtils(unittest.TestCase):
        def setUp(self):
            self.u = Utils()

        def test_is_bbc(self):
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk/iplayer'))
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk/food'))
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk'))
            self.assertTrue(self.u.is_bbc('https://www.bbc.co.uk'))
            self.assertTrue(self.u.is_bbc('http://beta.bbc.co.uk'))

            self.assertFalse(self.u.is_bbc('http://www.bbc.com'))
            self.assertFalse(self.u.is_bbc('http://www.bbbc.co.uk'))
            self.assertFalse(self.u.is_bbc('http://.bbc.co.uk'))

    suite = unittest.TestLoader().loadTestsFromTestCase(TestUtils)
    unittest.TextTestRunner(verbosity=2).run(suite)
I’ve created an empty method where our domain checker is going to live but as you can see it doesn’t do anything yet. The tests should immediately make sense. We pass in a bunch of domain variations and we know which ones should be accepted and which rejected. Naturally, running the test right now will fail:
$ python sample.py
test_is_bbc (__main__.TestUtils) ... FAIL
======================================================================
FAIL: test_is_bbc (__main__.TestUtils)
----------------------------------------------------------------------
Traceback (most recent call last):
File "sample.py", line 14, in test_is_bbc
self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk/iplayer'))
AssertionError
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (failures=1)
Now you can start writing the actual code and keep running the same tests until it passes:
    import unittest
    import re

    # this is what we're going to be testing
    class Utils():
        def is_bbc(self, val):
            return re.match(r'^https?://([^/]+)?\.bbc\.co\.uk', val)

    # this is the actual test
    class TestUtils(unittest.TestCase):
        def setUp(self):
            self.u = Utils()

        def test_is_bbc(self):
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk/iplayer'))
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk/food'))
            self.assertTrue(self.u.is_bbc('http://www.bbc.co.uk'))
            self.assertTrue(self.u.is_bbc('https://www.bbc.co.uk'))
            self.assertTrue(self.u.is_bbc('http://beta.bbc.co.uk'))

            self.assertFalse(self.u.is_bbc('http://www.bbc.com'))
            self.assertFalse(self.u.is_bbc('http://www.bbbc.co.uk'))
            self.assertFalse(self.u.is_bbc('http://.bbc.co.uk'))  # this should fail

    suite = unittest.TestLoader().loadTestsFromTestCase(TestUtils)
    unittest.TextTestRunner(verbosity=2).run(suite)
The regex may look like it’s correct but running the test will fail again:
$ python sample.py
test_is_bbc (__main__.TestUtils) ... FAIL
======================================================================
FAIL: test_is_bbc (__main__.TestUtils)
----------------------------------------------------------------------
Traceback (most recent call last):
File "sample.py", line 22, in test_is_bbc
self.assertFalse(self.u.is_bbc('http://.bbc.co.uk'))
AssertionError
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (failures=1)
It has failed on the final assert because our code will also allow http://.bbc.co.uk and we obviously don’t want that. But as you can see we’ve caught this edge case before deploying our app so we can promptly fix our code.
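One possible fix, as a sketch rather than the definitive answer: require at least one non-empty sub-domain label in front of bbc.co.uk and make sure the host name actually ends there. The pattern below is my own suggestion, and it assumes that the bare http://bbc.co.uk (no sub-domain at all) should be rejected, which the tests above don’t say either way. It is written as a standalone function and returns a real boolean instead of a match object.

```python
import re

# At least one non-empty label (e.g. "www." or "beta.") before bbc.co.uk,
# and the host must end there: followed by a path, port, query or nothing.
BBC_URL = re.compile(r'^https?://(?:[A-Za-z0-9-]+\.)+bbc\.co\.uk(?:[/:?#]|$)')


def is_bbc(val):
    """Return True only for http(s) URLs on a sub-domain of bbc.co.uk."""
    return bool(BBC_URL.match(val))


for url in ('http://www.bbc.co.uk/iplayer', 'http://.bbc.co.uk'):
    print(url, is_bbc(url))
```

With this pattern all eight asserts in the test pass, and it also rejects lookalikes such as http://www.bbc.co.uk.example.com, which would be worth adding to the test as another case.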
Hopefully this example demonstrates why it’s a good idea to start with tests. This is obviously a simple example but on bigger projects predicting the outcome of your system can save you a lot of debugging time in the future.
Posted: February 14th, 2010 | Author: Giv | Filed under: Google App Engine, Objective-c, Python | 7 Comments »
After almost a year of messing around with various iPhone development alternatives such as PhoneGap and Titanium, I finally decided to learn Objective-C and do it all properly. I actually think those other frameworks are brilliant as they allow you to use familiar languages like JavaScript to quickly create nice apps for both iPhone and Android. But since they rely heavily on the web view element for loading HTML, creating sophisticated apps like Skype would be impossible.
So I set out to create an app for the iPhone with Objective-C. My app is pretty simple. It basically pulls in RSS news, audio podcast and video podcast feeds into a UITableView list, allowing the user to read, listen and watch news stories from the Democracy Now! website.

I managed to put together the app pretty quickly but I ran into a lot of issues when I tried to parse and massage the XML data. For starters, Cocoa does not have native support for regular expressions (but there are several external libraries). I wanted to clean up the content I was getting back before displaying it to the user but I soon realised that something that would normally take me a few minutes in Python/PHP/JavaScript would take a lot longer in Objective-C. Parsing XML using NSXMLParser was an absolute nightmare and extremely slow. I rarely work with XML these days and find JSON a much easier format to deal with. I even tested the app with some sample JSON data using the excellent json-framework library and it was much easier and faster. Alas, I only had RSS feeds to work with.
The other problem I ran into was slow HTTP requests. It would sometimes take up to 20 seconds just to load the first screen. This was due to a combination of slow connection speeds, long response times from the data provider and a slow XML parser.
The solution I came up with was to do as little as possible in the phone app as far as the data was concerned. I decided to use Google App Engine to fetch the data from the source, parse, rejig, massage and beautify in Python, then serialise and return the results in JSON to the phone app to use.
It may sound like this would increase response times even more since the phone would have to first call GAE, then GAE would need to call the data source and then send everything all the way back to the phone. This is true; however, once the data is with GAE we have the luxury of using memcache and datastore. The RSS and podcast feeds are updated once a day so there’s no reason to request the data from the source every time the user loads the app, because each time we would have to make the HTTP call, parse the data and load it up. This is extremely slow and unnecessary. We can just make one request a day, then parse, clean up and cache the results for the next user that requests it.
So the app only talks to GAE. GAE first checks memcache to see if we have a cached version. If we don’t, it will make the HTTP call, fetch the data, parse, serialise, cache and return results. If we do have a cached version, there’s nothing else to do but to return the data. A cron job will also run every 24 hours to make sure memcache is up to date.

If you really want a solid and reliable app, you need to think about all the edge cases also. What happens if the cache expires and the data provider’s website is down? At that exact moment a user loads the app only to get an error message saying there’s nothing to show. An unlikely scenario but not impossible. So the way I got around this issue was to store the serialised JSON output in GAE’s datastore as well. We always use the data from memcache but should memcache be empty and the data source down, we can switch over to the datastore and load yesterday’s content instead. Not ideal but better than having a broken app.
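The whole flow fits in a few lines of Python. The sketch below is not my actual GAE code: `TTLCache` is a tiny in-memory stand-in for memcache, a plain dict plays the part of the datastore, and `get_feed`, `fetch` and the `now` parameter are names invented for this example so that it runs anywhere.

```python
import json
import time


class TTLCache:
    """Tiny in-memory stand-in for memcache: values expire after `ttl` seconds."""

    def __init__(self):
        self._items = {}

    def set(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._items[key] = (value, now + ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        value, expires = self._items.get(key, (None, 0))
        return value if now < expires else None


def get_feed(key, fetch, cache, store, ttl=24 * 60 * 60, now=None):
    """Return serialised JSON for `key`: cache first, then source, then backup."""
    cached = cache.get(key, now)
    if cached is not None:
        return cached
    try:
        payload = json.dumps(fetch())  # fetch, parse and serialise
    except Exception:
        # Source is down: fall back to the last good copy, if there is one.
        return store.get(key)
    cache.set(key, payload, ttl, now)
    store[key] = payload  # durable copy for when the source is down
    return payload


def broken_fetch():
    raise IOError("data provider is down")


cache, store = TTLCache(), {}
first = get_feed("news", lambda: [{"title": "Headline"}], cache, store, now=0)
# More than a day later: the cache entry has expired and the source is down.
later = get_feed("news", broken_fetch, cache, store, now=90000)
print(first == later)  # True
```

The second call returns yesterday’s content from the backup store instead of an error. If there has never been a successful fetch, `get_feed` returns `None` and the app has to show its error message after all.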
This is a bit of an overkill for such a simple app but it’s super fast and efficient and will work well for almost any app that relies on 3rd-party APIs. To be fair, it was my lack of experience with Objective-C that led me to using GAE. I feel much more comfortable in Python than Objective-C and I’m sure an experienced Cocoa developer would have no problems parsing and massaging data in the app itself.
Of course there is one other edge case – Google App Engine could go down or worse, the interwebz could break. In which case, a simple error message will suffice.
You can download the Democracy Now! app on iTunes.