Archive for the ‘python’ Category

You can compare two dictionaries

November 8, 2013

In Python you can compare two dictionaries. Proof:

>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 3}
>>> a == b
True
>>> b['c'] = 4
>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 4}
>>> a == b
False

(Note that comparison works between two strings and between two lists too.)
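Here is a short sketch of what == actually checks for dictionaries. It is written for Python 3.7+ (an assumption; the session above doesn't say which version it ran on), and all the variable names are just illustrative. Two dictionaries are equal when they hold the same key/value pairs; the insertion order doesn't matter, and nested values are compared too.

```python
# Dictionaries compare by content: same keys, each mapped to an equal value.
a = {'a': 1, 'c': 3}
b = {'c': 3, 'a': 1}            # same pairs, different insertion order

same_content = (a == b)         # insertion order is ignored

# Nested containers are compared element by element.
x = {'name': 'gallery', 'pages': [1, 2, 3]}
y = {'name': 'gallery', 'pages': [1, 2, 3]}
nested_equal = (x == y)

# Equality is not identity: a and b are still two separate objects.
same_object = (a is b)

# When two dicts differ, their key views help to find out where.
c = {'a': 1, 'c': 4, 'd': 5}
changed = {k for k in a.keys() & c.keys() if a[k] != c[k]}   # common keys, different values
only_in_c = c.keys() - a.keys()                              # keys missing from a
```

The key views returned by keys() support set operations such as & and -, which is handy when a plain True/False is not enough.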

Thanks to Eszter S. for the tip.

Extracting relevant images from XXX galleries using text clustering

November 8, 2013 1 comment

Warning! This post includes some links to NSFW (not suitable for work) galleries. You had better study this post at home :)


Problem
On the web you can find lots of free XXX galleries. There are also sites that collect these galleries and update their lists daily. When you visit such a gallery, you get either (1) images, or (2) links to images through thumbnails. But! Besides these relevant images, there is always some noise: banners, other thumbnails, links to other galleries, etc.

How can we write a universal scraper that takes the URL of a gallery and extracts just the relevant images, without any noise? How can we separate real content from noise?

Example
Let’s see a soft gallery: http://biertijd.xxx/index.php?itemid=44329 (NSFW!). Extracting all the images we get the following list:

  "urls": [
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
    "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg", 
    "http://biertijd.com/nucleus/plugins/rating/4.gif", 
    "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=0da28fe49a3d6d2fa7e17d15b9a05d28", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
    "http://s4.histats.com/stats/0.gif?37757&1"
  ]

As you can see, the relevant images conform to this pattern: “http://media01.biertijd.com/galleries/metart/131107_night/{01..20}.jpg”. Altogether we have 35 images, of which only 20 are relevant. How can we find just these 20?

Solution
The good news is that the relevant images usually follow a pattern, so their URLs don’t differ much. As seen above, in this example only the numbering of the images differs.

Relevant images can be separated from the others using text clustering. I found a great solution here by Rajesh M. Rajesh uses this method for clustering article titles. We will use it to cluster URLs, which are also just strings.

I put my solution in a class. Here it is:

#!/usr/bin/env python

# based on:
# http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

from helper import lev_dist as distance
from pprint import pprint

DISTANCE = 10

class Cluster(object):
    """
    Clustering a list of (sorted!) strings.

    I use it for clustering URLs. After extracting all the links (or images)
    from a web page, I use this class to group together similar URLs. It also
    identifies the largest cluster.
    """
    def __init__(self):
        self.clusters = {'clusters': {}}

    def clustering(self, elems):
        """
        Clusterize the input elements.

        Input: list of words (e.g. list of URLs). It MUST be sorted!

        Process: build a dictionary where keys are cluster IDs (int) and
                 values are lists (elements in the given cluster)
        """
        clusters = {}
        cid = 0

        for i, line in enumerate(elems):
            if i == 0:
                clusters[cid] = []
                clusters[cid].append(line)
            else:
                last = clusters[cid][-1]
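                # NOTE: the listing was damaged from this comparison down to
                # get_largest(); the "<=" test and the bookkeeping below are
                # a reconstruction based on the sample output, not the
                # original lines.
                if distance(last, line) <= DISTANCE:
                    # similar to the last element: same cluster
                    clusters[cid].append(line)
                else:
                    # too different: open a new cluster
                    cid += 1
                    clusters[cid] = [line]

        largest = self.get_largest(clusters)
        number_of_clusters = len(clusters)
        clusters['largest'] = largest
        clusters['number_of_clusters'] = number_of_clusters
        self.clusters['clusters'] = clusters

    def get_largest(self, clusters):
        """
        Return the cluster (list) that has the most elements.
        """
        if not clusters:
            return []
        maxi_k = None
        maxi_v = 0
        for k, v in clusters.items():
            if len(v) > maxi_v:
                maxi_v = len(v)
                maxi_k = k
        #
        return clusters[maxi_k]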

    def show(self):
        pprint(self.clusters)

def get_clusters(elems):
    elems = sorted(elems)
    cl = Cluster()
    cl.clustering(elems)
    return cl.clusters['clusters']

#############################################################################

if __name__ == "__main__":
    import sys
    template = "https://jabbalaci.herokuapp.com/all_images?url={url}?&clusters=1"
    if len(sys.argv) == 1:
        print "Usage: {0} URL".format(sys.argv[0])
        sys.exit(1)
    # else
    url = template.format(url=sys.argv[1])
    import requests
    r = requests.get(url)
    li = sorted(r.json()['urls'])

    cl = Cluster()
    cl.clustering(li)
    cl.show()

The extracted URLs are sorted first, then they are put in clusters. The idea is simple: put the first element in the first cluster, which is the current cluster. If the next element is similar to the last element of the current cluster, it goes into the current cluster too. If it’s different, create a new cluster (it becomes the current cluster) and add the element to it. And so on.

To tell how similar two strings are, we use the Levenshtein distance. You can find an implementation here.
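To make the two ideas above concrete, here is a minimal Python 3 sketch. The helper module imported in the class is not shown in this post, so the lev_dist below is only a stand-in (a textbook dynamic-programming version), and cluster_sorted, max_dist and the example.com URLs are names I made up for the illustration.

```python
def lev_dist(s, t):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions that turn s into t."""
    if len(s) < len(t):
        s, t = t, s              # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_sorted(elems, max_dist=10):
    """Group a sorted list of strings: an element joins the current group
    if it is close enough to the last element of that group."""
    groups = []
    for elem in elems:
        if groups and lev_dist(groups[-1][-1], elem) <= max_dist:
            groups[-1].append(elem)
        else:
            groups.append([elem])
    return groups


urls = sorted([
    "http://example.com/skins/logo.png",
    "http://media.example.com/galleries/night/01.jpg",
    "http://media.example.com/galleries/night/02.jpg",
    "http://media.example.com/galleries/night/03.jpg",
    "http://stats.example.net/0.gif?37757&1",
])
groups = cluster_sorted(urls)
largest = max(groups, key=len)   # the three numbered .jpg URLs
```

Sorting matters here: each element is only compared with the last element of the current group, so similar URLs must be neighbours in the list. The threshold of 10 mirrors the DISTANCE constant of the class; other galleries may need a different value.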

Demo
This method is implemented as a web service. It has two versions: you can cluster links, or you can cluster images. Which one to use? It depends on the gallery. If it includes the relevant images, then extract the images. If it contains thumbnails that point to images, then extract links.

Don’t forget to switch on the “text clustering” option. The output contains the clusters and, to make your life easier, the largest cluster is also indicated. In most cases this is the cluster that contains the relevant images!

Sample output:

...
"clusters": {
    "0": [
      "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=d079746fd366f6f3509532688d595fcb"
    ], 
    "1": [
      "http://biertijd.com/nucleus/plugins/rating/4.gif"
    ], 
    "2": [
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png"
    ], 
    "3": [
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg"
    ], 
    "4": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "5": [
      "http://s4.histats.com/stats/0.gif?37757&1"
    ], 
    "largest": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "number_of_clusters": 6
  }, 
...

Demo for the lazy pigs
I made a page that extracts relevant links/images from a gallery and presents them in a cleaned gallery. It’s available here: https://jabbalaci.herokuapp.com/gallery .

Usage: insert the gallery’s URL then click on the first button. If you click on an image and it’s just a thumbnail, then click on the second button.

It extracts the largest cluster and it gives good results in most cases.

Feedback is welcome.

funny Python snippet

November 5, 2013

Found here.

>>> {}['no lock']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'no lock'
Categories: fun, python

Heroku: strange client IP addresses

November 3, 2013

Problem
In Flask, you can get the client’s IP address with request.remote_addr . If you print this value on Heroku, you will get strange IP addresses that have nothing to do with the client’s IP.

Why?
It’s because your app on Heroku is behind proxies, so it sees the proxies’ IP addresses, not the real client’s IP.

Fortunately there is a fix for this problem here: http://flask.pocoo.org/docs/deploying/others/#proxy-setups. You just need to insert these two lines in the production code:

# with Werkzeug 0.15 or later: from werkzeug.middleware.proxy_fix import ProxyFix
from werkzeug.contrib.fixers import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app)
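
What such a fix does can be sketched in plain WSGI (Python 3, standard library only). This is a simplified illustration and not Werkzeug’s actual code: the class name ForwardedForFix, the trusted_proxies parameter and the addresses are made up, and it assumes that the proxy passes the client’s address in the X-Forwarded-For header.

```python
class ForwardedForFix(object):
    """Illustrative WSGI middleware (not Werkzeug's code): replace
    REMOTE_ADDR with the client address reported by the proxy."""

    def __init__(self, app, trusted_proxies=1):
        self.app = app
        self.trusted_proxies = trusted_proxies

    def __call__(self, environ, start_response):
        header = environ.get('HTTP_X_FORWARDED_FOR', '')
        addresses = [a.strip() for a in header.split(',') if a.strip()]
        if 0 < self.trusted_proxies <= len(addresses):
            # Each proxy appends the address it received the request from,
            # so count back from the right by the number of trusted proxies.
            environ['REMOTE_ADDR'] = addresses[-self.trusted_proxies]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Tiny WSGI app that echoes the address it believes the client has."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [environ['REMOTE_ADDR'].encode('ascii')]


wrapped = ForwardedForFix(demo_app)
environ = {'REMOTE_ADDR': '10.0.0.7',                 # address of the proxy
           'HTTP_X_FORWARDED_FOR': '203.0.113.9'}     # address of the client
body = wrapped(environ, lambda status, headers: None)
```

The header can be set by anyone, so it should only be trusted when the app really sits behind a proxy. That is why such a fix belongs in the production code only.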
Categories: python

Determine the dimensions of an image on the web without downloading it entirely

November 3, 2013

Problem
You have a list of image URLs and you want to do something with them. You need their dimensions (width, height) BUT you don’t want to download them completely.

Solution
I found a nice working solution here (see the bottom of the linked page).

I copy the code here for future reference:

#!/usr/bin/env python

import urllib
from PIL import ImageFile    # with the original PIL: import ImageFile

def getsizes(uri):
    # get file size *and* image size (None if not known)
    file = urllib.urlopen(uri)
    try:
        size = file.headers.get("content-length")
        if size:
            size = int(size)
        p = ImageFile.Parser()
        # feed the parser in 1 KB chunks and stop as soon as it has
        # seen enough of the file to know the dimensions
        while True:
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
        return size, None
    finally:
        file.close()

##########

if __name__ == "__main__":
    url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Baobob_tree.jpg"
    print getsizes(url)

Sample output:

(1866490, (1164, 1738))

Where the first value is the size of the file in bytes, and the second is a tuple with width and height of the image in pixels.
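
Why is a partial download enough? Image formats store the dimensions near the beginning of the file. As an illustration, here is a small Python 3 sketch that reads the size of a PNG from its first 24 bytes, using the standard library only. The function name png_size_from_header is made up, and the header is built by hand instead of being downloaded. JPEG files (like the sample above) are more involved, which is why a real parser such as ImageFile.Parser is handy.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_size_from_header(first_bytes):
    """Return (width, height) from the first 24 bytes of a PNG file,
    or None if the data does not start like a PNG."""
    if len(first_bytes) < 24 or first_bytes[:8] != PNG_SIGNATURE:
        return None
    if first_bytes[12:16] != b'IHDR':
        return None
    # Bytes 16..23: width and height as big-endian unsigned 32-bit integers.
    return struct.unpack('>II', first_bytes[16:24])


# A hand-made header of a 1164x1738 PNG: signature, chunk length (13),
# chunk type "IHDR", then width and height. No pixel data is needed.
header = (PNG_SIGNATURE + struct.pack('>I', 13) + b'IHDR'
          + struct.pack('>II', 1164, 1738))
size = png_size_from_header(header)
```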

Update (20140406)
I had to figure out the dimensions of some image files on my local filesystem. Here is the slightly modified version of the code above:

import os
from PIL import ImageFile    # with the original PIL: import ImageFile

def getsizes(fname):
    # get file size *and* image size (None if not known)
    size = os.path.getsize(fname)
    p = ImageFile.Parser()
    with open(fname, 'rb') as file:    # binary mode: image data is not text
        while True:
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
    return size, None

Usage:

size = getsizes(fname)[1]
if size:
    width, height = size
    # process it

Heroku: development and production settings

November 2, 2013

Problem
You have a project that you develop on your local machine and deploy on Heroku, for instance. The two environments require different settings. For example, you test your app with SQLite but in production you use PostgreSQL. How can the application configure itself to its environment?

Solution
I’ll show you how to do it with Flask.

In your project folder:

$ heroku config:set HEROKU=1

It will create an environment variable at Heroku. These environment variables are persistent – they will remain in place across deploys and app restarts – so unless you need to change values, you only need to set them once.

Then create a config.py file in your project folder:

import os

class Config(object):
    DEBUG = False
    TESTING = False
    DATABASE_URI = 'sqlite://:memory:'

class ProductionConfig(Config):
    """
    Heroku
    """
    REDIS_URI = os.environ.get('REDISTOGO_URL')

class DevelopmentConfig(Config):
    """
    localhost
    """
    DEBUG = True
    REDIS_URI = 'redis://localhost:6379'

class TestingConfig(Config):
    TESTING = True

Of course, you will have to customize it with your own settings.

Then, in your main file:

...
app = Flask(__name__)

if 'HEROKU' in os.environ:
    # production on Heroku
    app.config.from_object('config.ProductionConfig')
else:
    # development on localhost
    app.config.from_object('config.DevelopmentConfig')
...
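
If you want to try the selection logic without Flask and without Heroku, here is a small stand-alone sketch in Python 3. The helpers settings_from_object and load_settings are made-up names, and the production Redis address is just a placeholder. The first helper only imitates the idea behind from_object(), namely that the upper-case attributes of the class become the settings.

```python
class Config(object):
    DEBUG = False
    REDIS_URI = None


class ProductionConfig(Config):
    REDIS_URI = 'redis://production-host:6379'   # placeholder value


class DevelopmentConfig(Config):
    DEBUG = True
    REDIS_URI = 'redis://localhost:6379'


def settings_from_object(obj):
    """Collect the upper-case attributes of obj (inherited ones included)
    into a dictionary, similar in spirit to Flask's from_object()."""
    return {name: getattr(obj, name) for name in dir(obj) if name.isupper()}


def load_settings(environ):
    """Pick the config class depending on whether HEROKU is set."""
    chosen = ProductionConfig if 'HEROKU' in environ else DevelopmentConfig
    return settings_from_object(chosen)


on_heroku = load_settings({'HEROKU': '1'})   # as if HEROKU were set
on_localhost = load_settings({})             # no HEROKU variable
```

Passing the environment as a parameter (instead of reading os.environ directly) makes both branches easy to test on a local machine.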

Now, if you want to access the configuration from different files of the project, use this:

from flask import current_app as app
...
app.config['MY_SETTINGS']

Redis
Let’s see how to use Redis for instance. Apply the same idea with other databases too. Opening and closing can go in the before_request and teardown_request functions:

from flask import g
import redis

@app.before_request
def before_request():
    g.redis = redis.from_url(app.config['REDIS_URI'])

@app.teardown_request
def teardown_request(exception):
    pass    # g.redis doesn't need to be closed explicitly

If you need to access redis from other files, just import g and use g.redis .

Remove duplicates from a list AND keep the original order of the elements

October 31, 2013 2 comments

Problem
You have a list of elements. You want to remove the duplicates but you want to keep the original order of the elements too. Example:

input: apple, fruit, dog, fruit, cat, apple, dog, cat

output: apple, fruit, dog, cat

Solution

def remove_duplicates(li):
    my_set = set()
    res = []
    for e in li:
        if e not in my_set:
            res.append(e)
            my_set.add(e)
    #
    return res

The trick list(set(li)) is not acceptable in this case because elements are unordered in a set.
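
A shorter alternative, assuming Python 3.7 or later, where dictionaries are guaranteed to keep the insertion order of their keys. The function name unique_in_order is made up for this sketch. As with the set-based version above, the elements must be hashable.

```python
def unique_in_order(li):
    # dict keys are unique and, since Python 3.7, keep their insertion order
    return list(dict.fromkeys(li))


words = ['apple', 'fruit', 'dog', 'fruit', 'cat', 'apple', 'dog', 'cat']
unique = unique_in_order(words)
```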

Generate a 192-bit random number

October 4, 2013 2 comments
import os
os.urandom(24)    # length: 24 bytes, i.e. 24*8=192 bits

See the doc. here.

Formatting its output:

>>> os.urandom(24)
'\x17\x96e\x94]\xa0\xb8\x1e\x8b\xee\xdd\xe9\x91^\x9c\xda\x94\t\xe8S\xa1Oe_'
>>> os.urandom(24).encode('hex')
'cd48e1c22de0961d5d1bfb14f8a66e006cfb1cfbf3f0c0f3'
>>> int(os.urandom(24).encode('hex'), 16)
625318378251135334886162535673249000280269152689162062986L
>>> bin(int(os.urandom(24).encode('hex'), 16))
'0b10100010101001110001101011101111010000111101010010110011111101111101111010100111100000001010100100001000100101010011100001001100011000011000000101101111100001011111011101001110011010001000010'

Update (20170523)
The code above is for Python 2. Here is the Python 3 version:

>>> import os
>>> os.urandom(24)
b':\xea\x8b\xb8\xf4\x04q\xc9$\xd9B\xdf\xaf\xcer\xa0t`Q:\xab{&\xfc'
>>> os.urandom(24).hex()
'11fe838db0c5f661b09f2f7a8de5ac44395e2fdc8128d211'
>>> int(os.urandom(24).hex(), 16)
1970467794856825403422320664826041246218521986958597162907
>>> bin(int(os.urandom(24).hex(), 16))
'0b111101001000010010001110011110010001101100000010011000000001111001101111011111010111110111011110110110110001111010100100000110110111101011000100011010001111001110111001110001010101011001001001'
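
In Python 3 the detour through a hex string can be avoided. The sketch below uses only the standard library (int.from_bytes, and the secrets module that is available since Python 3.6); the variable names are illustrative.

```python
import os
import secrets

# 24 random bytes interpreted directly as one big integer
n1 = int.from_bytes(os.urandom(24), byteorder='big')

# the secrets module can produce the integer or the hex string in one call
n2 = secrets.randbits(192)
hex_token = secrets.token_hex(24)      # 24 bytes -> 48 hex digits

# bin() drops leading zeros, so its output may be shorter than 192 digits;
# format() can pad the binary form to a fixed width
fixed_width = format(n2, '0192b')
```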
Categories: python

Python testing frameworks

October 2, 2013 1 comment

The site http://pythontesting.net/start-here/ covers the following Python testing frameworks:

  • doctest
  • unittest
  • nose
  • pytest
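
To give a taste of the first two, here is a tiny Python 3 sketch that tests the same function with a doctest and with a unittest test case. The function slugify and the class SlugifyTest are made up for this illustration; they are not taken from the site above.

```python
import unittest


def slugify(title):
    """Turn a title into a lowercase, dash-separated slug.

    >>> slugify("Python testing frameworks")
    'python-testing-frameworks'
    """
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_extra_whitespace_is_collapsed(self):
        self.assertEqual(slugify("  using   virtualenv "), "using-virtualenv")
```

If you save it as a module, "python -m doctest -v yourfile.py" runs the example in the docstring and "python -m unittest yourfile" runs the test case. pytest can collect both: unittest classes are picked up by default, doctests with the --doctest-modules option.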
Categories: python

using virtualenv (Part 2)

September 21, 2013

In Part 1 we saw how to use virtualenv.

Now let’s see how to colorize the bash prompt and how to activate a virtual environment easily.

Colorize the prompt
When a virtual env. is activated, the prompt changes. However, the change is not very visible because it’s not colorized. If the name of the virtual env. is printed in a different color, you can see immediately whether an env. is activated or not.

Fortunately someone else also had this problem :) Here I found an excellent solution. I only changed the PS1 line the following way:

  PS1="${PYTHON_VIRTUALENV}${GREEN}\u@\h ${YELLOW}\w${COLOR_NONE} ${BRANCH}${PROMPT_SYMBOL} "

This way the prompt and the cursor are on one line. I made a fork; my slightly modified version is available here.

Update (20131001): I updated the script above to support light background too. Instructions are in the header in a comment.

Usage

Save the file above as ~/.bash_prompt and add the following line to the end of your ~/.bashrc:

source ~/.bash_prompt

The resulting prompt is way cooler than the default bash prompt, thus you can use it even if you don’t work with virtual environments!

See the end of the post for a screenshot.

Activating a virtual environment easily
The standard way of activating a virtual env. is to source the script “activate”:

jabba@jabba-uplink ~/python/mystuff $ ls -al
total 24
drwxrwxr-x   3 jabba jabba  4096 Sep 21 16:16 .
drwxrwxr-x 234 jabba jabba 12288 Sep 21 15:58 ..
-rwxrw-r--   1 jabba jabba    53 Sep 21 16:16 hello.py
drwxrwxr-x   6 jabba jabba  4096 Sep 21 16:13 venv
jabba@jabba-uplink ~/python/mystuff $ . venv/bin/activate
[venv] jabba@jabba-uplink ~/python/mystuff $

However, using the command “workon venv” would be much easier. I wanted to do it with an alias, but bash aliases do not accept arguments. There is a workaround: use a function instead.

Add the following lines to your ~/.bashrc:

func_workon()
{
if [[ -z "$1" ]]
then
    echo "Usage: workon <venv>"
else
    . "$1/bin/activate"
fi
}
alias workon=func_workon

alias workoff='deactivate'

Now you can activate a virtual env. much more easily:

jabba@jabba-uplink ~/python/mystuff $ ls -al
total 24
drwxrwxr-x   3 jabba jabba  4096 Sep 21 16:16 .
drwxrwxr-x 234 jabba jabba 12288 Sep 21 15:58 ..
-rwxrw-r--   1 jabba jabba    53 Sep 21 16:16 hello.py
drwxrwxr-x   6 jabba jabba  4096 Sep 21 16:13 venv
jabba@jabba-uplink ~/python/mystuff $ workon venv
[venv] jabba@jabba-uplink ~/python/mystuff $

For deactivation use the command “deactivate”, or my alias “workoff”.

Screenshot
[Screenshot: workon]

Next step
There is virtualenvwrapper, which is “a set of extensions to Ian Bicking’s virtualenv tool. The extensions include wrappers for creating and deleting virtual environments and otherwise managing your development workflow, making it easier to work on more than one project at a time without introducing conflicts in their dependencies.”

I haven’t used it yet.
