While debugging code, one of the most important factors is speed of iteration. The faster the program can be run and either succeed or fail, the faster the code can be debugged. In most cases, one only waits a few seconds before the program fails and returns an error code…
I am not into social networking websites, and when it comes to keeping in touch with people, I like to use good old email. However, as my contact list and the diversity of these contacts expand, I am having trouble following up with everybody. When networking with people met on…
I have seen many blog posts about Twitter bots, but they all seem pretty useless to me. People have this tendency to overuse things like Markov chains or other conversational agent libraries, just to make their bots look cool. But what’s the point of all that, and what problem are…
My implementation of Active Appearance Models (AAMs) in C++, called Paamela, is almost done. I am currently fixing a couple of design issues and finishing up the documentation. Even though I am still not sure whether I will make the code open source, I thought it would be nice to share what I have developed so far, in order to help other developers working on similar problems.
Counting with MapReduce seems straightforward. All that is needed is to map all the pairs to the same intermediate key, and let the reducer take care of counting the items. But wait, what if we have millions of items? Then one reducer, that is to say one process on one computer, will be forced to handle millions of pairs at once. Not only is this going to be very slow, defeating the whole point of having a cluster, but there is something more important: what if the data is too big to fit in memory? Here I show how to count elements using MapReduce in a way that truly splits up the task between multiple workers.
The one-iteration solution
Let us have a look at the solution discussed above. This solution counts the items in a dataset in only one MapReduce iteration. Note that the values are replaced with the value 1: since counting does not require keeping track of the values, they are all changed to a common, simple value to simplify computations. This solution seems pretty sweet, except that, as we can see in Figure 1, mapping all the pairs to the same intermediate key gives one reducer, and one reducer only, a huge workload for counting the items. This can be efficient if the dataset is small. But there are cases in which the dataset is so big that it does not even fit into the memory of a single computer, or it is so big that the computation on only one reducer is going to be very slow, and we need to know the count as soon as possible. As we will see in the next section, there is a way to improve workload balance along with computation time, at the cost of an additional iteration.
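To make the bottleneck concrete, here is a minimal Python sketch of this one-iteration solution. It simulates the map, shuffle, and reduce phases in a single process; the names (mapper, reducer, run_job) and the in-process driver are illustrative assumptions, not tied to any particular MapReduce framework:

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    # Replace every value with 1 and map all pairs to the same
    # intermediate key, so a single reducer receives everything.
    yield ("count", 1)

def reducer(key, values):
    # The lone reducer sums the 1s: this is the bottleneck
    # described above when the dataset grows large.
    yield (key, sum(values))

def run_job(pairs):
    # In-process simulation of the shuffle/sort between map and reduce.
    intermediate = [kv for pair in pairs for kv in mapper(*pair)]
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reducer(key, (v for _, v in group)))
    return results

if __name__ == "__main__":
    data = [("id%d" % i, "some value") for i in range(1000)]
    print(run_job(data))  # [('count', 1000)]
```

Because every mapper emits the same intermediate key, the shuffle sends all pairs to one reducer, no matter how many reducers the cluster has available.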
Simulated annealing is an optimization technique that finds an approximation of the global minimum of a function. When working on an optimization problem, a model and a cost function are designed specifically for that problem. By applying the simulated annealing technique to this cost function, an optimal solution can be found. In this article, I present the simulated annealing technique, explain how it applies to the traveling salesman problem, and perform experiments to understand how the different parameters control the details of the search for an optimal solution. I also provide an implementation in Python, along with graphical visualizations of the solutions.
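As a taste of what the technique looks like in code, here is a minimal Python sketch of simulated annealing applied to the traveling salesman problem. The parameter names (t_start, t_end, cooling) and the segment-reversal (2-opt) neighbor move are assumptions chosen for illustration, not necessarily the choices made in the full implementation:

```python
import math
import random

def tour_length(tour, cities):
    # Total length of a closed tour through the given city coordinates.
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def simulated_annealing(cities, t_start=10.0, t_end=1e-3, cooling=0.995):
    tour = list(range(len(cities)))
    random.shuffle(tour)
    cost = tour_length(tour, cities)
    best, best_cost = tour[:], cost
    t = t_start
    while t > t_end:
        # Neighbor: reverse a random segment of the tour (2-opt move).
        i, j = sorted(random.sample(range(len(cities)), 2))
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        candidate_cost = tour_length(candidate, cities)
        delta = candidate_cost - cost
        # Always accept improvements; accept worse tours with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta < 0 or random.random() < math.exp(-delta / t):
            tour, cost = candidate, candidate_cost
            if cost < best_cost:
                best, best_cost = tour[:], cost
        t *= cooling  # geometric cooling schedule
    return best, best_cost

if __name__ == "__main__":
    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(30)]
    tour, length = simulated_annealing(cities)
    print("best tour length: %.3f" % length)
```

The temperature is what distinguishes this from plain hill climbing: early on, almost any move is accepted, which lets the search escape local minima, and as the temperature cools the search gradually settles into a good solution.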