This was a panel discussion by a number of Python developers about various ways of doing things in Python. The panel included representatives from Spotify, the author of Pandas, someone from Enthought, and a few other people whose names I did not write down.
After a short introduction, the panelists answered questions from the audience. The three topics on which I took notes were:
- Packaging – for production or releases – many used Debian-based packaging. There were a few mentions of the Jenkins build server. Some people deploy directly from their version control system (e.g. Mercurial) or via TeamCity.
- I asked about testing their software – everyone uses some form of unit tests. Many people recommended “nose” for managing unit test suites (https://github.com/nose-devs/nose). “coverage.py” was mentioned for measuring how well your tests cover your code.
- A question about interfacing with C/C++ was asked. There are several packages used here, mainly SWIG and the CPython C API. Everyone really liked Cython (www.cython.org) for really nice C/C++ integration. The PyQL QuantLib bindings were also mentioned. Finally, ctypes – although ctypes is not so easy to use with C++.
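As a quick illustration of the ctypes approach mentioned above, here is a minimal sketch that calls `sqrt()` from the C math library; library lookup paths vary by platform, which is why `find_library` is used:

```python
import ctypes
import ctypes.util

# Locate and load the C math library (falls back to libc, where some
# platforms fold the math functions in).
libm = ctypes.CDLL(ctypes.util.find_library("m") or ctypes.util.find_library("c"))

# Declare the C signature of sqrt(3): double sqrt(double).
# Without this, ctypes would default to int arguments/returns.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # prints 1.4142135623730951
```

This is the appeal of ctypes – no compilation step at all – but as the panel noted, it only speaks C calling conventions, which is why C++ code usually needs Cython or SWIG instead.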
Spotify – Recommendation Engine
Turns out that Spotify is a big Python user, and Spotify people gave a bunch of interesting talks. In this talk the speaker described how the Spotify recommendation engine works.
Using machine learning techniques and data extracted from its logs, Spotify tries to find a related artist/track to play next. Spotify extracts information about which tracks each user plays and then creates a huge (!!!) matrix of all tracks vs. all users, where each cell holds a count of how many times that user played that track.
Using this data (along with skip data), Spotify slices and dices this matrix (the speaker did not go into the full mathematical details) and creates data that tells them which tracks are similar. This is the basis for recommendations.
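The talk skipped the math, but the general idea can be sketched with a toy play-count matrix and cosine similarity between track columns. This is my own illustration of the technique, not Spotify's actual algorithm, and the data is made up:

```python
import math
from collections import defaultdict

# Toy play counts: plays[user][track] = number of plays.
plays = {
    "alice": {"trackA": 5, "trackB": 3},
    "bob":   {"trackA": 4, "trackB": 1, "trackC": 2},
    "carol": {"trackC": 6},
}

# Build track -> {user: count} vectors, i.e. the columns of the
# tracks-vs-users matrix, stored sparsely.
vectors = defaultdict(dict)
for user, counts in plays.items():
    for track, n in counts.items():
        vectors[track][user] = n

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Tracks A and B share listeners, so they score as similar;
# A and C barely overlap, so they score low.
print(cosine(vectors["trackA"], vectors["trackB"]))
print(cosine(vectors["trackA"], vectors["trackC"]))
```

At Spotify's scale the matrix is far too big for this kind of in-memory computation, which is exactly why the Hadoop cluster mentioned below comes in.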
Currently Spotify uses a 40-node Hadoop cluster to munge all this data and is able to produce a new recommendation graph roughly once every two weeks. As more data is collected, a complete recalculation is done; doing an incremental/online recalculation is very difficult.
The speaker noted that they discovered it is cheaper to run your own cluster than to rent machines on Amazon EC2.
To figure out the genre of music, Spotify mines playlist names. Though many people name their playlist “My Playlist”, enough people include the music genre to make this extraction possible. Spotify never has to analyze the actual audio files.
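A toy sketch of the playlist-name mining idea – the playlist names and genre list here are made up, and the real pipeline surely involves much more cleanup:

```python
from collections import Counter

# A small, hypothetical genre vocabulary to look for in names.
GENRES = {"rock", "jazz", "metal", "pop", "classical"}

playlists = [
    "My Playlist", "my playlist 2", "Late Night Jazz",
    "Jazz for studying", "90s rock anthems", "rock!!!",
]

# Count genre keywords appearing in playlist names; generic names like
# "My Playlist" simply contribute nothing.
votes = Counter()
for name in playlists:
    lowered = name.lower()
    for genre in GENRES:
        if genre in lowered:
            votes[genre] += 1

print(votes.most_common())
```

The point is that even though most names are uninformative, the informative minority accumulates into a usable genre signal across millions of playlists.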
Spotify’s backend services use Python and Tokyo Cabinet for data storage.
Spotify – Scaling Python Services
This talk went into more detail about the system setup that Spotify uses.
The basic requirement is for Spotify to support about 10 million active users, where a user is considered “active” if he/she has played a track in the past 30 days. The expected music latency should be less than 200 ms.
There are two approaches to scaling:
- “Vertically” – that is, buy bigger and bigger machines. This has obvious limitations.
- “Horizontally” – add a lot of machines. Here we need to deal with the “split brain problem” – the case when parts of the system are down and perhaps not all data is accessible – and with hardware failures. The more machines you have, the more likely it is that one of them will fail.
The current Spotify client uses a p2p-style protocol for caching of songs.
Since Python’s GIL prevents effective multi-threading, all backend processes run just one thread.
Some Design Principles
- Use the Unix way – small programs that do one thing well. Run many instances of a service, but fewer than the number of cores available on a machine.
- When sharding data among many machines we need to consider Brewer’s CAP Theorem – you can have any two of Consistency, Availability and Partition tolerance.
- In cases when Spotify needs strong consistency, PostgreSQL is used. Data is shared among machines via memcached.
- The systemd tool is used to manage running processes.
- Spotify uses SRV DNS records for service discovery and load balancing.
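SRV records carry priority and weight fields, which is what makes them usable for load balancing and not just discovery. Here is a sketch of RFC 2782-style record selection – the records are made up, and an actual lookup would need a DNS resolver library:

```python
import random

# Hypothetical SRV records: (priority, weight, port, target).
records = [
    (10, 60, 8000, "backend1.example.com"),
    (10, 40, 8000, "backend2.example.com"),
    (20,  0, 8000, "fallback.example.com"),
]

def pick_srv(records):
    """Pick one record: lowest priority wins, then weighted random
    among records sharing that priority (RFC 2782 style)."""
    best = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best]
    total = sum(r[1] for r in candidates)
    roll = random.uniform(0, total)
    for record in candidates:
        roll -= record[1]
        if roll <= 0:
            return record
    return candidates[-1]

print(pick_srv(records))
```

With these weights, backend1 gets roughly 60% of the traffic and backend2 40%, while the fallback host is only used if both priority-10 hosts disappear from DNS.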
Finally the speaker reminded us all that “/dev/null is web scale!!!”.
The final presentation I attended was a 3-hour class on SQLAlchemy. SQLAlchemy is a very nice Python library for interfacing with relational databases. Since I’m using it in one of my projects, I decided to attend to learn more.
There are four levels to SQLAlchemy, and you can work at any one of these levels.
- Level 1: Includes the Engine class, which uses a Python DBAPI driver to talk to the database.
- Level 2: Table meta data – facilities to examine and manipulate the database schema.
- Level 3: SQL expressions.
- Level 4: ORM – object relational mapping.
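Level 1 wraps a plain DBAPI driver. For reference, here is what raw DBAPI usage looks like with the stdlib sqlite3 module – this is a sketch of the interface SQLAlchemy’s Engine builds on (connection pooling, dialect handling), not SQLAlchemy code itself:

```python
import sqlite3

# sqlite3 implements the Python DBAPI (PEP 249); SQLAlchemy's Engine
# manages connections like this one behind the scenes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.commit()

rows = conn.execute("SELECT id, name FROM users").fetchall()
print(rows)  # [(1, 'alice')]
conn.close()
```

The higher levels trade this hand-written SQL for table metadata, composable expression objects, and finally mapped Python classes.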
This tutorial was presented by the author of SQLAlchemy and was full of useful information. In fact, the speaker used up all his time – speaking without a break for 3 hours – so there was little time for questions.