PyGotham II – Day 1

This year’s PyGotham conference was held in early June. Instead of an office building, the conference took place on two ships docked at Pier 40 in NYC. It was an unusual venue for a tech conference, but with the really nice weather it worked out great.

I attended a bunch of talks; below are short summaries of each.

Keynote – Eliot Horowitz from 10Gen

The keynote address was a talk on why one would use Python. The main reasons included fast prototyping, lots of great libraries, and the fact that it is used everywhere. To demonstrate this, the speaker quickly built a web application for adding and voting on topics that could be covered later in the keynote. He used MongoDB (of course), the Flask web framework, and Emacs. He had a basic app running in about 5 minutes. Quite cool.

Another useful Python program the speaker presented was a script that goes through his email and deletes stuff that is not relevant. This was not a spam filter, but a tool customized to its user. As the CTO of 10Gen, Eliot gets tons of emails that do not require any action but clog up his mailbox. The goal of the script was to help him manage his inbox without “getting yelled at”.

Of course, Python is not the best tool for all jobs. When you are writing large programs with a team of people, compiled languages with better type checking, etc., are better.

At the end of his talk Eliot put in a plug for New York City. In terms of a tech scene, “New York is cooler than Palo Alto”. I haven’t been to Palo Alto, but I’m sure he is right. NYC is much cooler.

Pandas – Data Analysis Toolkit

This talk was given by Wes McKinney, the author of the Pandas library and of the upcoming O’Reilly book “Python for Data Analysis”.

Pandas is an extensive library built on top of NumPy (the Numerical Python library), and it is meant to fit the same niche as the statistical language R.

To demonstrate the power of Pandas, the speaker showed how to analyze some actual data sets that anyone can obtain on the net. The most interesting one was a data set with information about the people who contribute to political campaigns for various candidates. One thing I found surprising is that retired people seem to contribute the most to politicians.

Less surprising was that someone listed his occupation as “Zombie Hunter”.
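To give a flavor of this kind of analysis, here is a minimal sketch of grouping contributions by occupation with Pandas. The records and the column names (`candidate`, `occupation`, `amount`) are made up for illustration; they are not the data set or the code shown in the talk.

```python
import pandas as pd

# Made-up toy records; the column names are illustrative only,
# not the schema of the real campaign contribution data.
contributions = pd.DataFrame({
    "candidate": ["A", "A", "B", "B", "A", "B"],
    "occupation": ["RETIRED", "ENGINEER", "RETIRED",
                   "TEACHER", "RETIRED", "ENGINEER"],
    "amount": [250.0, 100.0, 500.0, 50.0, 300.0, 75.0],
})

# Total amount contributed per occupation, largest first.
by_occupation = (
    contributions.groupby("occupation")["amount"]
    .sum()
    .sort_values(ascending=False)
)

# The same totals broken down per candidate (occupation rows, candidate columns).
by_candidate = contributions.pivot_table(
    index="occupation",
    columns="candidate",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

print(by_occupation)
print(by_candidate)
```

With the real data the same two calls would be the starting point; most of the actual work tends to be in cleaning up free-form fields such as occupation.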

Although not something I can use immediately, the Pandas library looks very interesting.

Disco – A Map/Reduce Framework

The speaker presented a different open source map/reduce framework that is an alternative to Hadoop. Whereas Hadoop is written in Java and oriented towards that language, Disco uses Python and Erlang. In particular, Disco has a native Python interface, as opposed to the stream-based interface that Hadoop offers for Python (and other scripting languages).

These sorts of frameworks usually consist of two parts: the part that manages the parallel jobs running on hundreds of machines, and the distributed file system that is used to feed data to all these processes. Disco has its own DDFS, the Disco Distributed File System, which uses the idea of tags to organize the data. Although I don’t fully comprehend how this actually works (study on my part is required), DDFS allows you to reference data without the need for excessive copying.
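To make the map/reduce model itself concrete, here is a small single-process sketch in plain Python. It is not Disco's API (the function names below are my own); it only shows the three phases that a framework like Disco or Hadoop would spread over many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["the quick fox", "the lazy dog"]))
```

In a real framework the map calls run in parallel next to the data, and the shuffle step is where the distributed file system and the network do the heavy lifting.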

This last feature was very important to the speaker’s project: DNA sequencing. It turns out that today DNA sequencers are relatively cheap, but generate tons of data, 5 to 10 terabytes per genome. The way the sequencer works is that it randomly chops the DNA strands into pieces and then individually sequences each piece.

The task of the software is to put these pieces back together into a coherent sequence. This works by oversampling (i.e. the same DNA is sequenced several times), and then the results are analyzed (see this wiki page for a description of some of the algorithms).
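As a toy illustration of the idea, here is a sketch of greedy overlap assembly: repeatedly merge the two reads with the longest suffix/prefix overlap. This is a textbook simplification and not necessarily what the speaker's software does; real assemblers also have to cope with read errors, repeated regions, and reverse complements. The function names and the `min_len` parameter are my own.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, always taking the largest overlap first."""
    # Oversampling produces duplicate reads; drop exact repeats, keep order.
    reads = list(dict.fromkeys(reads))
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Overlapping fragments of a made-up sequence, given out of order.
fragments = ["GTTAG", "ATGCGTAC", "GTACGTTA", "ATGCGTAC"]
print(greedy_assemble(fragments))
```

The nested loop compares every pair of reads, which is fine for a handful of strings but hopeless for terabytes of data, which is where a map/reduce framework comes in.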

It turns out there is a project, 1000 Genomes, that is sequencing a large number of human genomes, and its data is available publicly.

Since I have been reading about Hadoop, the Disco framework is quite interesting to me, especially since it is better suited for Python than Hadoop.

Enaml – Python based language for UIs

This was a talk on an extension to Python that creates a language called “Enaml”, which can be used to easily construct window/button UIs. Take a look here for more info.

Controlling Spam on the Web

The speaker presented an overview of how we can control spam in comments that are submitted to web-based forums. According to Rafe’s Law, “An Internet service cannot be considered successful until it has attracted spammers”, so if you are planning to build a successful website, you will need to deal with spam.

The speaker presented a Python library called Hamage Control, which can be used together with web frameworks (e.g. Django) to manage spam.
