Python in Big Data

Tags: big-data, python, web-applications

Can Python be used efficiently in the big data field? To be precise, I am building a web app that analyses really big data in the medical health care field, consisting of medical history and a large amount of personal information. I need some advice on how to handle very big data in Python efficiently and with high performance. Also, are there open source Python packages that offer high performance and efficiency for big data handling?

About users and data:
Each user has about 3 GB of data. Users are grouped based on their family and friend circles, and the data is then analysed to predict important information and correlations. Currently I am talking about 10,000 users, and the number of users will be increasing rapidly.

Best Answer

That is a very vague question; there is no canonical definition of what constitutes big data. From a development point of view, the only thing that truly changes how you need to handle data is whether you have so much that you can't fit it all in memory at once.

How much of a problem that is depends greatly on what you need to do with the data. For most jobs you can use a single-pass scheme: load a block of data, do whatever needs to be done with it, unload it, and go on to the next block.
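A minimal sketch of that single-pass idea, assuming the records live in a CSV file read with pandas; the file name and the "heart_rate" column are placeholders for whatever your data actually looks like:

```python
import pandas as pd

# Single-pass scheme: read the file in fixed-size chunks, fold each chunk
# into running totals, and never hold more than one chunk in memory.
# "records.csv" and "heart_rate" are illustrative placeholders.
total = 0.0
count = 0
for chunk in pd.read_csv("records.csv", chunksize=100_000):
    total += chunk["heart_rate"].sum()
    count += len(chunk)

print("mean heart rate:", total / count)
```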

Sometimes the problem can be solved with an organization pass: first go through the data once to organize it into chunks that need to be handled together, then go through each chunk.
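One way to sketch such an organization pass in Python is to hash-partition rows into a modest number of bucket files, so that everything which belongs together lands in the same bucket; the input file name and the "user_id" column are assumptions:

```python
import csv
from pathlib import Path

# Organization pass: stream one large CSV and split its rows into bucket
# files by hashing the grouping key. Each bucket can later be loaded and
# processed on its own, and all rows for a given user share a bucket.
N_BUCKETS = 64
bucket_dir = Path("buckets")
bucket_dir.mkdir(exist_ok=True)

with open("all_records.csv", newline="") as src:
    reader = csv.DictReader(src)
    out_files = [open(bucket_dir / f"bucket_{i}.csv", "w", newline="")
                 for i in range(N_BUCKETS)]
    writers = [csv.DictWriter(f, fieldnames=reader.fieldnames)
               for f in out_files]
    for w in writers:
        w.writeheader()
    for row in reader:
        writers[hash(row["user_id"]) % N_BUCKETS].writerow(row)
    for f in out_files:
        f.close()

# Second pass: process each bucket file in turn, loading it fully if it fits.
```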

If that strategy doesn't fit your task, you can still get a long way with OS-handled disk swapping: process the data in blocks as far as possible, and if you need a little arbitrary access here and there, it will still work.
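A common way to lean on the operating system for this in Python is a memory-mapped array: only the pages you actually touch get read from disk. The file name, dtype and shape below are assumptions, not part of the original setup:

```python
import numpy as np

# Memory-mapped array: behaves like a normal ndarray, but the OS pages data
# in and out on demand instead of loading the whole file up front.
data = np.memmap("measurements.dat", dtype=np.float32, mode="r",
                 shape=(10_000, 1_000_000))

# Mostly-sequential block processing, with the occasional arbitrary access:
block_means = [data[i].mean() for i in range(data.shape[0])]
one_value = data[4_242, 123_456]   # random access still works, just slower
```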

And of course, an always excellent strategy when dealing with a lot of data is to throw hardware at it. You can get 64 GB of memory in 16 GB modules for about $500; if you are working with that much data, it is an easily justified investment. Some good SSDs are a no-brainer.

Specific case:

A big part of this job will definitely be reducing those 3 GB of data per person. It is often a bit of an art in its own right to figure out what can be thrown away, but given the amount, I must presume you have a fair amount of bulk measurements. In general you should first find patterns and aggregations in those data, and then use the results for comparing persons to one another. The majority of your raw data is noise, repetition, or irrelevant; you have to cut that away.
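As an illustration of that reduction step, here is a sketch that collapses one person's bulk measurements into a handful of aggregates, again streaming in chunks so the full 3 GB never sits in memory; the file layout and the "value" column are assumptions:

```python
import pandas as pd

def summarize_person(path: str) -> dict:
    """Reduce one person's raw measurements to a few comparable aggregates."""
    n = 0
    total = 0.0
    total_sq = 0.0
    maximum = float("-inf")
    for chunk in pd.read_csv(path, chunksize=100_000):
        values = chunk["value"]           # placeholder measurement column
        n += len(values)
        total += values.sum()
        total_sq += (values ** 2).sum()
        maximum = max(maximum, values.max())
    mean = total / n
    return {"n_records": n,
            "mean": mean,
            "variance": total_sq / n - mean ** 2,
            "max": maximum}
```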

This reduction process is well suited to a cluster, as you can simply give each process its own pile of persons.
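On a single machine the same idea looks like the sketch below with a process pool; for a real cluster, open source libraries such as Dask or PySpark provide the equivalent distributed map. The file names and the placeholder reduction function are assumptions:

```python
from multiprocessing import Pool
import pandas as pd

def summarize_person(path: str) -> dict:
    # Placeholder per-person reduction (see the fuller sketch above).
    df = pd.read_csv(path)
    return {"path": path, "mean": df["value"].mean(), "n": len(df)}

# Hypothetical file layout: one file per person.
person_files = [f"person_{i:05d}.csv" for i in range(10_000)]

if __name__ == "__main__":
    # Each worker process gets its own pile of persons to reduce.
    with Pool(processes=8) as pool:
        summaries = pool.map(summarize_person, person_files, chunksize=50)
```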

The processing thereafter is a bit trickier; what is optimal depends on a lot of factors, and you will probably have to do some trial and error. If it fits the job, try to load selected pieces of data from all persons onto the same computer and compare those, and do the same with other pieces of data on other computers. Use those results as new data sets, and so on.
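For instance, one machine might hold just one reduced feature for every person and do the cross-person comparison for that feature alone; a toy sketch, where "summaries.csv" and its columns are assumptions:

```python
import numpy as np
import pandas as pd

# One reduced feature for all persons on one machine: compare everyone on
# that feature, e.g. flag persons far from the population mean.
summaries = pd.read_csv("summaries.csv")       # one row per person (assumed)
feature = summaries["mean"].to_numpy()

z_scores = (feature - feature.mean()) / feature.std()
outliers = summaries.loc[np.abs(z_scores) > 3.0, "person_id"]
```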
