How to Perform Data Mining Efficiently in PHP

Tags: data mining, laravel, PHP, statistics

The moment has arrived at my company: I'm working on a system that produces statistics from data gathered in our database.

How do you efficiently gather statistics from a database in a way that doesn't add too much overhead to page loads, or too much complexity to cache management?

Currently, statistics are calculated at run-time and nothing is saved or cached. The problem is that as new statistics are added, they too get calculated at run-time, so I'll eventually reach a point where the website becomes really slow, which is not acceptable.

The only idea that has come to my mind so far is to cache results whose date filters end before the day on which they are calculated.

For example, let's say I'd like to know whether a user has visited a specific page between 2017-01-01 and 2017-01-08. Since today is 2017-01-12, this result can never change in the future, because the selected date range lies entirely in the past.
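
A rough sketch of what I mean, assuming the same \Carbon and \Cache aliases my code already uses, and a made-up countPageVisits() helper standing in for the real query:

// Sketch: cache a statistic keyed by its parameters, and keep it forever when the
// requested date range ends before today, since the result can no longer change.
// countPageVisits() is only a placeholder for the actual query.
function cachedStatistic($name, $from, $to, \Closure $calculate)
{
    $key = sprintf('stats:%s:%s:%s', $name, $from->toDateString(), $to->toDateString());

    if ($to->lt(\Carbon::today())) {
        return \Cache::rememberForever($key, $calculate);
    }

    return \Cache::remember($key, 60, $calculate); // TTL is in minutes on Laravel 4.x
}

$visits = cachedStatistic(
    'page-visits:' . $user->id,
    \Carbon::parse('2017-01-01'),
    \Carbon::parse('2017-01-08'),
    function () use ($user) {
        return countPageVisits($user, '2017-01-01', '2017-01-08');
    }
);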

This is an example of how I calculate statistics in Laravel (4.x):

namespace App\Composers\Users;

use Illuminate\Support\Collection;
use User;

class ShowComposer
{
    public function compose($view)
    {
        $viewData = $view->getData();

        $view->with([
            'sellings'    => $this->getSellingStatistics($viewData['user'])
        ]);
    }

    public function getSellingStatistics(User $user)
    {
        $sellings = [];

        // Counts how many of the given user's MAVs were sold during the month
        // that lies $months months in the past.
        $getSellingsOf = function (User $user, $months) {
            $startOfMonth = \Carbon::now()->subMonths($months)->startOfMonth();
            $endOfMonth   = \Carbon::now()->subMonths($months)->endOfMonth();

            return $user
                ->mavs()
                ->whereHas('buyerProposal', function ($proposal) use ($startOfMonth, $endOfMonth) {
                    $proposal->whereBetween('sold_at', [
                        $startOfMonth, $endOfMonth
                    ]);
                })
                ->count();
        };

        // Best seller: for each of the last 12 months, find the user with the most
        // sales in that month and record their total.
        $sellings['best'] = value(function () use ($getSellingsOf) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $startOfMonth = \Carbon::now()->subMonths($month)->startOfMonth();
                $endOfMonth   = \Carbon::now()->subMonths($month)->endOfMonth();

                $query = <<<SQL
            SELECT
                id, (SELECT COUNT(*)
                    FROM `mav`
                    INNER JOIN `mav_proposals` ON `mav`.`mav_proposal_id` = `mav_proposals`.`id`
                    WHERE sold_at BETWEEN ? AND ?
                    AND mav.user_id = users.id) AS sellings
            FROM users
            ORDER BY sellings DESC
            LIMIT 1
SQL;

                $response = \DB::select($query, [
                    $startOfMonth->toDateTimeString(),
                    $endOfMonth->toDateTimeString()
                ]);

                $user = User::find($response[0]->id);

                $months[] = $getSellingsOf($user, $month);
            }

            $months = array_reverse($months);

            return $months;
        });

        // Personal: this user's own sales for each of the last 12 months.
        $sellings['personal'] = value(function () use ($user, $getSellingsOf) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $months[] = $getSellingsOf($user, $month);
            }

            $months = array_reverse($months);

            return $months;
        });

        // Global: company-wide average sales per selling user, for each of the last 12 months.
        $sellings['global'] = value(function () use ($user) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $startOfMonth = \Carbon::now()->subMonths($month)->startOfMonth();
                $endOfMonth   = \Carbon::now()->subMonths($month)->endOfMonth();

                $companySoldMavs = \App\Models\MAV::whereHas('buyerProposal',
                    function ($proposal) use ($startOfMonth, $endOfMonth) {
                        $proposal->whereBetween('sold_at', [
                            $startOfMonth, $endOfMonth
                        ]);
                    })->count();

                $usersWithSoldMavs = \User::whereHas('mavs', function ($mav) use ($startOfMonth, $endOfMonth) {
                    $mav->whereHas('buyerProposal', function ($proposal) use ($startOfMonth, $endOfMonth) {
                        $proposal->whereBetween('sold_at', [
                            $startOfMonth, $endOfMonth
                        ]);
                    });
                })->count();

                $months[] = ($usersWithSoldMavs > 0)
                    ? round($companySoldMavs / $usersWithSoldMavs)
                    : 0;
            }

            $months = array_reverse($months);

            return $months;
        });

        return $sellings;
    }
}

Now, here are the only two options I have thought of:

  • Calculate statistics every 24 hours and save them in a database.
  • Cache data based on the parameters used to gather the statistics.

The first option is quite complicated, and it would take a lot of time to develop properly.

The second option could be interesting, but I'm afraid the cache is going to give me headaches sooner or later.
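
For the second option, the simplest version I can picture is wrapping the composer method above in a cache keyed by its parameters; a sketch only (the key format and lifetime are guesses on my part):

    public function getSellingStatisticsCached(User $user)
    {
        // The statistics are broken down by whole months, so a key containing the
        // user and the current month is specific enough; the entry lives for a day.
        $key = sprintf('sellings:%d:%s', $user->id, \Carbon::now()->format('Y-m'));

        return \Cache::remember($key, 60 * 24, function () use ($user) {
            return $this->getSellingStatistics($user);
        });
    }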

Is there another way to do this efficiently? How do companies usually approach data mining? Are languages like R always used in these cases, or can PHP be just fine if used properly?

It's a new world for me, please be kind.

Best Answer

There are two problems here.

First, you need to store the results (in the database, a cache, etc.) so that the calculations don't have to be redone every time a page is loaded.

Second, you need a mechanism to redo all the calculations: you might find a flaw in an old calculation, old data may be updated or purged, or new calculations may be added.

The best way to do this is to add tables to the database that contain the calculations up to and including a given point in time (e.g. yesterday at midnight). As new calculations are added, you'll need to update those tables or add new ones.
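
As an illustration only (every table and column name here is made up), a table holding the per-user monthly selling counts from the question could be created with a migration like this:

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;

class CreateMonthlySellingStatsTable extends Migration
{
    public function up()
    {
        // Precomputed selling counts, one row per user and month. Nothing in here
        // is permanent: the table can be emptied and rebuilt at any time.
        Schema::create('monthly_selling_stats', function (Blueprint $table) {
            $table->increments('id');
            $table->integer('user_id')->unsigned();
            $table->date('month');                               // first day of the month
            $table->integer('sellings')->unsigned()->default(0);

            $table->unique(['user_id', 'month']);
        });
    }

    public function down()
    {
        Schema::drop('monthly_selling_stats');
    }
}

The "personal" series then becomes a simple lookup, and the "best" and "global" series can be derived from these rows with small aggregate queries.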

Whenever you need to, you can empty the tables and run a script that redoes the calculations. This could very well be a script that runs automatically each night.

But you will also need to be able to clear the tables and run the script manually.
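
A rough sketch of that script as a Laravel 4 Artisan command, again assuming the illustrative monthly_selling_stats table and the models from the question:

use Illuminate\Console\Command;

class RebuildSellingStats extends Command
{
    protected $name        = 'stats:rebuild';
    protected $description = 'Empty and recalculate the precomputed selling statistics.';

    public function fire()
    {
        // The table only holds data that can be regenerated, so throw it all away first.
        \DB::table('monthly_selling_stats')->truncate();

        foreach (\User::all() as $user) {
            for ($month = 0; $month < 12; $month++) {
                $start = \Carbon::now()->subMonths($month)->startOfMonth();
                $end   = \Carbon::now()->subMonths($month)->endOfMonth();

                $sellings = $user->mavs()
                    ->whereHas('buyerProposal', function ($proposal) use ($start, $end) {
                        $proposal->whereBetween('sold_at', [$start, $end]);
                    })
                    ->count();

                \DB::table('monthly_selling_stats')->insert([
                    'user_id'  => $user->id,
                    'month'    => $start->toDateString(),
                    'sellings' => $sellings,
                ]);
            }
        }

        $this->info('Selling statistics rebuilt.');
    }
}

Once registered (e.g. with Artisan::add(new RebuildSellingStats) in app/start/artisan.php), a nightly cron entry such as "0 3 * * * php /path/to/artisan stats:rebuild" covers the automatic run, and the same command can be executed by hand whenever the numbers need to be rebuilt.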

One important note is that the statistics tables are not supposed to contain permanent data; they only need to hold data that can easily be regenerated.

The idea that old data never changes is a dangerous assumption. You can have data sources that are no longer considered reliable, and there are many other reasons old data might become obsolete. Designing a system that relies on old data never changing is designing for failure; design it so that you don't have to rely on old data.
