Magento 2 – How to Optimize Query for Product Quantity Bought

Tags: magento2, orders, performance, query, sql

I need to collect data about the products that are bought together with a given product in the store (in case there are any). The data should be taken from the table that stores the order items. I already have a few attempts, but all of them seem awkward; it feels like there should be a simple and elegant way to get this data.
Has anyone already faced this task? Please tell me what the best way is to collect data about products bought together in Magento 2.

As input I have an array of product IDs, obtained from the collection of all the products:

/** @var \Magento\Catalog\Model\Product $product */
$productIds = $product->getCollection()
                ->addWebsiteFilter($website->getId())
                ->getAllIds();

Then I collect the data separately for each product (by ID) in the following way:

foreach ($productIds as $productId) {
    $result = $this->collectNewDataForProduct($productId);
    // save data...
}

Inside collectNewDataForProduct() I build an array of the products that occur together with the given product in orders, along with how often they occur together. I would like to improve this query, or the whole code, in case there is a better solution.
My attempts look like this:


Query variation 1

SQL query:

SELECT
  `main_table`.`product_id`,
  (SUM(`main_table`.`product_id`) / `main_table`.`product_id`) AS `frequency`
FROM `sales_order_item` AS `main_table`
WHERE `main_table`.`order_id` IN
      (SELECT `sub_table`.`order_id`
       FROM `sales_order_item` AS `sub_table`
       WHERE `sub_table`.`product_id` = 1)
      AND
      `main_table`.`product_id` != 1
GROUP BY `main_table`.`product_id`;

PHP code for this query:

protected function collectNewDataForProduct($productId)
{
    $connection = $this->getResource()->getConnection();
    $orderItemTable = $connection->getTableName('sales_order_item');

    $select = $connection->select();
    $subSelect = clone $select;
    $subSelect->from(['sub_table' => $orderItemTable], ['order_id']);
    $subSelect->where('`sub_table`.`product_id` = ' . $productId);

    $sum = new \Zend_Db_Expr('(SUM(`main_table`.`product_id`)/`main_table`.`product_id`)');
    $select->from(['main_table' => $orderItemTable], ['product_id', 'frequency' => $sum]);
    $select->where('`main_table`.`order_id` IN (' . $subSelect->__toString() . ')');
    $select->where('`main_table`.`product_id` != ' . $productId);
    $select->group('main_table.product_id');

    $result = $connection->fetchAll($select);

    return $result;
}

Query variation 2

SQL query:

SELECT
  `sub_table`.`product_id`,
  SUM(sub_table.qty)
FROM `sales_order_item` AS `main_table`
  LEFT JOIN `sales_order_item` AS `sub_table`
    ON (main_table.order_id = sub_table.order_id AND main_table.product_id != sub_table.product_id)
WHERE main_table.product_id = 1
GROUP BY `sub_table`.`product_id`;

PHP code for this query:

$sum = new \Zend_Db_Expr('(SUM(`main_table`.`product_id`)/`main_table`.`product_id`)');
/** $this->orderItem instance of \Magento\Sales\Model\Order\Item */
$collection = $this->orderItem->getCollection()
    ->addFieldToSelect(['product_id', 'frequency' => $sum])
    ->removeFieldFromSelect('item_id');

$select = $collection->getSelect();
$connection = $collection->getConnection();
$orderItemTable = $connection->getTableName('sales_order_item');

$select->joinLeft(
    ['sub_table' => $orderItemTable],
    '`main_table`.`order_id` = `sub_table`.`order_id` AND `main_table`.`product_id` != `sub_table`.`product_id`',
    ['product_id']
);
$collection->addFieldToFilter('main_table.product_id', ['eq' => $productId]);
$select->group('sub_table.product_id');

As I've mentioned above, these variations don't seem to be the best solutions, and I would appreciate your help improving the query/code so that I can achieve my goal.


Update:

With this code I am trying to count how many times products were bought together with a chosen product (by ID). For example: how many times product A was bought together with product B, and the same for all the products that exist in the sales_order_item table.

The result looks approximately like this:

product_id_1 | product_id_2 | count
-----------------------------------
A            | B            | 10 
A            | C            | 1 

etc.

where the first and second columns are the product IDs and the third one is the number of times they were bought together.
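
For reference, the shape above could probably be produced in a single pass with a self-join over sales_order_item instead of running one query per product. This is only a rough sketch of what I have in mind, not validated against real data:

-- Sketch: for every product pair, count in how many orders they appear together.
-- a.product_id < b.product_id keeps each pair once; use != to get both directions.
SELECT
  a.product_id AS product_id_1,
  b.product_id AS product_id_2,
  COUNT(DISTINCT a.order_id) AS together_count
FROM sales_order_item AS a
  INNER JOIN sales_order_item AS b
    ON a.order_id = b.order_id
   AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id;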

Best Answer

PHP improvements

One big improvement you could make in your PHP code is to avoid using fetchAll() to fetch large result sets, as it puts a heavy demand on system and possibly network resources.

I assume you're using the $result return value in a loop, so if I inline the method your code looks like this:

$result = $connection->fetchAll($select);
foreach ($result as $entry) {
     // Do something
}

On large amounts of fetched data, this code will execute for a very long time and PHP will probably run out of memory.

To avoid that you should fetch each database row separately using the fetch() method to reduce resource consumption:

$query = $connection->query($select);
while ($row = $query->fetch()) {
    // Do something
}

SQL improvements

I'm not entirely sure I understand what you want to achieve with that query.

As my database doesn't have your data, my best suggestion would be to use the EXPLAIN keyword on both your queries:

EXPLAIN SELECT
  `main_table`.`product_id`,
  (SUM(`main_table`.`product_id`) / `main_table`.`product_id`) AS `frequency`
FROM `sales_order_item` AS `main_table`
WHERE `main_table`.`order_id` IN
      (SELECT `sub_table`.`order_id`
       FROM `sales_order_item` AS `sub_table`
       WHERE `sub_table`.`product_id` = 1)
      AND
      `main_table`.`product_id` != 1
GROUP BY `main_table`.`product_id`;

This command will help you understand how MySQL intends to execute the query and how many rows it has to process to deliver a result. Here's an easy-to-understand example:

EXPLAIN SELECT city.name, city.district FROM city, country WHERE city.countrycode = country.code AND country.code = 'IND';

+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table   | type  | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | country | const | PRIMARY       | PRIMARY | 3       | const |    1 | Using index |
|  1 | SIMPLE      | city    | ALL   | NULL          | NULL    | NULL    | NULL  | 4079 | Using where |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+

In that case, when you check the rows column you can see that the current design will require MySQL to process only one record in the country table (which is indexed) but all 4079 records in the city table (which isn't). This then suggests scope for improvement using other optimization tricks - for example, adding an index to the city table.

Using this keyword should give you a good overview of the performance bottlenecks in your queries. I suggest you run it with a lot of entries in your tables to get accurate results.

Going further: performance schema profiling

You can use the MySQL Performance Schema to measure query durations; everything is explained here: http://dev.mysql.com/doc/refman/5.6/en/performance-schema-query-profiling.html
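
In short, the procedure from that page boils down to something like the following (a condensed sketch based on the MySQL 5.6 documentation; check the linked page for the full setup):

-- Make sure statement instrumentation and history collection are enabled.
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE '%statement/%';
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME LIKE '%events_statements_%';

-- Run the query you want to profile, then read its duration.
-- TIMER_WAIT is stored in picoseconds, hence the division.
SELECT EVENT_ID,
       TRUNCATE(TIMER_WAIT / 1000000000000, 6) AS duration_seconds,
       SQL_TEXT
  FROM performance_schema.events_statements_history_long
 ORDER BY EVENT_ID DESC
 LIMIT 10;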

NB: I'm doing my calculation on a DB with around 100k entries in the order item table.

Query 1:

I get a duration of 0.611883 seconds.

Query 2:

I get a duration of 0.263017 seconds. That's more than twice as fast as the first query.

So your second query is definitely better in terms of performance.

Add an index

The bottleneck in your case is the product_id column.

With my data (100k entries in the order item table), here's what EXPLAIN gives me:

id | select_type | table                 | type | possible_keys                      | key                                | key_len | ref                                | rows  | Extra
1  | SIMPLE      | sales_flat_order_item | ALL  | NULL                               | NULL                               | NULL    | NULL                               | 96309 | Using where; Using temporary; Using filesort
1  | SIMPLE      | test                  | ref  | IDX_SALES_FLAT_ORDER_ITEM_ORDER_ID | IDX_SALES_FLAT_ORDER_ITEM_ORDER_ID | 4       | dev.sales_flat_order_item.order_id | 1     | Using where

As you can see from the rows column, MySQL has to process all my table rows (96309) to get a result set.

So let's add a very simple index just to highlight the improvement you can get:

CREATE INDEX my_custom_index ON sales_order_item(product_id);

Results

Now if I run EXPLAIN, here's what I get:

id | select_type | table                 | type | possible_keys                                      | key                                | key_len | ref                                | rows | Extra
1  | SIMPLE      | sales_flat_order_item | ref  | my_custom_index                                    | my_custom_index                    | 5       | const                              | 12   | Using temporary; Using filesort
1  | SIMPLE      | test                  | ref  | IDX_SALES_FLAT_ORDER_ITEM_ORDER_ID,my_custom_index | IDX_SALES_FLAT_ORDER_ITEM_ORDER_ID | 4       | dev.sales_flat_order_item.order_id | 1    | Using where

If we check the rows column, we get 12 instead of 100k before!

Now if we have a look at the performance, the query now runs in 0.000844 seconds.

So in my opinion, you should consider adding an index to your tables to improve the performance.