I need to collect data about the products that are bought together with each existing product in the store (if there are any). The data should come from the table that stores the order items. I already have some working drafts, but all of them seem awkward. It seems there should be a simple and elegant way to get this data.
Has anyone already faced this task? What is the best way to collect data about products bought together in Magento 2?
As input I have an array of product IDs, obtained from the collection of all products:
/** @var \Magento\Catalog\Model\Product $product */
$productIds = $product->getCollection()
    ->addWebsiteFilter($website->getId())
    ->getAllIds();
Then I collect the data for each product (by ID) separately, in the following way:
foreach ($productIds as $productId) {
    $result = $this->collectNewDataForProduct($productId);
    // save data...
}
In the collectNewDataForProduct method I build an array of the products that appear in the same orders as this product, together with how often they appear. I would like to improve this query, or the code as a whole if there is a better solution.
My attempts look like this:
Query variation 1
SQL query:
SELECT
    `main_table`.`product_id`,
    (SUM(`main_table`.`product_id`) / `main_table`.`product_id`) AS `frequency`
FROM `sales_order_item` AS `main_table`
WHERE `main_table`.`order_id` IN
    (SELECT `sub_table`.`order_id`
     FROM `sales_order_item` AS `sub_table`
     WHERE `sub_table`.`product_id` = 1)
  AND `main_table`.`product_id` != 1
GROUP BY `main_table`.`product_id`;
PHP code for this query:
protected function collectNewDataForProduct($productId)
{
    $connection = $this->getResource()->getConnection();
    $orderItemTable = $connection->getTableName('sales_order_item');
    $select = $connection->select();
    $subSelect = clone $select;
    $subSelect->from(['sub_table' => $orderItemTable], ['order_id']);
    $subSelect->where('`sub_table`.`product_id` = ' . $productId);
    $sum = new \Zend_Db_Expr('(SUM(`main_table`.`product_id`)/`main_table`.`product_id`)');
    $select->from(['main_table' => $orderItemTable], ['product_id', 'frequency' => $sum]);
    $select->where('`main_table`.`order_id` IN (' . $subSelect->__toString() . ')');
    $select->where('`main_table`.`product_id` != ' . $productId);
    $select->group('main_table.product_id');
    $result = $connection->fetchAll($select);
    return $result;
}
Query variation 2
SQL query:
SELECT
    `sub_table`.`product_id`,
    SUM(sub_table.qty)
FROM `sales_order_item` AS `main_table`
LEFT JOIN `sales_order_item` AS `sub_table`
    ON (main_table.order_id = sub_table.order_id AND main_table.product_id != sub_table.product_id)
WHERE main_table.product_id = 1
GROUP BY `sub_table`.`product_id`;
PHP code for this query:
$sum = new \Zend_Db_Expr('(SUM(`main_table`.`product_id`)/`main_table`.`product_id`)');
/** $this->orderItem instance of \Magento\Sales\Model\Order\Item */
$collection = $this->orderItem->getCollection()
    ->addFieldToSelect(['product_id', 'frequency' => $sum])
    ->removeFieldFromSelect('item_id');
$select = $collection->getSelect();
$connection = $collection->getConnection();
$orderItemTable = $connection->getTableName('sales_order_item');
$select->joinLeft(
    ['sub_table' => $orderItemTable],
    '`main_table`.`order_id` = `sub_table`.`order_id` AND `main_table`.`product_id` != `sub_table`.`product_id`',
    ['product_id']
);
$collection->addFieldToFilter('main_table.product_id', ['eq' => $productId]);
$select->group('sub_table.product_id');
As I said above, these variations do not seem to be the best solutions, and I would like help improving the query or the code to achieve my goal.
Update:
With this code I am trying to count how many times each product was bought together with the chosen product (by ID). For example: how many times product A was bought with product B. The same applies to all the products that exist in the sales_order_item table.
The result looks approximately like this:
product_id_1 | product_id_2 | count
-----------------------------------
A            | B            | 10
A            | C            | 1
etc.
where the first and second columns are the product IDs and the third is the number of times the two products were bought together.
Best Answer
PHP improvements
One big improvement you could make to your PHP code would be to avoid using fetchAll to fetch large result sets, as it places a heavy demand on system and possibly network resources. I assume you then iterate over the $result return value in a loop. On large amounts of fetched data, this code will execute for a very long time and PHP will probably run out of memory.
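The contrast between reading a whole result set at once and reading it row by row can be sketched as follows. This is only an illustration under stated assumptions: it uses plain PDO with an in-memory SQLite table standing in for sales_order_item so that it runs on its own, and the sample orders are made up. In Magento, as far as I know, the adapter's query() method returns a statement object that offers the same fetch() call.

```php
<?php
// Assumption: an in-memory SQLite table stands in for sales_order_item so the
// sketch runs on its own (requires the pdo_sqlite extension).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE sales_order_item (order_id INTEGER, product_id INTEGER)');

// Made-up data: order 1 contains products 1, 2 and 3; order 2 contains products 1 and 2.
$insert = $pdo->prepare('INSERT INTO sales_order_item (order_id, product_id) VALUES (?, ?)');
foreach ([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2]] as $item) {
    $insert->execute($item);
}

// Same shape as query variation 1, with COUNT(*) used as the frequency.
$sql = 'SELECT product_id, COUNT(*) AS frequency
        FROM sales_order_item
        WHERE order_id IN (SELECT order_id FROM sales_order_item WHERE product_id = ?)
          AND product_id != ?
        GROUP BY product_id
        ORDER BY product_id';

// fetchAll(): every row is materialised in one PHP array before the loop starts.
$statement = $pdo->prepare($sql);
$statement->execute([1, 1]);
$allRows = $statement->fetchAll(PDO::FETCH_ASSOC);

// fetch(): one row per call, so PHP holds only the current row as an array.
$statement = $pdo->prepare($sql);
$statement->execute([1, 1]);
while (($row = $statement->fetch(PDO::FETCH_ASSOC)) !== false) {
    // save data for this row...
    printf("product %d: %d\n", $row['product_id'], $row['frequency']);
}
```

With only a handful of rows the two styles behave the same; the difference only matters once the result set is large.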
To avoid running out of memory, fetch each database row separately using the fetch() method, which reduces resource consumption.
SQL improvements
I'm not entirely sure I understand what you want to achieve with that query.
As my database doesn't have your data, my best suggestion would be to use the EXPLAIN keyword on both of your queries. This command helps you understand how MySQL intends to execute the query and how many rows it needs to process to deliver a result. As an easy-to-understand example, consider the EXPLAIN output for a query involving a country table and a city table.
In that case, when you check the rows column you can see that the current design will require MySQL to process only one record in the country table (which is indexed) but all 4079 records in the city table (which isn't). This then suggests scope for improvement using other optimization tricks - for example, adding an index to the city table.
Using this keyword should give you a good overview of the performance bottlenecks in your queries. I suggest you run them with a lot of entries in your tables to get accurate results.
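As a concrete illustration, EXPLAIN is simply placed in front of the statement you want to inspect. The sketch below applies it to the second query from the question; the literal product ID 1 is taken from that query, and the plan you get back will depend on your own data and indexes.

```sql
-- Prefixing the statement with EXPLAIN returns the execution plan
-- instead of the result set.
EXPLAIN
SELECT
    `sub_table`.`product_id`,
    SUM(`sub_table`.`qty`)
FROM `sales_order_item` AS `main_table`
LEFT JOIN `sales_order_item` AS `sub_table`
    ON (`main_table`.`order_id` = `sub_table`.`order_id`
        AND `main_table`.`product_id` != `sub_table`.`product_id`)
WHERE `main_table`.`product_id` = 1
GROUP BY `sub_table`.`product_id`;
```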
Going further: schema performance profiling
You can use the MySQL Performance Schema to measure how long each query takes; everything is explained here: http://dev.mysql.com/doc/refman/5.6/en/performance-schema-query-profiling.html
NB: I'm doing my measurements on a DB with around 100k entries in the order item table.
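For reference, the procedure on that page boils down to roughly the following statements. This is a sketch recalled from the MySQL 5.6 documentation, not a copy of what was run for the timings reported here, and the LIKE pattern used to find the statement in the history is a placeholder to adapt.

```sql
-- Enable statement instrumentation and the statement-history consumers.
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';

UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE 'events_statements_%';

-- Run the query to be measured here, then look it up in the history.
-- TIMER_WAIT is expressed in picoseconds, so dividing by 10^12 gives seconds.
SELECT EVENT_ID,
       TRUNCATE(TIMER_WAIT / 1000000000000, 6) AS duration_seconds,
       SQL_TEXT
FROM performance_schema.events_statements_history_long
WHERE SQL_TEXT LIKE '%sales_order_item%';
```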
Query 1:
I get a duration of 0.611883 seconds.
Query 2:
I get a duration of 0.263017 seconds. That is roughly 2.3 times faster than the first query.
So your second query is definitely better in terms of performance.
Add an index
The bottleneck in your case is the product_id column.
With my data (100k entries in the order item table), the rows column of the EXPLAIN output shows that MySQL has to process all of my table rows (96309) to get a result set.
So let's add a very simple index on that column, just to highlight the improvement you can get.
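A minimal sketch of such an index, assuming the column in question is product_id on sales_order_item, as in the queries from the question. The index name is made up for illustration, and the table name may carry a prefix in your installation.

```sql
-- IDX_CROSS_SELL_PRODUCT_ID is a hypothetical name; pick one that fits your conventions.
ALTER TABLE `sales_order_item`
    ADD INDEX `IDX_CROSS_SELL_PRODUCT_ID` (`product_id`);
```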
Results
Now if I run EXPLAIN again and check the rows column, I get 12 instead of the roughly 100k from before!
Looking at the performance, the query now runs in 0.000844 seconds.
So in my opinion, you should consider adding an index to your tables to improve performance.
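Separately from indexing, the loop in the question issues one query per product ID. As a sketch only, the pair table described in the question's update could instead be produced by a single self-join. The column aliases are taken from that table, and counting distinct orders is an assumption about what "bought together" should mean; I have not measured this against the per-product queries.

```sql
-- For every ordered pair of different products, count the orders that contain both.
-- COUNT(DISTINCT ...) guards against a product appearing on several rows of one order.
SELECT
    `a`.`product_id` AS `product_id_1`,
    `b`.`product_id` AS `product_id_2`,
    COUNT(DISTINCT `a`.`order_id`) AS `count`
FROM `sales_order_item` AS `a`
INNER JOIN `sales_order_item` AS `b`
    ON `a`.`order_id` = `b`.`order_id`
   AND `a`.`product_id` != `b`.`product_id`
GROUP BY `a`.`product_id`, `b`.`product_id`;
```

Each pair appears in both directions (A, B and B, A), which matches looking the data up per product. Running EXPLAIN on it first is a cheap way to check whether it scales on your data.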