Joining tables is a crucial operation in MySQL, and many students have concerns about it. They wonder how much a join costs in performance, how to index the joined tables, and whether it is better to run several separate queries or a single join. To address these concerns, I will walk through a practical example of optimizing a join query.
-- Person - Group many-to-one relationship table
CREATE TABLE `Person_Group` (
`person_id` int(11) unsigned NOT NULL COMMENT 'User ID',
`group_id` int(11) unsigned NOT NULL COMMENT 'Group ID',
`extend` varchar(1000) DEFAULT '[]' COMMENT 'Additional permissions',
PRIMARY KEY (`person_id`),
KEY `person_group` (`person_id`,`group_id`),
KEY `group` (`group_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Group - Privilege many-to-many relationship table
CREATE TABLE `Group_Privilege` (
`group_id` int(11) unsigned NOT NULL COMMENT 'Group ID',
`privilege_id` int(11) unsigned NOT NULL COMMENT 'Permission',
PRIMARY KEY (`group_id`,`privilege_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Statement to be optimized
SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008;
The purpose of the SQL statement is to retrieve the IDs of all users who hold permission 20008. Although the SQL is simple, it takes about 30 milliseconds to execute, which suggests a performance issue worth optimizing. To understand how this join is executed, let's use the EXPLAIN command.
EXPLAIN SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Person_Group
type: ALL
possible_keys: group
key: NULL
key_len: NULL
ref: NULL
rows: 988054
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: Group_Privilege
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: test.Person_Group.group_id,const
rows: 1
Extra: Using index
2 rows in set (0.00 sec)
The EXPLAIN output shows that the main cause of the slow join is the full scan of the Person_Group table (type: ALL). Person_Group is a large table with nearly one million rows, while Group_Privilege is a much smaller table with just over 300 rows. This is a typical large table-small table join, which is quite common in everyday SQL. If MySQL scans the large table during the query, performance suffers significantly. The general way to optimize such SQL is to add indexes so that the large table is not fully scanned. However, the WHERE clause in this statement restricts the privilege_id of Group_Privilege. An index on privilege_id would only reduce the number of rows read from the small table Group_Privilege; it would not reduce the scan of Person_Group. At this point the optimization seemed to have reached an impasse.
After I discussed this with experienced colleagues, they suggested swapping the order of the large table and the small table in the join to see how the results would be affected. Since neither Person_Group nor Group_Privilege contains NULL group_id values, and assuming every group that holds the privilege has at least one member, the results are the same whichever way the tables are joined. In addition, their rule of thumb was that small table-large table joins run faster than large table-small table joins. With this in mind, I decided to give it a try and modified the query accordingly.
SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008;
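One caveat about swapping an inner join for a RIGHT JOIN is that the two only return the same rows when every Group_Privilege row that passes the WHERE filter has at least one matching Person_Group row. The following is a minimal Python sketch of that difference. The toy rows and the helper names inner_join and right_join are hypothetical and not part of the original schema or data.

```python
# Hypothetical toy rows; the real tables are far larger.
person_group = [(1, 10), (2, 10), (3, 11)]             # (person_id, group_id)
group_privilege = [(10, 20008), (12, 20008), (11, 5)]  # (group_id, privilege_id)


def inner_join(privilege_id):
    """person_id for every person whose group holds the privilege."""
    return [
        pid
        for gid, priv in group_privilege if priv == privilege_id
        for pid, pgid in person_group if pgid == gid
    ]


def right_join(privilege_id):
    """Like inner_join, but a privilege row with no matching person yields None."""
    result = []
    for gid, priv in group_privilege:
        if priv != privilege_id:
            continue
        matches = [pid for pid, pgid in person_group if pgid == gid]
        # RIGHT JOIN keeps the Group_Privilege row even when nobody is in the group.
        result.extend(matches if matches else [None])
    return result


print(inner_join(20008))  # [1, 2]
print(right_join(20008))  # [1, 2, None]  (group 12 has no members)
```

In this toy data group 12 holds the privilege but has no members, so the RIGHT JOIN emits one extra NULL row. If every group has members, as is assumed for the real tables, the two results coincide.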
Let's use EXPLAIN again to see how the query is executed.
EXPLAIN SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Group_Privilege
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 338
Extra: Using where; Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: Person_Group
type: ref
possible_keys: group
key: group
key_len: 4
ref: test.Group_Privilege.group_id
rows: 247293
Extra: Using index
2 rows in set (0.00 sec)
Even though the small table read first has only 338 rows, the estimate for the large table read afterward is 247,293 rows per lookup. Multiplying the two gives 338 × 247,293 = 83,585,034 row combinations, which is far larger than the previous 988,054. On paper, changing the order has made the performance even worse. However, let's just execute the query and see how it performs.
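The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the numbers are the rows values from the two EXPLAIN outputs, and the helper name estimated_row_combinations is made up for illustration.

```python
import math

# Row estimates copied from the "rows" column of the two EXPLAIN outputs above,
# outermost table first.
inner_join_plan = [988054, 1]      # Person_Group (ALL), then Group_Privilege (eq_ref)
right_join_plan = [338, 247293]    # Group_Privilege (index), then Person_Group (ref)


def estimated_row_combinations(plan):
    """Multiply the per-table row estimates of a nested-loop plan."""
    return math.prod(plan)


print(estimated_row_combinations(inner_join_plan))   # 988054
print(estimated_row_combinations(right_join_plan))   # 83585034
```

These products are only as good as the per-table estimates they are built from, which matters for what follows.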
*************************** 13. row ***************************
Query_ID: 1
Duration: 0.02994200
Query: SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008
*************************** 14. row ***************************
Query_ID: 2
Duration: 0.00039700
Query: SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008
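As a quick sanity check, the two durations from the profile output above can be compared directly. This small Python sketch only restates the measured values; the variable names are my own.

```python
# Durations in seconds, copied from the profile output above.
inner_join_seconds = 0.02994200
right_join_seconds = 0.00039700

speedup = inner_join_seconds / right_join_seconds

print(f"INNER JOIN: {inner_join_seconds * 1000:.2f} ms")
print(f"RIGHT JOIN: {right_join_seconds * 1000:.2f} ms")
print(f"speedup: about {speedup:.0f}x")
```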
Upon reviewing the results, I was astonished to discover that the small table-large table join took only about 0.4 milliseconds, making it more than 70 times faster than the original query instead of slower. This outcome seemed counterintuitive and almost unscientific! I quickly composed myself and sought an explanation in the book "High Performance MySQL." According to the text, MySQL executes every join as a nested loop join. This means that it first iterates through the rows of the first table, then uses each retrieved row to search for matching rows in the next table, until all rows have been found. In the case of our original inner join query, MySQL would process it as follows.
outerIter = iterator over Person_Group
outerRow = outerIter.next
while outerRow    # This loop runs 988,054 times
    innerIter = iterator over Group_Privilege where group_id=outerRow.group_id and privilege_id=20008    # This directly accesses the composite primary key (group_id and privilege_id)
    innerRow = innerIter.next
    while innerRow
        output[outerRow.person_id]
        innerRow = innerIter.next
    end
    outerRow = outerIter.next
end
Once the order is changed to a right join, the pseudo-code for MySQL execution becomes:
outer_iter = iterator over Group_Privilege where privilege_id=20008
outer_row = outer_iter.next
while outer_row    # With no privilege_id index, all 338 rows are scanned to find the matches
    inner_iter = iterator over Person_Group where group_id=outer_row.group_id    # This directly hits the Person_Group index on group_id
    inner_row = inner_iter.next
    if inner_row
        while inner_row
            output[inner_row.person_id]
            inner_row = inner_iter.next
        end
    else
        output[NULL]    # RIGHT JOIN keeps the outer row even without a match
    end
    outer_row = outer_iter.next
end
The pseudo-code reveals the key issue. The original query first scanned more than 900,000 rows of the Person_Group table, and for each of them probed Group_Privilege, where the primary key answered in a single lookup. The outer loop therefore ran nearly a million times. With the order changed, MySQL only needs to scan the 338 rows of Group_Privilege first, and then uses the group_id index on Person_Group for the inner lookup. This makes the inner lookup fast and removes the need to scan the whole of Person_Group. This is why joining a small table to a large table is much faster here than joining a large table to a small table. The key to optimizing multi-table queries is using indexes correctly. For the same table structure, swapping the outer and inner tables may not look significant, but it changes which indexes the query uses, and that produces a substantial performance difference.
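To make the difference between the two loop orders concrete, here is a small Python simulation. It is a sketch under stated assumptions: the table contents are synthetic (10,000 persons in 10 groups, 292 privilege rows), the set and dictionary merely stand in for the primary key and the group index, and the function names are hypothetical.

```python
# Synthetic stand-ins for the real tables; the sizes are made up.
person_group = [(pid, pid % 10) for pid in range(10_000)]   # (person_id, group_id)
group_privilege = [                                          # (group_id, privilege_id)
    (gid, priv)
    for gid in range(10)
    for priv in range(20000, 20030)
    if priv != 20008 or gid in (1, 7)    # only groups 1 and 7 hold privilege 20008
]

# Stand-in for PRIMARY KEY (group_id, privilege_id) on Group_Privilege.
pk_group_privilege = set(group_privilege)

# Stand-in for the index on Person_Group.group_id.
idx_person_by_group = {}
for person_id, group_id in person_group:
    idx_person_by_group.setdefault(group_id, []).append(person_id)


def large_table_first(privilege_id):
    """Outer loop over Person_Group, primary-key probe into Group_Privilege."""
    outer_rows, result = 0, []
    for person_id, group_id in person_group:
        outer_rows += 1
        if (group_id, privilege_id) in pk_group_privilege:
            result.append(person_id)
    return result, outer_rows


def small_table_first(privilege_id):
    """Outer loop over Group_Privilege, index lookup into Person_Group."""
    outer_rows, result = 0, []
    for group_id, priv in group_privilege:
        outer_rows += 1
        if priv == privilege_id:
            result.extend(idx_person_by_group.get(group_id, []))
    return result, outer_rows


result_a, rows_a = large_table_first(20008)
result_b, rows_b = small_table_first(20008)
print(rows_a, rows_b)                          # 10000 292
print(sorted(result_a) == sorted(result_b))    # True
```

Both orders return the same persons, but the outer loop runs 10,000 times in one case and 292 times in the other. The small-table-first order still has to read every matching index entry, yet it never touches persons whose group lacks the privilege.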
Another question arises: why does EXPLAIN report 247,293 rows for the inner lookup of the small table-large table join? Shouldn't there be far fewer rows if an index is used? I don't know exactly how MySQL arrives at 247,293, but the figure is an estimate, not a count. MySQL knows that the index on group_id will be used for the inner lookup, but since EXPLAIN does not execute the SQL, it does not know which group_id values the outer scan will produce. MySQL therefore cannot determine how many rows the index will hit, and cannot accurately judge how many rows will be read. In other words, the rows value that EXPLAIN reports for the inner table may be quite inaccurate.
Since simply switching to a RIGHT JOIN improves performance so much, let's keep optimizing. The RIGHT JOIN query has no privilege_id index to use for the outer table, so it scans all 338 Group_Privilege index entries. The table is small, but the scan is still avoidable. After adding an index on privilege_id (it shows up as privilege in the output below) and running EXPLAIN again, we find that the outer table only needs to read 3 rows.
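One plausible reading, and it is only an assumption on my part, is that when the lookup value is unknown at planning time the optimizer falls back on an average taken from index statistics: total rows divided by the number of distinct key values. The figures above fit that reading, since 988,054 / 247,293 is almost exactly 4. The Python sketch below uses a hypothetical skewed distribution to show how far such an average can be from the real per-group counts.

```python
from collections import Counter


def average_rows_per_key(group_ids):
    """Table rows divided by the number of distinct key values."""
    return len(group_ids) / len(set(group_ids))


# Hypothetical distribution: 1,000 persons spread very unevenly over 4 groups.
group_ids = [1] * 700 + [2] * 200 + [3] * 90 + [4] * 10

estimate = average_rows_per_key(group_ids)
actual = Counter(group_ids)

print(estimate)    # 250.0, the same guess for every group
print(actual[1])   # 700 rows really match group 1
print(actual[4])   # 10 rows really match group 4

# The same ratio for the figures in the EXPLAIN output above.
implied_distinct_groups = 988054 / 247293
print(round(implied_distinct_groups, 2))   # 4.0
```

Whatever the exact mechanism, an average of this kind says nothing about the specific groups that hold privilege 20008, which is why the estimate and the measured run time can disagree so badly.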
EXPLAIN SELECT Person_Group.person_id FROM Person_Group RIGHT JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Group_Privilege
type: ref
possible_keys: privilege
key: privilege
key_len: 4
ref: const
rows: 3
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: Person_Group
type: ref
possible_keys: group
key: group
key_len: 4
ref: test.Group_Privilege.group_id
rows: 247011
Extra: Using index
2 rows in set (0.00 sec)
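The drop from 338 rows to 3 rows in the outer table can be pictured with a small Python sketch. The data is synthetic (338 rows, 3 of which carry privilege 20008, chosen to mirror the EXPLAIN figures), and the dictionary is only a stand-in for the secondary index; the function names are hypothetical.

```python
from collections import defaultdict

# Synthetic Group_Privilege: 335 unrelated rows plus 3 rows with privilege 20008.
group_privilege = [(gid, 10000 + gid) for gid in range(335)]   # (group_id, privilege_id)
group_privilege += [(gid, 20008) for gid in (400, 401, 402)]


def scan_without_index(privilege_id):
    """Read every row and keep the matches, like the type: index plan."""
    examined, hits = 0, []
    for group_id, priv in group_privilege:
        examined += 1
        if priv == privilege_id:
            hits.append(group_id)
    return hits, examined


# Stand-in for the secondary index on privilege_id.
privilege_index = defaultdict(list)
for group_id, priv in group_privilege:
    privilege_index[priv].append(group_id)


def lookup_with_index(privilege_id):
    """Jump straight to the matching entries, like the type: ref plan."""
    hits = list(privilege_index.get(privilege_id, []))
    return hits, len(hits)


print(scan_without_index(20008))   # ([400, 401, 402], 338)
print(lookup_with_index(20008))    # ([400, 401, 402], 3)
```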
The optimization of this statement seemed nearly perfect at this point, but when I casually ran EXPLAIN on the original inner join SQL, I was shocked to see that everything had changed!
EXPLAIN SELECT Person_Group.person_id FROM Person_Group JOIN Group_Privilege ON Person_Group.group_id=Group_Privilege.group_id WHERE Group_Privilege.privilege_id=20008\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Group_Privilege
type: ref
possible_keys: PRIMARY,privilege
key: privilege
key_len: 4
ref: const
rows: 3
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: Person_Group
type: ref
possible_keys: group
key: group
key_len: 4
ref: test.Group_Privilege.group_id
rows: 247011
Extra: Using index
2 rows in set (0.00 sec)
For the INNER JOIN statement, MySQL originally used Person_Group as the outer table and Group_Privilege as the inner table. After the index was added, it switched to Group_Privilege as the outer table and Person_Group as the inner table. With the index in place, the execution plans of the INNER JOIN and the RIGHT JOIN become identical! I can't help but marvel at MySQL's mysterious optimization mechanism. In fact, for an INNER JOIN, reading the left table first or the right table first does not affect the query results, so MySQL weighs various factors to choose the join order it estimates to be cheapest. MySQL usually chooses well, but not always, as this example shows: before the index was added, it picked the order that turned out to be far slower. SQL optimization is therefore a process of accumulating knowledge and experience. Only through continuous practice, analysis, and optimization in real-world scenarios can we achieve the best results.
To summarize briefly: MySQL has built-in optimizations for joins, but they are not always reliable. Identify the poorly performing SQL queries in your own work, check their execution plans against actual data, adjust the join order where needed, and create appropriate indexes to improve query speed.
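The optimizer's change of mind can be illustrated with a deliberately simplified cost model in Python: estimate each join order as outer rows times inner rows per lookup, and pick the smaller product. This is a sketch, not MySQL's actual cost model, which weighs more factors. The row figures come from the EXPLAIN outputs above, except that the Person_Group-first estimate after adding the index is assumed to be unchanged, and the function names are hypothetical.

```python
def plan_cost(outer_rows, inner_rows_per_lookup):
    """Toy cost: number of row combinations a nested loop would consider."""
    return outer_rows * inner_rows_per_lookup


def cheapest_outer_table(plans):
    """Return the outer table whose plan has the lowest toy cost."""
    return min(plans, key=lambda table: plan_cost(*plans[table]))


# (outer rows, estimated inner rows per lookup), keyed by the outer table.
before_index = {
    "Person_Group": (988054, 1),
    "Group_Privilege": (338, 247293),
}
after_index = {
    "Person_Group": (988054, 1),        # assumed unchanged by the new index
    "Group_Privilege": (3, 247011),
}

print(cheapest_outer_table(before_index))   # Person_Group
print(cheapest_outer_table(after_index))    # Group_Privilege
```

Under this toy model the estimates favor Person_Group as the outer table before the index exists (988,054 against 83,585,034) and Group_Privilege afterwards (741,033 against 988,054), which matches the plans MySQL chose. The measured timings show that the first choice rested on a misleading inner-row estimate.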