Interviewer: How would you query 10 million rows of data?
Aug 15, 2023, 04:34 PM
Recently I have been doing mock interviews and resume reviews, and I found that many people go weak at the knees when they see interview questions about tables with tens of millions of rows.
Maybe some of them have never worked with a table that large, so they have no idea what actually happens when you query tens of millions of rows.
Today I will walk you through a hands-on test. Everything below was run against MySQL 5.7.26.
Preparing data
What if I don't have 10 million rows of data?
No data? Can't you create it yourself?
Is generating data hard?
Create 10 million rows from application code?
That's not realistic: it is far too slow and could easily take a whole day. A database script runs much faster.
Create table
CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
Create data script
Batch inserts are used for efficiency, with a commit every 1,000 rows; if a single batch gets too large, the batch insert itself slows down.
DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  set @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  set @execData = '';
  WHILE i <= 10000000 DO
    set @attr = "'測試很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長很長的屬性'";
    set @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', '用戶登錄操作'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    if i % 1000 = 0 then
      set @stmtSql = concat(@execSql, @execData, ";");
      prepare stmt from @stmtSql;
      execute stmt;
      DEALLOCATE prepare stmt;
      commit;
      set @execData = "";
    else
      set @execData = concat(@execData, ",");
    end if;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
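With the procedure created, you still need to run it; the original script stops at the definition, so here is a minimal way to invoke it:

-- Generate the test data (on a slow machine this can take a long time)
CALL batch_insert_log();

-- Optional cleanup once the data has been generated
DROP PROCEDURE IF EXISTS batch_insert_log;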
Start testing
My computer is fairly low-spec: Windows 10, a standard-voltage i5, and an SSD with read/write speeds of around 500 MB/s.
Because of the low spec, I only generated 3,148,000 rows for this test. They occupy about 5 GB of disk (with no extra indexes), and the script ran for 38 minutes. If your machine is better, feel free to insert more data.
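If you want to verify the on-disk size yourself, one option (my addition, not part of the original test) is to ask information_schema for the table's data and index length:

-- Approximate size of the table in GB (data plus indexes)
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = 'test'
  AND table_name = 'user_operation_log';

With the data in place, a simple full count already shows how heavy the table is: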
SELECT count(1) FROM `user_operation_log`
Return result: 3148000
The three query times are:
14060 ms 13755 ms 13447 ms
Regular paginated queries
MySQL supports the LIMIT clause for selecting a given number of rows; Oracle can use ROWNUM for the same purpose.
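For comparison only, a classic Oracle-style ROWNUM pagination might look like the sketch below; it is not part of this MySQL test.

-- Oracle-style pagination: rows 10001 to 10010
SELECT *
FROM (
    SELECT t.*, ROWNUM AS rn
    FROM user_operation_log t
    WHERE ROWNUM <= 10010
)
WHERE rn > 10000;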
The MySQL pagination syntax is as follows:
SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
The first parameter specifies the offset of the first row to return; the second parameter specifies the maximum number of rows to return.
Now let's start testing query performance:
SELECT * FROM `user_operation_log` LIMIT 10000, 10
The three runs took:
59 ms 49 ms 50 ms
That looks acceptable, but this is a local database, so it is naturally a bit faster.
Let's test from a different angle.
Same offset, different numbers of rows
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000
The query times were as follows:
Quantity | First time | Second time | Third time
---|---|---|---
10 rows | 53 ms | 52 ms | 47 ms
100 rows | 50 ms | 60 ms | 55 ms
1,000 rows | 61 ms | 74 ms | 60 ms
10,000 rows | 164 ms | 180 ms | 217 ms
100,000 rows | 1609 ms | 1741 ms | 1764 ms
1,000,000 rows | 16219 ms | 16889 ms | 17081 ms

From these results we can conclude: the more rows returned, the longer the query takes.
Same number of rows, different offsets

Offset | First time | Second time | Third time
---|---|---|---
100 | 36 ms | 40 ms | 36 ms
1000 | 31 ms | 38 ms | 32 ms
10000 | 53 ms | 48 ms | 51 ms
100000 | 622 ms | 576 ms | 627 ms
1000000 | 4891 ms | 5076 ms | 4856 ms
From these results we can conclude: the larger the offset, the longer the query takes.
SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr FROM `user_operation_log` LIMIT 100, 100
How to optimize
Now that all the experimenting above has given us conclusions, let's tackle the two problems separately: large offsets and large result sets.
Optimizing for large offsets
Using a subquery
We can first locate the id at the offset position, and then query the data from there.
SELECT * FROM `user_operation_log` LIMIT 1000000, 10
SELECT id FROM `user_operation_log` LIMIT 1000000, 1
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10
The results were as follows:
SQL | Time taken
---|---
1st statement | 4818 ms
2nd statement (without index) | 4329 ms
2nd statement (with index) | 199 ms
3rd statement (without index) | 4319 ms
3rd statement (with index) | 201 ms
From these results we can conclude:
The first statement takes the longest, and the third is only marginally better than the first; the subquery is much faster when it can use an index.
Drawback: it only applies when id is monotonically increasing.
If id is not increasing, you can use the following form instead, but its drawback is that the paginated query has to live inside a subquery.
Note: some MySQL versions do not support LIMIT inside an IN clause, which is why multiple nested SELECTs are used here.
SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)
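To confirm that these rewrites really resolve the offset through the primary key, you can inspect the plan with EXPLAIN; this is a quick sanity check of my own rather than a measurement from the original test:

-- The inner LIMIT subquery only reads the id column, so it can be satisfied
-- from the primary key; the outer query then fetches the 10 rows by id.
EXPLAIN
SELECT *
FROM `user_operation_log`
WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1)
LIMIT 10;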
Limiting by id range
This approach is more demanding: the ids must be continuous and increasing, and you have to calculate the id range yourself before using BETWEEN. The SQL looks like this:
SELECT * FROM `user_operation_log` WHERE id BETWEEN 1000000 AND 1000100 LIMIT 100
SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100
The results were as follows:
SQL | Time taken
---|---
1st statement | 22 ms
2nd statement | 21 ms
The results show that this approach is very fast.
Note: the LIMIT here only caps the number of rows returned; no offset is involved.
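To turn this into actual pagination, the application has to derive the id range from the page number. Below is a minimal sketch under the same assumption as above, namely that ids start at 1 and have no gaps; the @page and @page_size variables are hypothetical names of my own.

-- Hypothetical paging inputs: 0-based page number and page size
SET @page := 10000;
SET @page_size := 100;

-- First id of the requested page, assuming ids run 1, 2, 3, ... without gaps
SET @start_id := @page * @page_size + 1;

-- LIMIT must be a literal here (user variables are not allowed in LIMIT),
-- so keep it in sync with @page_size
SELECT *
FROM `user_operation_log`
WHERE id >= @start_id
LIMIT 100;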
Optimizing for large result sets
The amount of data returned also directly affects query speed.
SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000
The results were as follows:
SQL | Time taken
---|---
1st statement | 15676 ms
2nd statement | 7298 ms
3rd statement | 15960 ms
The results show that cutting out unneeded columns can significantly improve query efficiency.
The first and third queries run at almost the same speed, so at this point you will surely object: why bother writing out all those columns when a simple * does the job?
Note that my MySQL server and client are on the same machine, which is why their timings are so close; if you can, run the client and MySQL on separate machines and test again.
Isn't SELECT * good enough?
By the way, let me add a word on why SELECT * is banned. Isn't it appealing precisely because it is simple and brainless?
Two main points:
1. With SELECT *, the database has to resolve more objects, columns, permissions, attributes and other metadata; when the SQL statements are complex and hard parses are frequent, this puts a heavy load on the database.
2. It increases network overhead. Useless, large text columns such as logs or IconMD5 values can get pulled in by mistake, and the size of the transferred data grows dramatically. This overhead is especially noticeable when MySQL and the application are not on the same machine.