
How to Test Mobile CPU and GPU Performance for Better Gaming Experience

UNLOCK A BETTER GAMING EXPERIENCE BY LEARNING HOW TO TEST AND OPTIMIZE MOBILE CPU AND GPU PERFORMANCE

Introduction

Mobile performance testing has always been a significant concern for developers and users alike, particularly when it comes to gaming. One of the critical aspects of gaming performance testing is monitoring the GPU's running state. PerfDog now supports detailed GPU information collection, initially on devices with Mali GPUs. This new feature provides more comprehensive data for targeted GPU optimization and gaming performance evaluation. In this blog, we discuss how to test the CPU and GPU of mobile devices during performance testing.

Why Test Mobile CPU and GPU Performance?

CPU and GPU are the two primary components responsible for a mobile device's overall performance, especially in gaming. By testing these components, developers can optimize their games, ensuring smoother gameplay, better graphics, and improved battery life. Additionally, performance testing helps identify potential bottlenecks or issues that may impact the user experience.

Testing CPU Performance

1. Benchmarking Tools: There are several benchmarking tools available for mobile devices, such as Geekbench, AnTuTu, and 3DMark. These tools provide an overall score for the device's CPU performance, allowing you to compare it with other devices or previous test results.

2. Monitoring CPU Usage: Using tools like PerfDog, you can monitor the CPU usage of your device during gaming or other resource-intensive tasks. This information can help you identify specific processes or applications that may be causing performance issues.

3. Stress Testing: Stress testing involves running the device at its maximum capacity for an extended period. This can help you identify any potential overheating issues or CPU throttling that may occur during intense gaming sessions.
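As a minimal sketch of what the CPU usage figure in point 2 means, the percentage can be derived by hand from two samples of the aggregate `cpu` line in `/proc/stat`. This assumes a Linux-based device where that file is accessible (on recent Android versions this typically requires `adb shell` rather than an app process). The function names and the sample strings are illustrative, and this is not a description of how PerfDog collects its data.

```python
def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest counters are skipped because they are already part of user/nice.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # idle + iowait
    return idle, sum(values)


def cpu_usage_percent(before, after):
    """CPU usage between two samples, as a percentage of all elapsed ticks."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    busy_delta = total_delta - (idle1 - idle0)
    return 100.0 * busy_delta / total_delta


# Made-up samples taken some time apart (values are cumulative ticks).
sample_before = "cpu  1000 0 500 8000 100 0 50 0 0 0"
sample_after = "cpu  1600 0 700 8150 110 0 90 0 0 0"
usage = cpu_usage_percent(sample_before, sample_after)
```

Sampling at a fixed interval during a stress test and logging the result next to the CPU frequency makes throttling easier to spot, since usage stays high while frequency drops.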

Testing GPU Performance

1. GPU Information Collection: PerfDog's latest feature allows you to collect detailed GPU information, such as Mali GPU Utilization, Mali Pixels Info, and Mali Memory & Bus Bandwidth. This data lets you analyze the GPU's performance during gaming, helping you optimize the graphics settings and identify potential bottlenecks.

2. Benchmarking Tools: Similar to CPU testing, benchmarking tools like 3DMark and GFXBench can provide an overall score for your device's GPU performance. These scores can be compared with other devices or previous test results to gauge the GPU's capabilities.

3. Real-time Monitoring: Tools like PerfDog enable real-time monitoring of GPU usage and frequency during gaming. This information can help you identify specific areas in the game that may require optimization or reveal any performance issues that may impact the user experience.

Mali GPU Utilization

Mali GPU Utilization consists of two performance metrics: Non-Fragment Utilization and Fragment Utilization. Non-Fragment Utilization is the percentage of the GPU's total processing time spent on non-fragment work, while Fragment Utilization is the percentage of the GPU's total processing time spent on fragment work.

The GPU processes different workloads through its basic processing pipeline. Mali GPU workloads are coordinated by a job manager that schedules work onto the processing units within the GPU. It exposes two FIFO job queues (job slots) to the graphics driver: one for non-fragment workloads, including vertex shading, tiling, geometry shading, tessellation shading, and compute shading, and another for fragment workloads, including rasterization, EarlyZ, FPK, fragment shading, blending, and tile write.

Performance metrics help identify whether the GPU bottleneck is in the Non-Fragment processing stage or the Fragment processing stage, guiding program optimization. If a GPU bottleneck occurs, either Non-Fragment Utilization or Fragment Utilization will typically be close to 100%. If both are below 100%, there may be a data dependency between Non-Fragment and Fragment.
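The rule above can be sketched as a small helper that labels a sample from its two utilization values. The function name and the 95% threshold used to approximate "close to 100%" are assumptions for illustration, not values defined by PerfDog.

```python
def classify_gpu_bottleneck(non_fragment_util, fragment_util, threshold=95.0):
    """Label a sample from its Non-Fragment and Fragment Utilization (both in percent).

    threshold is an assumed cut-off for "close to 100%".
    """
    if fragment_util >= threshold and fragment_util >= non_fragment_util:
        return "fragment"
    if non_fragment_util >= threshold:
        return "non-fragment"
    # Neither queue is saturated: the stages may be waiting on each other,
    # or the GPU may not be the bottleneck at all.
    return "neither saturated"


labels = [
    classify_gpu_bottleneck(40.0, 99.0),
    classify_gpu_bottleneck(98.0, 60.0),
    classify_gpu_bottleneck(55.0, 70.0),
]
```

Running this over every sample of a capture gives a quick view of which scenes are fragment-bound and which are non-fragment-bound.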

Causes and optimization suggestions for high Non-Fragment Utilization

1. Too many vertices:
   1.1 Check for a large number of invisible vertices. Optimization suggestions: use occlusion culling, frustum culling, and backface culling.
   1.2 Many visible vertices. Optimization suggestions: use LOD to simplify models and apply distance culling.

2. Large amounts of vertex attribute data. Optimization suggestion: use medium-precision attributes and remove unnecessary attributes.

3. Complex vertex shaders. Optimization suggestion: avoid sampling textures in the Vertex Shader and use low-precision variables for calculations.

4. Complex compute shaders, geometry shaders, or tessellation shaders.
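The LOD and distance culling suggestions in point 1.2 can be illustrated with a minimal sketch that picks a level of detail from the camera distance. The thresholds, the cull distance, and the function name are hypothetical; real engines usually drive this from projected screen size and their own configuration.

```python
import bisect


def select_lod(distance, lod_distances, cull_distance):
    """Return the LOD index (0 = most detailed), or None if the object is culled.

    lod_distances must be sorted ascending; entry i is the distance at which
    the object switches from LOD i to LOD i + 1.
    """
    if distance >= cull_distance:
        return None  # distance culling: the object is not submitted at all
    return bisect.bisect_right(lod_distances, distance)


# Hypothetical setup: four LOD meshes with decreasing vertex counts.
lod_distances = [10.0, 30.0, 60.0]
lod_vertex_counts = [12000, 4000, 1200, 300]

chosen = [select_lod(d, lod_distances, cull_distance=150.0) for d in (5.0, 45.0, 100.0, 200.0)]
vertices_submitted = sum(lod_vertex_counts[i] for i in chosen if i is not None)
```

In this example the four objects submit 13,500 vertices in total instead of the 48,000 they would cost at full detail, which directly reduces non-fragment work.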

Causes and optimization suggestions for high Fragment Utilization:

1. Too many fragments:
   1.1 Using Mask (alpha-tested) materials causes the EarlyZ and FPK mechanisms to fail. Optimization suggestions: render the depth of Mask meshes in advance, reduce the number of Mask mesh faces, and check whether Mask can be disabled.
   1.2 Excessive translucent pixels, such as particles. Optimization suggestions: reduce the number of particles and control particle size.

2. Complex Fragment Shader:
   2.1 ALU logic calculations are time-consuming. Optimization suggestions: avoid dynamic branches, avoid expensive functions, and move complex calculations to the vertex shader (VS) stage.
   2.2 Texture sampling is time-consuming. Optimization suggestions: reduce the number of texture samples, use compressed format textures, and avoid using anisotropic filtering methods.
   2.3 Load/Store operations are time-consuming. Optimization suggestions: use medium-precision variables and avoid using high-precision calculations.

Pixel Throughput

Pixel Throughput refers to the average number of GPU cycles consumed per shaded pixel, including both Non-Fragment and Fragment processing cycles. For example, if the GPU's maximum frequency is 800 MHz, the GPU usage is 100%, the game runs at a resolution of 1080*2340, and the FPS is 60, then:

Shaded Pixels per second (assuming no OverDraw) = 1080 * 2340 * 60 ≈ 151.6M
Pixel Throughput = 800M / 151.6M ≈ 5.28 cycles

This indicates that rendering each pixel requires an average of about 5.28 cycles under these conditions.
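The same arithmetic can be written as a small function, which makes it easy to repeat the calculation for other devices and frame rates. This is a sketch of the worked example above, not PerfDog's implementation; the function name and the optional `overdraw` parameter are assumptions.

```python
def cycles_per_pixel(gpu_freq_hz, gpu_utilization, width, height, fps, overdraw=1.0):
    """Average GPU cycles spent per shaded pixel.

    gpu_utilization is a fraction between 0 and 1.
    overdraw defaults to 1.0, i.e. every screen pixel is shaded exactly once.
    """
    shaded_pixels_per_second = width * height * fps * overdraw
    active_cycles_per_second = gpu_freq_hz * gpu_utilization
    return active_cycles_per_second / shaded_pixels_per_second


# Values from the example: 800 MHz, 100% usage, 1080*2340 at 60 FPS.
throughput = cycles_per_pixel(800e6, 1.0, 1080, 2340, 60)
```

Passing a measured overdraw value instead of the default shows how the per-pixel budget shrinks when each screen pixel is shaded more than once.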

Since this metric measures the average number of GPU cycles consumed per shaded pixel, it is generally related to the complexity of the Vertex Shader or Fragment Shader. The performance indicators Non-Fragment Utilization and Fragment Utilization can be used to determine which part is the bottleneck. If Fragment processing is the bottleneck, the Fragment Shader is likely complex, so more cycles are needed to process a single pixel.

Overdraw

Overdraw refers to the number of times a pixel is redrawn in a single frame. The OverDraw in PerfDog is the average OverDraw per second, calculated as:

OverDraw = Shaded Fragments per second / Screen Pixels per second

Assuming the game runs at 60 FPS with a resolution of 1080*2340, and the number of Shaded Fragments per second is 273M, then:

OverDraw = 273 * 1000000 / (1080 * 2340 * 60) ≈ 1.8

From the equation, it can be seen that with a fixed resolution and frame rate, the higher the OverDraw, the more fragments are processed per frame, and the higher the load. Once the load exceeds the GPU's maximum processing capacity, the frame rate will drop.
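A minimal sketch of the OverDraw formula, plus a rough estimate of the frame rate at which fragment load would exceed the GPU's capacity. The second function is an assumption-heavy illustration: it presumes the GPU is purely fragment-bound and that the cost per fragment is constant, and both function names are hypothetical.

```python
def overdraw(shaded_fragments_per_second, width, height, fps):
    """Average number of times each screen pixel is shaded per frame."""
    return shaded_fragments_per_second / (width * height * fps)


def estimated_max_fps(gpu_freq_hz, width, height, overdraw_factor, cycles_per_fragment):
    """Rough FPS ceiling, assuming fragment shading is the only GPU cost."""
    cycles_per_frame = width * height * overdraw_factor * cycles_per_fragment
    return gpu_freq_hz / cycles_per_frame


# Values from the example: 273M shaded fragments per second at 1080*2340, 60 FPS.
od = overdraw(273e6, 1080, 2340, 60)

# Hypothetical: 800 MHz GPU, 2 cycles per fragment, overdraw of 1.8 versus 3.0.
fps_low_overdraw = estimated_max_fps(800e6, 1080, 2340, 1.8, 2.0)
fps_high_overdraw = estimated_max_fps(800e6, 1080, 2340, 3.0, 2.0)
```

Under these assumptions, raising the overdraw from 1.8 to 3.0 pushes the estimated ceiling below 60 FPS, which is the frame rate drop described above.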

Causes and optimization suggestions for high Overdraw:

1. OverDraw in games is mainly caused by the rendering of AlphaTest and AlphaBlend objects. Due to the GPU's EarlyZ and FPK mechanisms, Opaque objects have a smaller impact on OverDraw.

2. Impact of AlphaTest objects on Overdraw: The depth of AlphaTest objects can only be determined after executing the Fragment Shader. The delayed depth writing affects the efficiency of HSR in the TBDR architecture, as subsequent pixels can only be processed further after the AlphaTest pixels have executed the Fragment Shader and updated the depth buffer.

Optimization suggestions for AlphaTest objects' Overdraw:

1. Render in front-to-back order.
2. Reduce the pixel area of AlphaTest triangles during art production.

3. Impact of AlphaBlend objects on Overdraw: AlphaBlend objects can be culled by EarlyZ when they are behind opaque objects. However, AlphaBlend triangles cannot cull each other through EarlyZ because they do not write depth, so overlapping translucent layers produce OverDraw during blending. In-game translucent particle effects and UI often cause OverDraw issues.

Optimization suggestions for AlphaBlend objects' Overdraw:

1. Reduce the number of translucent blending and overlay layers, for example by adjusting the particle count of translucent particle effects according to the device model.
2. Reduce the screen area covered by translucent pixels as much as possible. For example, when rendering particle effects or UI, use meshes that fit the visible shape closely instead of full rectangular quads.

Bus Read/Bus Write

Bus Read and Bus Write represent the number of bytes per second that the GPU reads from and writes to external shared memory through the system bus. Accessing external DDR memory consumes a lot of power; as a rough rule of thumb, about 100 mW per GB/s of bandwidth. In addition, accesses to external memory have much higher latency than accesses to the GPU's internal caches.

Bus Read bandwidth is mainly generated by the GPU's Load/Store Unit, Texture Unit, and Tile Unit. They read vertex input attribute data, Uniform data, TileList data, texture data, and color/depth data. The Bus Read size depends on the amount of data these units read per second and on the hit rate of the L1 and L2 caches. For a constant amount of data, the higher the cache hit rate, the smaller the Bus Read.

Bus Write bandwidth is mainly generated by the GPU's Load/Store Unit and Tile Unit. They write vertex output attribute data, TileList data, and color/depth data.
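To put the bandwidth numbers into perspective, the rule of thumb of roughly 100 mW per GB/s quoted above can be turned into a quick estimate. This is only a sketch: the function name is hypothetical, the coefficient is the approximate figure from the text rather than a measured value, and real power draw depends on the memory type and the SoC.

```python
def estimated_ddr_power_mw(bus_read_bytes_per_s, bus_write_bytes_per_s, mw_per_gb_per_s=100.0):
    """Approximate external memory power (in mW) attributable to GPU traffic."""
    total_gb_per_s = (bus_read_bytes_per_s + bus_write_bytes_per_s) / 1e9
    return total_gb_per_s * mw_per_gb_per_s


# Hypothetical capture: 2.5 GB/s Bus Read and 1.5 GB/s Bus Write.
power_mw = estimated_ddr_power_mw(2.5e9, 1.5e9)

# Saving 1 GB/s of texture reads would cut roughly 100 mW under this rule of thumb.
power_after_mw = estimated_ddr_power_mw(1.5e9, 1.5e9)
```

An estimate like this is mainly useful for ranking optimizations: a change that removes 1 GB/s of traffic is worth far more than one that removes a few tens of MB/s.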

Causes and optimization suggestions for high Bus Read bandwidth:

  1. Vertex attribute bandwidth:
     1.1 Reduce the number of vertices: use occlusion culling, frustum culling, distance culling, and LOD.
     1.2 Use separate buffers for vertex position attributes and other attributes.
     1.3 Remove vertex input attribute data that is not calculated in the Vertex Shader.
     1.4 Try using medium-precision attributes.

  2. Texture bandwidth:
     2.1 Use ETC2, ASTC, or other compressed formats.
     2.2 Using MipMap can improve Cache hit rate and reduce bandwidth.
     2.3 Avoid using anisotropic filtering methods.
     2.4 Keep the texture coordinates of adjacent pixels as continuous as possible, since jumps between them lower the Cache hit rate.

  3. Color/depth buffer bandwidth:
     3.1 When rendering shadows, bind only the depth buffer to the Framebuffer and disable the color buffer.
     3.2 For post-processing render targets (RT), bind only the color buffer and disable the depth buffer.
     3.3 At the beginning of each frame, call the glClear function to clear the color, depth, and stencil buffers to prevent reloading the previous frame's Framebuffer data.
     3.4 Avoid calling functions like glReadPixels to get pixels from the Framebuffer.
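To illustrate why point 2.1 matters, the following sketch compares the memory footprint of an uncompressed RGBA8 texture with the same texture in a few ASTC block sizes. It relies on the fact that every ASTC block occupies 128 bits regardless of its footprint. The function names are illustrative, and footprint is only a proxy for bandwidth, since actual Bus Read also depends on cache hit rates and access patterns.

```python
import math

ASTC_BITS_PER_BLOCK = 128  # fixed for every ASTC block size


def rgba8_size_bytes(width, height):
    """Size of one mip level stored as uncompressed 8-bit RGBA."""
    return width * height * 4


def astc_size_bytes(width, height, block_w, block_h):
    """Size of one mip level stored as ASTC with the given block footprint."""
    blocks_x = math.ceil(width / block_w)
    blocks_y = math.ceil(height / block_h)
    return blocks_x * blocks_y * ASTC_BITS_PER_BLOCK // 8


# A 1024*1024 texture, top mip level only.
uncompressed = rgba8_size_bytes(1024, 1024)
astc_4x4 = astc_size_bytes(1024, 1024, 4, 4)
astc_6x6 = astc_size_bytes(1024, 1024, 6, 6)
astc_8x8 = astc_size_bytes(1024, 1024, 8, 8)
```

For this texture, ASTC 4x4 is a quarter of the uncompressed size and ASTC 8x8 is one sixteenth, at the cost of lower image quality as the block footprint grows.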

Causes and optimization suggestions for high Bus Write bandwidth:

  1. Vertex output attribute bandwidth:
     1.1 Reduce the number of vertices: use occlusion culling, frustum culling, distance culling, and LOD.
     1.2 Use medium-precision output attributes as much as possible.

  2. TileList bandwidth:
     2.1 Reduce the number of triangles, especially micro-triangles. They can be culled according to ScreenSize.
     2.2 Check that backface culling is enabled.

  3. Color/depth buffer bandwidth:
     3.1 When rendering shadows, bind only the depth buffer to the Framebuffer and disable the color buffer.
     3.2 For post-processing render targets (RT), bind only the color buffer and disable the depth buffer.
     3.3 Bind each Framebuffer only once, and submit all rendering commands that target it before unbinding it or using its result.

Conclusion

Performance testing is crucial for ensuring a smooth and enjoyable gaming experience on mobile devices. By testing the CPU and GPU, developers can optimize their games, resulting in better graphics, smoother gameplay, and improved battery life. With tools like PerfDog and its new GPU information collection feature, developers can access detailed information about their mobile device's performance, allowing for more targeted optimization and performance evaluation. Stay tuned for more GPU information updates from PerfDog in the future.
 
