Crash capture and analysis is an enduring topic in software development. Since the inception of software, crashes have posed serious challenges, and crash rates remain a crucial metric for evaluating software quality. Addressing crash issues is a fundamental requirement. From the PC era to the mobile era, numerous tools have been developed for crash capture and analysis. So, what novel problems and challenges does Tencent Games' new crash analysis system tackle?
In essence, there are three main points:
The need for deep, vertical support at the game engine layer.
The growing trend of multi-platform game releases, which calls for a unified view of crashes across platforms.
The need for overseas projects to comply with local data compliance regulations.
Given these new trends and demands, we ultimately constructed a new crash management platform that now encompasses nearly all Tencent Games' projects, both domestically and internationally. Although it is based on the unique needs of the gaming industry, this article offers a universal framework of thought and problem-solving methodologies, which we aim to share with our peers in this field.
Regarding the monitoring scope, we will first outline the overall approach and then delve into the practice of monitoring FOOM issues as an illustrative example.
Fundamental requirements for crash monitoring:
Crash Capture Completeness: a typical failure scenario is a user reporting a crash that the platform never detected.
Comprehensive Critical Information Reporting: another common scenario is having a crash record but not enough information to pinpoint the cause.
Our platform has optimized and refined nearly all of the points above. For the crucial "stack trace" information, for example, beyond making stack restoration more robust, it also supports restoring inlined functions and function parameter values, helping developers locate problems more precisely. For "custom data" support, custom data attached by the game can be automatically aggregated and analyzed, improving the efficiency of locating and reproducing problems (a hypothetical example of such data follows below).
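As a purely hypothetical illustration of the kind of custom data a game might attach so the platform can aggregate it, the sketch below sets a few gameplay-related key-value pairs; the reporter protocol and method names are placeholders, not a real SDK API.

```swift
// Hypothetical custom-data hook: the protocol and method names are
// placeholders for whatever crash-reporting SDK is actually in use.
protocol CrashReporter {
    func setCustomValue(_ value: String, forKey key: String)
}

// Attach gameplay context so crashes can later be aggregated by scene,
// player count, or GPU family on the analysis side.
func tagGameplayContext(reporter: CrashReporter,
                        sceneName: String,
                        playerCount: Int,
                        gpuFamily: String) {
    reporter.setCustomValue(sceneName, forKey: "scene")
    reporter.setCustomValue(String(playerCount), forKey: "players")
    reporter.setCustomValue(gpuFamily, forKey: "gpu")
}
```

Here, we specifically share the practice of FOOM (Foreground Out of Memory) monitoring.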
In FOOM scenarios there is no direct system signal, so only indirect methods can be used for monitoring. The industry's conventional approach, based on Facebook's 2015 article "Reducing FOOMs in the Facebook iOS App," excludes every termination that can be positively explained (ordinary crashes, app upgrades, user-initiated kills, and so on) and treats whatever remains as FOOM by process of elimination. However, this solution suffers from a relatively high false-positive rate: after the simple cases are excluded, many terminations remain that are not FOOM at all. Domestic vendors mostly optimize on this foundation by excluding further scenarios to lower the false-positive rate, but the elimination approach itself remains unchanged.
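For concreteness, a minimal sketch of this elimination-style heuristic, run at the next app launch, is shown below. All of the flags are assumed to have been persisted during the previous session; their names are illustrative rather than any real SDK's API.

```swift
// Elimination-style FOOM heuristic (illustrative; not a real SDK API).
// The flags below are assumed to be persisted during the previous session,
// e.g. in UserDefaults, and read back at the next launch.
struct PreviousSessionState {
    let appVersion: String
    let osVersion: String
    let crashReportWritten: Bool   // a signal/exception handler fired
    let terminatedByUser: Bool     // applicationWillTerminate was called
    let didExitCleanly: Bool       // a normal exit was recorded
    let wasInForeground: Bool      // last known application state
}

func looksLikeFOOM(previous: PreviousSessionState,
                   currentAppVersion: String,
                   currentOSVersion: String) -> Bool {
    // Exclude every termination that can be positively explained ...
    if previous.crashReportWritten { return false }                // ordinary crash
    if previous.terminatedByUser { return false }                  // user kill
    if previous.didExitCleanly { return false }                    // normal exit
    if previous.appVersion != currentAppVersion { return false }   // app upgrade
    if previous.osVersion != currentOSVersion { return false }     // OS upgrade
    // ... and treat whatever remains in the foreground as FOOM.
    // This is precisely where false positives creep in: deadlocks, watchdog
    // kills, CPU abuse, and other system terminations also fall through here.
    return previous.wasInForeground
}
```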
Yet in practice, even after simple elimination, many kinds of termination still remain, some of them harder to identify than OOM: excessive CPU scheduling, thread deadlocks, startup timeouts, excessive resource usage, runtime permission changes, missing program dependencies, and a dozen or so other situations in which the system forcibly terminates the app. Rather than trying to exclude all of these, it is more convenient and more accurate to determine OOM directly, that is, to check whether memory usage exceeded the system's OOM threshold.
This platform innovatively uses big-data statistics to compute the OOM threshold directly for each combination of memory capacity, device model, and system version. By examining the app's memory usage in the final period before it was forcibly terminated, we can then determine whether the termination was an OOM.
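On iOS, the quantity the system compares against its OOM (jetsam) threshold is the process's physical memory footprint, which an SDK can sample periodically through the task_info Mach call. Below is a minimal sketch of such a sampler; how often it runs and where the latest reading is persisted are assumptions left to the surrounding reporting SDK.

```swift
import Darwin

/// Reads the current physical memory footprint of the process, the value the
/// iOS jetsam mechanism compares against the per-device OOM threshold.
/// Returns nil if the Mach call fails.
func currentMemoryFootprint() -> UInt64? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<natural_t>.size)
    let result = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { intPtr in
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), intPtr, &count)
        }
    }
    guard result == KERN_SUCCESS else { return nil }
    return info.phys_footprint
}
```

Sampling this value on a timer and persisting the most recent reading is what provides the "memory usage at the moment before termination" used in the analysis below.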
Taking iOS as an example, its system OOM threshold correlates with memory capacity, device model, and system version. The combination of these dimensions is vast. For instance:
"RAM: 3GB, Device Model: iPhone X, System Version: iOS 12" has an OOM threshold of 1800MB.
"RAM: 3GB, Device Model: iPhone X, System Version: iOS 13" has an OOM threshold of 1849MB with a system version change.
"RAM: 3GB, Device Model: iPhone 11, System Version: iOS 13" has an OOM threshold of 2098MB with a device model change.
Using big data, we plot the memory usage, at the moment before termination, of all devices forcibly terminated under a given combination of parameters. The resulting curve shows a clear upper limit that no sample exceeds; that line is the system OOM threshold for those parameters, and devices that hit it are forcibly terminated by the system.
The diagram below illustrates the memory usage of devices with "RAM: 2GB, Device Model: iPhone 11, System Version: iOS 13" at the moment before being forcibly terminated by the system. The system OOM threshold under these parameters is 1449MB.
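Putting the two pieces together, classifying a suspicious termination reduces to comparing the last recorded footprint with the threshold for that device. The sketch below is a minimal illustration under stated assumptions: the table simply echoes the example thresholds quoted above, and the tolerance margin is an assumed parameter, not the platform's actual value.

```swift
// Classify a termination as OOM by comparing the last sampled footprint
// with the statistically derived threshold for the device configuration.
// Threshold values below echo the examples quoted in the text.
struct DeviceKey: Hashable {
    let ramGB: Int
    let model: String
    let osMajorVersion: Int
}

let oomThresholdMB: [DeviceKey: Double] = [
    DeviceKey(ramGB: 3, model: "iPhone X",  osMajorVersion: 12): 1800,
    DeviceKey(ramGB: 3, model: "iPhone X",  osMajorVersion: 13): 1849,
    DeviceKey(ramGB: 3, model: "iPhone 11", osMajorVersion: 13): 2098,
]

/// Returns true if the last footprint sample reached (within an assumed
/// margin) the OOM threshold for this device, i.e. the termination is
/// judged to be an OOM rather than some other forced kill.
func isLikelyOOM(lastFootprintMB: Double,
                 device: DeviceKey,
                 marginMB: Double = 50) -> Bool {
    guard let threshold = oomThresholdMB[device] else { return false }
    return lastFootprintMB >= threshold - marginMB
}
```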
How can the reported data be better analyzed and utilized? Based on how much value they deliver, the analysis capabilities can be grouped into three levels:
Assisted Problem Location
Active Identification
Problem Resolution
Assisted problem location relies primarily on statistical analysis to reveal patterns in the data. This covers basic statistics such as top-problem rankings, new-problem alerts, version distribution, operation distribution, device-model distribution, and reporting trends. Advanced statistics include business-feature statistics built on custom data, statistics for specific scenarios, and recommendations of the most common sequences in sequential data. However, these remain auxiliary aids and have not yet reached the level of active identification or solution recommendation.
This platform features an automated, rule-based problem identification function, which allows developers' experience to be accumulated on the platform in the form of rules. These rules drive automatic problem identification, automatic defect report creation (integrated with the defect management system), automatic alerting, and solution recommendation, closing the loop from crash capture and reporting through analysis and resolution. It is currently applied to high-priority projects within the company; for one top project, over 80% of reported crashes can be identified automatically by rules, significantly reducing labor costs and improving development efficiency.
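As a purely hypothetical illustration of what such a rule might contain, the sketch below matches a regular expression against symbolicated stack frames and carries a tag and a recommended solution; the field names and matching logic are illustrative and not the platform's actual rule format.

```swift
import Foundation

// Hypothetical rule shape: a pattern over stack frames plus the metadata
// needed to tag the report and surface a recommended solution.
struct IdentificationRule {
    let name: String          // e.g. "Android audio component crash"
    let stackPattern: String  // regex matched against symbolicated frames
    let tag: String           // label attached to matching crash reports
    let solutionURL: String?  // recommended fix shown to developers
}

// Returns every rule whose pattern matches at least one stack frame.
func matchRules(_ rules: [IdentificationRule],
                against stackTrace: [String]) -> [IdentificationRule] {
    rules.filter { rule in
        guard let regex = try? NSRegularExpression(pattern: rule.stackPattern) else {
            return false
        }
        return stackTrace.contains { frame in
            regex.firstMatch(in: frame,
                             range: NSRange(frame.startIndex..., in: frame)) != nil
        }
    }
}
```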
From a product perspective, adoption of the automated identification function by project teams creates a positive feedback loop: the more teams use it, the richer the rule set becomes, which in turn benefits every user.
The specific operation flow is as follows:
Firstly, problem identification rules are divided into "general rules" and "custom rules." The platform summarizes an initial set of general rules, such as Android audio component issues and Apple GPU problems, so that project teams benefit immediately upon adoption and become familiar with the automatic identification workflow.
Secondly, project teams add project-specific problem rules.
Thirdly, project-specific rules that prove broadly applicable, such as those for memory allocation issues, can be promoted into platform-level general rules, strengthening the shared rule set.
This makes new projects more willing to adopt the function, sustaining a positive feedback loop that continuously improves it.
The purpose of platforms and tools is, on the one hand, to provide problem-solving capabilities and, on the other, to improve the efficiency of information flow and management.
Once crash issues can be automatically identified, the subsequent step is to automatically create defect reports and send alerts, necessitating integration into the entire internal R&D process.
Projects will always have unique needs, such as phased data analysis, quality reports, and automated integration. For these long-tail, non-standard needs, the platform offers a flexible API, as sketched below.
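The snippet below is a hypothetical example of pulling aggregated crash statistics through such an API for a custom quality report; the endpoint, parameters, and authentication scheme are placeholders, not the platform's real interface.

```swift
import Foundation

// Hypothetical open-API call: fetch a daily crash summary for a project so
// it can be folded into a custom quality report or phased analysis.
// Endpoint, query parameters, and auth header are placeholders only.
func fetchCrashSummary(projectID: String,
                       day: String,
                       completion: @escaping (Data?) -> Void) {
    var components = URLComponents(string: "https://example.com/openapi/crash/summary")!
    components.queryItems = [
        URLQueryItem(name: "project_id", value: projectID),
        URLQueryItem(name: "date", value: day),
    ]
    var request = URLRequest(url: components.url!)
    request.setValue("Bearer <token>", forHTTPHeaderField: "Authorization")
    URLSession.shared.dataTask(with: request) { data, _, _ in
        completion(data)  // e.g. decode JSON and feed a dashboard or report
    }.resume()
}
```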
Looking forward, the platform aims to keep improving in professionalism and efficiency, while staying focused on real needs, continuously enhancing product capabilities, and refining usability.
Are you curious about how CrashSight can elevate your game testing experience? Or perhaps you'd like to dive deeper into our other cutting-edge testing strategies? Either way, we'd love to hear from you. Our expert team is here to connect and provide you with the guidance and support you need to ensure your game testing is both efficient and precise.
Book a Meeting with us!
Furthermore, we cordially invite you to try out Tencent's UDT platform, a cloud-based solution that grants you remote access to devices and seamlessly integrates with your local test devices, thereby broadening your testing horizons. We firmly believe that UDT can bring unmatched convenience and efficiency to your game testing endeavors.
WeTest, with over a decade of experience in quality management, is an integrated quality cloud platform dedicated to establishing global quality standards and enhancing product quality. As a member of the IEEE-approved Global Game Quality Assurance Working Group, it is recognized for its commitment to quality assurance. WeTest has served over 10,000 enterprise clients across 140+ countries.
Focusing on advanced testing tool development, WeTest integrates AI technology into professional game testing tools such as PerfDog, CrashSight, and UDT (Next-Gen Multi Terminal Unified Access Management Automated Testing Platform), helping over a million developers worldwide boost efficiency. WeTest also offers comprehensive testing service solutions for mobile, PC, and console games, covering compatibility, security, functionality, localization testing, and other services, ensuring product quality for over one thousand game companies globally.
Give it a try for free today. Register Now!