Crash capture and analysis is not a new topic. Since the advent of software, crashes have been one of the most serious problems, and the crash rate is an important indicator of software quality. Solving crash problems is a pressing need. From the PC era to the mobile era, there have been many tools for crash capture and analysis. So, what new problems and challenges does Tencent Games' new crash analysis system solve?
To sum up, there are three points:
1. The demand for in-depth support in the vertical domains of game engines
2. The emerging trend of multi-platform game releases has led to the need for a unified platform to monitor crashes across multiple devices
3. Overseas projects need to meet overseas data compliance requirements
Based on these emerging trends and new demands in the industry, we developed a new crash management platform that now covers almost all of Tencent Games' projects worldwide. While the background and specific requirements are rooted in the gaming industry, this article focuses on a general thinking framework and problem-solving approach, and we hope it sparks discussion with peers across the industry.
On the topic of monitoring breadth, let's start with the overall thinking behind it, and then use the monitoring of FOOM (Foreground Out of Memory) issues as an example for a more detailed analysis.
Basic requirements for crash monitoring:
1. Capture all crash occurrences comprehensively.
Typical scenario: Why can't the platform find any information when users report program crashes?
2. Report all critical information during a crash.
Typical scenario: There are crash records, but insufficient information for problem localization.
Tencent WeTest has made optimizations and improvements in all the points mentioned above. For example, in the case of "stack information," apart from enhancing the stability of stack restoration, we also support the restoration of inline functions and numerical values of function parameters, aiding developers in more precise problem localization. Additionally, for "custom data" support, automatic aggregation and analysis of custom data have been implemented to improve problem localization and reproduction efficiency. Here, we will primarily focus on sharing specific practices in FOOM monitoring.
FOOM incidents produce no direct signal from the system, so monitoring must rely on indirect approaches. A widely adopted industry solution is based on Facebook's 2015 engineering article "Reducing FOOMs in the Facebook iOS app," which proposes a process of elimination: rule out every termination with a known cause, and classify whatever remains as a FOOM. A notable weakness of this approach is its relatively high false-positive rate, because many terminations that are not FOOM-related survive the simple exclusions. Vendors in China have built on this foundation to refine the methodology and reduce false positives, but the fundamental exclusion-based strategy remains unchanged.
In practice, however, many other crash types remain after the simple exclusions, some even harder to identify than OOM: heavy CPU scheduling, thread deadlocks, startup timeouts, excessive resource usage, runtime permission changes, the disappearance of program dependencies, and more than a dozen other scenarios in which the system terminates the process. Instead of relying on exclusion, it can be more efficient and accurate to take the reverse approach: determine OOM directly by checking whether memory usage reached the system's OOM threshold.
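The exclusion-based strategy described above can be sketched as a simple decision chain. This is an illustration of the general technique, not CrashSight's or Facebook's actual implementation; all field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SessionEnd:
    """Facts recorded about how the previous app session ended.
    Field names are illustrative, not any product's real schema."""
    app_upgraded: bool       # app version changed since last launch
    os_upgraded: bool        # OS version changed since last launch
    crashed: bool            # a crash report was captured
    exited_normally: bool    # the app exited cleanly or was quit by the user
    was_in_background: bool  # the process was in the background when it died

def classify_by_exclusion(s: SessionEnd) -> str:
    """Exclusion-based FOOM detection: any termination that cannot be
    explained by a known cause is labeled FOOM. As noted above, this
    over-reports, because deadlocks, watchdog kills, and other scenarios
    also survive the exclusions."""
    if s.app_upgraded or s.os_upgraded:
        return "upgrade"
    if s.crashed:
        return "crash"
    if s.exited_normally:
        return "normal-exit"
    if s.was_in_background:
        return "BOOM"   # background out-of-memory kill
    return "FOOM"       # foreground OOM, by elimination
```

The weakness is visible in the final `return`: everything that slips past the earlier checks is blamed on memory, whether or not memory was actually the cause.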
WeTest CrashSight innovatively employs big data statistics to directly calculate the OOM threshold lines for various combinations of memory sizes, device models, and system versions. With this approach, by knowing the memory usage state in the final moments before termination, one can determine if it was due to an Out-Of-Memory (OOM) event.
Taking iOS as an example, the system's OOM threshold lines correlate with memory size, device model, and system version. Combining these factors yields a large number of parameter combinations.
Through big-data analytics, the memory usage of all devices terminated under a given parameter combination, measured moments before termination, is plotted as a curve. The curve exhibits a clear upper bound that no device's memory usage exceeds: this bound is the OOM threshold for that parameter combination, and devices whose usage reaches it are terminated by the system.
The illustration below depicts the memory usage of machines under the parameters "RAM: 2GB, Device: iPhone 11, iOS Version: iOS 13" moments before system termination. The OOM threshold for this parameter combination is determined to be 1449MB.
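The threshold-mining and classification steps above can be sketched as follows. This is a minimal illustration of the statistical idea, assuming a hypothetical report format; it is not CrashSight's actual pipeline, and the field names and the tolerance value are invented for the example.

```python
from collections import defaultdict

def oom_thresholds(termination_reports):
    """Estimate the per-(RAM, model, OS) OOM threshold as the maximum
    memory footprint observed just before termination across all
    reports for that combination."""
    thresholds = defaultdict(float)
    for r in termination_reports:
        key = (r["ram"], r["model"], r["os"])
        thresholds[key] = max(thresholds[key], r["last_memory_mb"])
    return dict(thresholds)

def is_oom(report, thresholds, tolerance_mb=50):
    """Judge a termination as OOM if the last-known memory footprint
    reached the threshold line for its device combination, within a
    small tolerance for sampling jitter."""
    key = (report["ram"], report["model"], report["os"])
    limit = thresholds.get(key)
    return limit is not None and report["last_memory_mb"] >= limit - tolerance_mb
```

With thresholds mined offline from the full population of termination reports, each new termination can be classified directly from its final memory sample, with no exclusion chain involved.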
How can the reported data be better analyzed and utilized? Based on impact and effectiveness, the capabilities can be categorized into three levels:
Assistive localization relies primarily on statistical analysis to reveal patterns in the data. Foundational statistics include top-issue statistics, new-issue prompts, version distribution, operation distribution, device-model distribution, reporting trends, and so on. Advanced statistics cover custom-data-driven business feature analysis, specific-scenario analytics, recommending the longest common subsequence from sequential data, and more. These methods remain assistive, however: they neither proactively identify problems nor recommend solutions.
WeTest has implemented a rule-based automated problem recognition feature, which captures developers' accumulated experience as rules within the platform. These rules enable automatic recognition of issues, which in turn drives automated generation of defect reports (integrated with the defect management system), automatic alerts, and solution recommendations, covering the entire process of crash capture, reporting, analysis, and resolution. This has been applied successfully to major projects within the company; in one flagship project, over 80% of reported crashes can be automatically identified by rules, significantly reducing manual effort and improving development efficiency.
The viability of integrating the automated recognition feature within project teams stems from the establishment of a positive feedback loop, allowing the functionality to continuously evolve and users to derive ongoing benefits.
The specific operational workflow is illustrated as follows:
1. Initial Formulation of Recognition Rules: Recognition rules are categorized into "Generic Rules" and "Custom Rules." The platform initially compiles generic rules, such as Android audio component issues or Apple GPU problems, to provide immediate assistance to project teams and familiarize them with the automated recognition process.
2. Customized Rule Incorporation: Project teams autonomously introduce rules tailored to project-specific issues.
3. Extraction of Platform-Level Rules from Project-Level Issues: Rules originating from project-level problems can also contribute to the creation of platform-level generic rules, such as memory allocation issues. This, in turn, reinforces the capability of generating generic problem rules.
By incorporating these steps, the willingness of new projects to adopt this functionality is further enhanced, thereby establishing a positive feedback loop that continually bolsters the overall capability of the feature.
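The rule-matching core of the workflow above can be sketched as a pattern match over crash stacks. The rule schema, rule names, and patterns here are all hypothetical, invented for illustration; CrashSight's actual rule engine and actions are not public.

```python
import re

# Illustrative rule format (not CrashSight's actual schema): each rule
# pairs a stack-trace pattern with an issue label and a follow-up action.
# "Generic rules" ship with the platform; "custom rules" would be added
# by project teams in the same format.
GENERIC_RULES = [
    {"label": "android-audio-component",
     "pattern": r"AudioTrack|libOpenSLES",
     "action": "file-defect"},
    {"label": "apple-gpu",
     "pattern": r"AGXMetal|gpuRestart",
     "action": "alert"},
]

def recognize(stack_trace: str, rules=GENERIC_RULES):
    """Return every rule whose pattern appears in the crash stack.
    Each hit can then trigger automated defect filing, alerting,
    or a recommended solution."""
    return [r for r in rules if re.search(r["pattern"], stack_trace)]
```

A recurring project-specific crash signature would enter the loop as a custom rule; once a signature proves common across projects, it can be promoted to the generic rule set, matching the feedback loop described above.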
Our platform and tools serve a dual purpose. They not only empower issue resolution but also optimize the efficiency of information flow and management.
Upon achieving automated crash issue recognition, the next natural step is to automate defect report generation and alerts. This integration seamlessly aligns with the holistic internal R&D process.
Looking ahead, WeTest will continue to enhance its product expertise and efficiency, remaining committed to addressing practical needs, refining product capabilities, and polishing user-friendliness.
At present, CrashSight and PerfSight have been adopted by several popular products and have helped hundreds of companies with performance analysis and management of their game products.
WeTest CrashSight and PerfSight are now open for free trial applications. Companies whose applications are approved will receive a cumulative monthly activity quota of 12,000.
Contact us with the relevant information to apply for the trial.