Abstract
Air pollution source tracing, technically known as the "source apportionment" problem, is an important prerequisite for urban atmospheric pollution prevention and control, especially in China, where pollution sources are highly heterogeneous across regions and over time. However, effective and reliable source tracing remains difficult because it requires pollution inventory profiles, which are costly and time-consuming to obtain, as well as domain expertise in data analysis and result interpretation. As a large amount of historical ambient air pollution data (especially for particulate matter, PM), comprising both chemical composition and the pollution control conclusions provided by human experts, has been accumulated over the past decade, we develop a data-driven, end-to-end Smart Pollution Source Tracing (SPST) model for fully automatic estimation of PM source contributions. The proposed model is based on the Gradient Tree Boosting algorithm, which trains an ensemble of regression trees to learn the nonlinear mapping between ambient pollution data and the corresponding pollution source contributions. The SPST model is trained and tested on both synthetic data previously used as a benchmark for other models and real atmospheric pollution data from five cities. Performance evaluation on the synthetic data shows a significant improvement of SPST over previous source tracing models. SPST also achieves high accuracy on the real data, showing its potential for practical application.
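To make the idea of learning a mapping from ambient composition to source contributions concrete, the following is a minimal sketch of gradient tree boosting with squared loss, written in plain Python. It is not the SPST implementation: the two source profiles, the three chemical species, the noise level, the use of depth-one trees (stumps), and all function names and hyperparameters are assumptions chosen only for illustration.

```python
import random

# Hypothetical source profiles (assumption): share of each of 3 species per source.
PROFILES = [
    [0.6, 0.3, 0.1],  # source 0
    [0.1, 0.2, 0.7],  # source 1
]

def make_samples(n, rng):
    """Synthetic ambient data: species concentrations = profile-weighted contributions + noise."""
    X, S = [], []
    for _ in range(n):
        s = [rng.uniform(0.0, 50.0) for _ in PROFILES]  # true source contributions
        x = [sum(s[k] * PROFILES[k][j] for k in range(len(PROFILES))) + rng.gauss(0.0, 0.5)
             for j in range(3)]
        X.append(x)
        S.append(s)
    return X, S

def fit_stump(X, r):
    """Depth-one regression tree: best single-feature split for residuals r (least squares)."""
    n, d = len(X), len(X[0])
    total = sum(r)
    best = None
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += r[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical feature values
            right_sum = total - left_sum
            # Maximizing this quantity is equivalent to minimizing the squared error.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (lo + hi), left_sum / k, right_sum / (n - k))
    if best is None:  # all features constant: fall back to predicting the mean residual
        mean = total / n
        return 0, float("inf"), mean, mean
    return best[1:]

def fit_boosting(X, y, n_trees=150, lr=0.2):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared loss
        j, thr, left_val, right_val = fit_stump(X, resid)
        stumps.append((j, thr, left_val, right_val))
        pred = [p + lr * (left_val if x[j] <= thr else right_val) for x, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lv if x[j] <= thr else rv) for j, thr, lv, rv in stumps)

rng = random.Random(0)
X_train, S_train = make_samples(200, rng)
X_test, S_test = make_samples(100, rng)

mae_model, mae_baseline = [], []
for k in range(len(PROFILES)):  # one boosted ensemble per source
    y_train = [s[k] for s in S_train]
    y_test = [s[k] for s in S_test]
    model = fit_boosting(X_train, y_train)
    mean_y = sum(y_train) / len(y_train)
    mae_model.append(sum(abs(predict(model, x) - y) for x, y in zip(X_test, y_test)) / len(y_test))
    mae_baseline.append(sum(abs(mean_y - y) for y in y_test) / len(y_test))
    print(f"source {k}: boosted MAE={mae_model[-1]:.2f}, mean-baseline MAE={mae_baseline[-1]:.2f}")
```

The sketch trains one ensemble per source and compares its held-out mean absolute error with a baseline that always predicts the training mean. A practical system would likely use deeper trees and a tuned library implementation, but the residual-fitting loop is the core of the technique.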