PI: A Vision-based Approach to Project Intelligence for Construction Site Monitoring Automation

Construction sites are generally large, and diverse activities take place on them concurrently. Timely, comprehensive awareness of activity states and resource allocation is critical to many project-level management tasks, including resource leveling, progress tracking, and productivity analysis. Despite its importance, activity tracking and resource counting are still mostly done manually in practice, relying on managers’ experience and diligence. As more and more large projects commence all over the world, it is nearly impossible for construction managers to grasp the situation of a project within a short time.

Vision-based methods for onsite monitoring are therefore needed, and they should have the following properties. First, the method should be built for far-field images, since surveillance cameras used on construction sites for continuous monitoring are usually mounted at a remote location to obtain an overall view. Unlike near-field images, which record detailed visual features of objects, far-field images have relatively low resolution and therefore pose technical challenges to algorithms that were developed for near-field images and videos. Additionally, methods that rely on depth information are of limited use because the working distance of range sensors is ill-suited to remotely mounted surveillance cameras. Second, the technique should be able to detect and analyze multiple concurrent construction activities in the field of view. To analyze productivity, it is beneficial if all the resources involved can be visually recognized and tracked. Algorithms based on handcrafted features were mostly found in case-specific studies, which focus on demonstrating or evaluating the feasibility of algorithms developed in computer vision. These efforts show the potential of computer vision for facilitating construction project management tasks. However, these studies are oriented toward specific applications and can hardly be extended to other types of operations. Third, the methods should be fully automatic so that they can handle the large data volume of continuous, real-time surveillance videos. Any manual intervention, e.g., defining specific working areas in the field of view or creating point cloud files, necessarily increases the cost of use and thus erodes the usefulness of the method.

We believe that our system, which combines vision recognition with machine learning, is an appropriate solution. It has the potential to improve construction productivity as well as safety performance.

In the last decade, a considerable amount of literature has been published on visual object detection and construction activity recognition. These studies have taken a significant step toward introducing computer vision technologies to time-consuming site monitoring tasks.

Previous studies on vision-based construction activity recognition are likely influenced by the object detection methods available to them. As a result, researchers have primarily focused on the limited types of activities conducted by objects that are easy to detect using handcrafted features. These methods can hardly be extended to analyze other activities. However, overall awareness of the states and issues of project-level tasks, e.g., resource leveling, progress tracking, and productivity analysis, requires information on diverse and concurrent activities. There is a need for a technique that can detect various objects in site images and recognize the construction activities relevant to them.

Our method is innovative in two respects. First, we introduced state-of-the-art deep learning technology to detect 22 classes of objects frequently observed in site images. To this end, we collected and annotated a training dataset to fine-tune the Faster R-CNN model and evaluated the performance of the model on a test dataset. We found that, on objects with clean boundaries and invariant forms, the deep learning model achieves consistently high average precisions (APs) in comparison with those published in the computer vision domain. However, for objects with free forms or ambiguous edges, such as raw rebar and formwork materials, the deep learning model achieves low APs.

Second, we developed a set of rules for creating relevance networks, together with 20 activity patterns, to recognize construction activities. We introduced semantic relevance and spatial relevance to build relevance networks. Semantic relevance represents the likelihood that two objects appear together in the same construction activity. Spatial relevance is defined in terms of 2D pixel distances in image coordinates and indicates how likely it is, from what can be observed, that the two objects are involved in the same activity. The relevance of two objects is then formulated as the product of their semantic relevance and spatial relevance. Furthermore, relevance networks can serve as a tool to identify latent group activities in site images.
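As a minimal sketch of the product rule described above, the following Python snippet builds a relevance network from detected bounding boxes. The semantic relevance table, the class names, the exponential decay with distance, the `scale` parameter, and the edge threshold are all illustrative assumptions; they are not the values or functional forms used in PI.

```python
import math
from itertools import combinations

# Hypothetical semantic relevance table (placeholder values, not from PI).
SEMANTIC = {
    frozenset(["worker", "rebar"]): 0.9,
    frozenset(["worker", "truck"]): 0.3,
}

def center(box):
    """Center of a bounding box given as (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)

def spatial_relevance(box_a, box_b, scale=200.0):
    """Assumed form: decays exponentially with the 2D pixel distance between centers."""
    (xa, ya), (xb, yb) = center(box_a), center(box_b)
    return math.exp(-math.hypot(xa - xb, ya - yb) / scale)

def semantic_relevance(cls_a, cls_b):
    """Look up the pairwise semantic relevance; unlisted pairs count as 0."""
    return SEMANTIC.get(frozenset([cls_a, cls_b]), 0.0)

def relevance_network(detections, threshold=0.2):
    """Build edges {(i, j): relevance} from a list of (class_name, box) detections.

    Relevance is the product of semantic and spatial relevance; edges below
    the (assumed) threshold are dropped.
    """
    edges = {}
    for (i, (ca, ba)), (j, (cb, bb)) in combinations(enumerate(detections), 2):
        r = semantic_relevance(ca, cb) * spatial_relevance(ba, bb)
        if r >= threshold:
            edges[(i, j)] = r
    return edges

detections = [
    ("worker", (0, 0, 100, 100)),
    ("rebar", (50, 0, 150, 100)),
    ("truck", (1000, 1000, 1200, 1200)),
]
edges = relevance_network(detections)
```

In this example only the worker and the nearby rebar are linked: the truck is far away in the image, and the table gives the rebar-truck pair no semantic relevance.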

Based on the method described above, we developed our system PI, short for Project Intelligence. It is an intelligent system that automatically recognizes diverse construction resources and activities in site images and frees construction managers from manual data collection. By detecting frequently observed objects, establishing relevance networks among them, and recognizing activities by pattern matching, the system can generate reports that help managers better understand their construction projects.

Datasets are critical to training and testing deep neural networks. At this research stage, we focus on analyzing images taken at the foundation and structure construction stages of building projects, covering a total of 22 frequently observed classes of objects. We developed an object ontology with a three-tier tree structure (categories, subcategories, and classes). First, we group the objects into four categories: workers, materials/products, equipment, and general vehicles. The second tier consists of the subcategories under each category. For example, the category of materials/products is further divided into four subcategories: concrete-related, formwork-related, rebar-related, and scaffolding-related. The last tier is composed of the classes under the subcategories. For instance, there are two classes in concrete-related materials/products: concrete in placing and concrete in finishing. Workers and general vehicles are exceptions: they are not further divided into subcategories, since we do not recognize their trades or activities based directly on these features.
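The three-tier structure can be sketched as a nested mapping in Python. Only the entries named in the text are filled in; the remaining classes and the equipment subcategories are left empty rather than guessed. The use of `None` as the subcategory key for undivided categories and the helper `locate` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# category -> subcategory -> classes. A subcategory key of None marks a
# category that is not subdivided (workers, general vehicles).
Ontology = Dict[str, Dict[Optional[str], List[str]]]

ONTOLOGY: Ontology = {
    "workers": {None: []},
    "materials/products": {
        "concrete-related": ["concrete in placing", "concrete in finishing"],
        "formwork-related": [],     # classes not listed in this text
        "rebar-related": [],        # classes not listed in this text
        "scaffolding-related": [],  # classes not listed in this text
    },
    "equipment": {},                # subcategories not listed in this text
    "general vehicles": {None: []},
}

def locate(class_name: str, ontology: Ontology = ONTOLOGY) -> Optional[Tuple[str, Optional[str]]]:
    """Return (category, subcategory) for a class, or None if it is not listed."""
    for category, subcategories in ontology.items():
        for subcategory, classes in subcategories.items():
            if class_name in classes:
                return category, subcategory
    return None

print(locate("concrete in placing"))  # ('materials/products', 'concrete-related')
```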

Finally, we collected a total of 7790 images and manually annotated them in PASCAL VOC format. We used 6232 images (80%) as the training set and 1558 images (20%) as the test set, randomly selecting one in every five images from the full dataset for the latter.
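One way to realize such a one-in-five split is sketched below in Python. Shuffling the image list and then taking every fifth item is only one possible reading of the selection procedure; the function name and the fixed seed are assumptions.

```python
import random

def split_one_in_five(image_ids, seed=0):
    """Shuffle the images, then send every fifth one to the test set (20% hold-out)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed (assumed) for reproducibility
    test = ids[4::5]
    train = [x for i, x in enumerate(ids) if i % 5 != 4]
    return train, test

train, test = split_one_in_five(range(7790))
print(len(train), len(test))  # 6232 1558
```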

We conducted two experiments in this study to evaluate the object detection performance of Faster R-CNN and the activity recognition performance of the proposed method respectively. Two basic metrics, which are popular in evaluating detection algorithms, were used in our experiments. We define the number of correct detections as TP (true positive), the number of wrong detections as FP (false positive), and the number of missed objects or activities as FN (false negative). Given the three definitions, precision is the first metric, which is the ratio of TP to TP + FP, and recall is the second one, which is the ratio of TP to TP + FN. The two metrics were used directly in activity recognition evaluation.
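The two metrics can be written as a short Python sketch. The function names and the convention of returning 0.0 when the denominator is zero are assumptions; the example uses the TP, FP, and FN counts reported for the activity recognition experiment in this document.

```python
def precision(tp, fp):
    """Ratio of correct detections to all detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Ratio of correct detections to all objects or activities actually present."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Counts from the activity recognition experiment: 151 TP, 91 FP, 22 FN.
p = precision(151, 91)
r = recall(151, 22)
print(round(100 * p, 1), round(100 * r, 1))  # 62.4 87.3
```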

Following the requirements of the PASCAL VOC object detection challenges, our object detection task is judged by precision-recall curves, each obtained by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r. The average precision (AP) of a specific class is then computed as the area under its curve, and the mean AP (MAP) is defined as the mean of the APs of all classes.
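The AP computation can be sketched in plain Python as follows. This follows the area-under-curve convention of the PASCAL VOC development kit, where the interpolated precision at a recall level is the maximum precision at that recall or any higher one. The function names and the toy input values are illustrative assumptions, not results from our experiments.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve of one class.

    recalls must be sorted in non-decreasing order, and precisions[i] is the
    precision measured at recalls[i].
    """
    # Sentinel points so that the curve spans recall 0 to 1.
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Interpolate from right to left: each precision becomes the maximum
    # precision found at the same or any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-class APs, given as {class_name: AP}."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap)  # 0.75
```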

The MAP of object detection of our method is 67.3%, which is slightly higher than the 67.0% of Faster R-CNN + VGG-16 trained and tested with the PASCAL VOC 2012 dataset and lower than the 75.9% of Faster R-CNN + VGG-16 trained and tested with the COCO, PASCAL VOC 2007, and PASCAL VOC 2012 datasets. Our method is thus in line with the state of the art in object detection regarding MAP and the best APs, but it shows a relatively large variance in AP across classes.

This study focuses on site images, in which some objects cannot be effectively identified even by human experts. The preliminary evaluation of activity recognition performance was conducted in four steps. First, we randomly selected 200 images from those that we took at building projects in Hong Kong. We believe that this choice helps increase the external validity of the experiment. Second, we manually annotated and counted activities as explained in the previous three cases. In this process, activity entities and group activities were ignored. Third, we used the method to recognize activities. Finally, we evaluated the performance of the method in terms of recall and precision. The experiment resulted in 62.4% precision and 87.3% recall (151 TP detections, 91 FP detections, and 22 FN detections), which indicates that the proposed method holds potential for recognizing construction activities and that there is still room for improvement.

Our system is designed to meet several practical expectations, i.e., using site images, detecting and analyzing multiple concurrent construction activities, and being fully automatic. It can therefore save people valuable time in data collection and manipulation for onsite monitoring and let them concentrate on solving problems that genuinely demand their expertise.

More specifically, this system can support several potential applications. First, the system can be used to index and classify daily site images, which are usually taken for various management purposes, e.g., quality control, safety management, and progress recording, but without textual description. Automatically indexing and classifying these images should make them easier to retrieve. Second, since surveillance videos can be decomposed into time-lapse images, the method can be used to continuously monitor the construction resources involved in specific activities in terms of working hours. Third, given site videos, it is possible to detect the state of an activity (i.e., not started, just started, ongoing, or completed) and thus to establish, in real time, the deviation of activity progress from the construction program.