
A comparative analysis of human and AI performance in forensic …


Data set

A total of 58 participants (33 women and 25 men) were recruited from the UC Berkeley campus and photographed in two settings: (1) a studio setting with a fixed white background and artificial lighting with a tripod-mounted DSLR camera (4000 \(\times\) 6000 pixels); and (2) an in-the-wild setting emulating a CCTV-like scene in which a narrow corridor was photographed by a ceiling-mounted GoPro camera (5184 \(\times\) 3888 pixels). Each participant was assigned an anonymized identifier and photographed in the studio setting in eight neutral poses, Fig. 1a, six dynamic poses, Fig. 1b, and one neutral pose while standing next to a reference object (the same stool was used for all participants), Fig. 1c. Each participant was photographed in the wild in two static, Fig. 1d, and three dynamic poses. This process yielded a total of 812 no-reference studio images, 58 reference studio images, and 290 in-the-wild images.

Figure 1

Representative examples (published with permission of the participant) of the calibrated data set consisting of (a) no-reference studio (neutral poses); (b) no-reference studio (non-neutral poses); (c) reference studio; and (d) in-the-wild. Shown in panel (e) is a representative example of 3D model fitting to the image in panel (b).

Each participant’s height and weight were measured and recorded alongside their anonymized identifier. The collected female/male heights are normally distributed with a mean of 161.1/176.1 cm and a standard deviation of 5.3/8.3 cm; the average US adult female/male height is 161/175 cm with a standard deviation of 7.0/7.4 cm13. The collected female/male weights have a mean of 60.9/78.4 kg and a standard deviation of 11.4/12.9 kg; the average US adult female/male weight is 78.7/90.8 kg with a standard deviation of 19.7/19.8 kg13. While our participants’ heights closely follow the national average, our participants weigh approximately \(20\%\) less than the national average, and their weights are less variable (presumably because they were drawn primarily from a University student population). Each participant was paid \(\$20\).

AI

Recent advances in machine learning and computer vision have made it possible to estimate body shape and pose from a single image16. We previously extended this system to yield state-of-the-art body shape and pose estimation18,19. Here we briefly describe this approach.

A full-body, 3D model is fit to an image of a person using an augmented version of SMPLify-X16. The original SMPLify-X extracts 2D keypoints from the body and face, from which a 3D model is automatically fit. Although this model can accurately capture complex body poses, it does not incorporate body shape. This is because the model fitting relies only on the extracted 2D skeletal keypoints and does not consider the body shape depicted in the image. An augmented version19 of this model incorporates into the 3D modeling an additional parameter that captures the overall body shape, yielding more accurate estimates of body shape and size, Fig. 1e.
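To make the role of the 2D keypoints concrete, the following sketch illustrates the kind of confidence-weighted reprojection objective minimized when fitting a 3D body model to detected keypoints. All names and data are illustrative placeholders (the actual SMPLify-X objective also includes pose, shape, and interpenetration priors); this is not the code used in this study.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection of Nx3 camera-frame points (m) to Nx2 pixel coordinates."""
    xy = points_3d[:, :2] / points_3d[:, 2:3]
    return focal * xy + center

def keypoint_loss(model_joints_3d, detected_2d, confidence, focal, center):
    """Confidence-weighted squared reprojection error over the body keypoints."""
    residual = project(model_joints_3d, focal, center) - detected_2d
    return np.sum(confidence[:, None] * residual ** 2)

# Toy example: 25 joints roughly 3 m from a camera with a 5000-pixel focal length.
rng = np.random.default_rng(0)
joints_3d = rng.normal([0.0, 0.0, 3.0], 0.3, size=(25, 3))
center = np.array([2000.0, 3000.0])
detected_2d = project(joints_3d, 5000.0, center) + rng.normal(0.0, 2.0, (25, 2))
confidence = rng.uniform(0.5, 1.0, 25)
print(keypoint_loss(joints_3d, detected_2d, confidence, 5000.0, center))
```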

Although the 3D body model is estimated in real-world units, this metric reconstruction is highly inaccurate18, even while the overall body pose and shape are well estimated. We, therefore, adopt a different approach that scales the estimated 3D model based on a gender-specific average inter-pupillary distance (IPD). The IPD is relatively consistent, with an average adult IPD for women/men of 6.17/6.40 cm with a standard deviation of 0.36/0.34 cm20. Because our 3D models do not have pupils, the pupil center is specified as the midway point between the left and right corners of the eye.
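The sketch below illustrates this IPD-based scaling under the assumptions just described; the eye-corner coordinates and vertex array are placeholders for values that would come from the fitted 3D model, and the average IPDs are the gender-specific values cited above.

```python
import numpy as np

# Gender-specific average inter-pupillary distances (cm) cited in the text.
AVG_IPD_CM = {"female": 6.17, "male": 6.40}

def ipd_scale_factor(left_inner, left_outer, right_inner, right_outer, gender):
    """Scale factor mapping the model's pupil-to-pupil distance to the average IPD.

    Each pupil center is approximated as the midpoint between that eye's two
    corners, since the 3D model has no explicit pupils. Eye-corner coordinates
    are assumed to be in the model's nominal units of meters.
    """
    left_pupil = 0.5 * (np.asarray(left_inner, float) + np.asarray(left_outer, float))
    right_pupil = 0.5 * (np.asarray(right_inner, float) + np.asarray(right_outer, float))
    model_ipd_m = np.linalg.norm(left_pupil - right_pupil)
    return (AVG_IPD_CM[gender] / 100.0) / model_ipd_m

# Toy usage: rescale all mesh vertices so the model's IPD matches the average IPD.
# The eye-corner coordinates and vertex array are placeholders, not real model output.
scale = ipd_scale_factor([-0.015, 0, 0], [-0.045, 0, 0], [0.045, 0, 0], [0.015, 0, 0], "female")
vertices = np.random.default_rng(1).normal(size=(10475, 3))  # placeholder mesh vertices
vertices_scaled = scale * vertices
print(round(scale, 3))  # -> 1.028
```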

Once scaled, the 3D model is reposed into a neutral, upright pose, from which the person’s height is measured as the distance from the top of the head to a plane formed by three points on the bottom of the feet. The person’s weight is measured as the volume of the 3D model, converted to kilograms by multiplying by 1023 kg/m\(^3\), corresponding to a gender-agnostic average body fat of \(34\%\)21.
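A minimal sketch of these two measurements follows, assuming a watertight triangle mesh in meters with consistently oriented faces; the top-of-head and foot points are placeholders for vertices that would be selected from the reposed 3D model.

```python
import numpy as np

BODY_DENSITY_KG_M3 = 1023.0  # average body density from the text (34% body fat)

def height_from_mesh(head_top, foot_points):
    """Distance (m) from the top-of-head point to the plane through three foot points."""
    p0, p1, p2 = (np.asarray(p, float) for p in foot_points)
    normal = np.cross(p1 - p0, p2 - p0)
    normal /= np.linalg.norm(normal)
    return abs(np.dot(np.asarray(head_top, float) - p0, normal))

def weight_from_mesh(vertices, faces):
    """Weight (kg) from the volume of a watertight, consistently wound triangle mesh (m).

    The volume is the sum of signed tetrahedron volumes formed by each face and
    the origin (divergence theorem); weight = volume * density.
    """
    tri = np.asarray(vertices, float)[np.asarray(faces)]
    signed = np.einsum("ij,ij->i", tri[:, 0], np.cross(tri[:, 1], tri[:, 2])) / 6.0
    return abs(signed.sum()) * BODY_DENSITY_KG_M3

# Toy checks with placeholder geometry (not real model output):
print(height_from_mesh([0.0, 0.0, 1.72], [[0.1, 0, 0], [-0.1, 0.05, 0], [0.0, -0.1, 0]]))  # ~1.72
tet_v = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]  # unit right tetrahedron, volume 1/6 m^3
tet_f = [[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]]
print(weight_from_mesh(tet_v, tet_f))                 # -> 170.5 kg
```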

Experts

We recruited 10, US-based, certified photogrammetrists (certification requires a minimum of between four and six years of experience depending on the governing body). Each expert was provided with a random subset of five in-the-wild images (each image depicted a different person) and asked to estimate the person’s height and weight (one expert declined to estimate weight). Each expert was provided with a schematic diagram of the scene with two real-world measurements consisting of the width of the back door into the hallway and the distance between the back door and the top of the stairs.

Non-experts

We recruited 325 participants from Amazon’s Mechanical Turk platform. Unlike the experts described in the previous section, who made height and weight estimates from only the in-the-wild images, our non-experts were tasked with making estimates from the no-reference studio images, the reference studio images, or the in-the-wild images. A representative subset of 290 (out of 812) no-reference studio images was partitioned into five non-overlapping sets of 58 images in which each photographed participant appeared only once. The 290 in-the-wild images were similarly partitioned into five non-overlapping sets of 58 images each. The 58 reference studio images were placed into a single set.

On entry into the study, each participant was assigned a random set from the above 11 possible subsets. Shown one image at a time, in random order, participants were asked to estimate the height and weight of the person depicted in the photo. Unlike the experts and AI, no additional information was provided to these non-experts.

Randomly interspersed within the 58 images were four catch trials consisting of stock photos clearly annotated with the subject’s height and weight. If a participant failed any of the catch trials, their entire set of responses was excluded. A total of 65 out of 325 participants failed the catch trials, and another 24 failed to complete the study, yielding a total of 236 valid responses. Participants were paid $5.00, but were not paid if they failed any of the catch trials. Each image was analyzed by an average of 22 non-experts.

Denoting the estimated height from non-expert j for image i as \(\tilde{h}_{i,j}\) and the true height as \(h_{i}\), the median individual accuracy is computed as \(\text{median}_j\left(|\tilde{h}_{i,j} - h_{i}|\right)\), and the median crowd accuracy is computed as \(\left|\text{median}_j(\tilde{h}_{i,j}) - h_{i}\right|\). The individual and crowd weight errors are computed in the same way. The median error across all images is reported in Table 1 in both absolute units (cm/kg) and as a percent of base height and weight. A median (as compared to a mean) is employed because responses within and across images are not normally distributed.
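For illustration, the following snippet computes the individual and crowd errors for a single image from hypothetical height estimates; the values reported in Table 1 are then the medians of these per-image errors across all images.

```python
import numpy as np

def individual_and_crowd_error(estimates, true_value):
    """Per-image errors: median of individual absolute errors, and error of the median."""
    estimates = np.asarray(estimates, float)
    individual = np.median(np.abs(estimates - true_value))  # median_j |h~_{i,j} - h_i|
    crowd = abs(np.median(estimates) - true_value)           # |median_j(h~_{i,j}) - h_i|
    return individual, crowd

# Hypothetical height estimates (cm) from five non-experts for one image.
print(individual_and_crowd_error([168.0, 172.0, 165.0, 180.0, 170.0], 171.0))  # -> (3.0, 1.0)
```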

Human subjects

All data collection was approved by the UC Berkeley Committee for Protection of Human Subjects (2022-01-14999). All participants provided informed consent prior to their participation, and data collection was performed in accordance with relevant guidelines and regulations.

Figure 2

The distribution of in-the-wild height/weight errors for AI, expert, non-expert crowd, non-expert individuals, and baseline. The open circles and horizontal error bars correspond to the median error and \(95\%\) confidence intervals. The two tables denote the pair-wise statistical significance at \(p<0.005\) (\(\star\)) or \(p<0.0005\) (\(\star \star\)) between different groups. See also Table 1.



