Facial Image Recognition Guide
Image Quality
Understanding what constitutes a quality image and how to optimize the images used in 3VR systems is critical to developing viable solutions. Meeting the specific requirements for 3VR facial surveillance is much more challenging than in a traditional CCTV deployment. Because traditional deployments do not impose these requirements, many partners and users are unaccustomed to considering these issues and can easily overlook them.
By understanding and optimizing image quality, you will be able to:
Better qualify which opportunities are strong fits for 3VR
Set the appropriate level of expectations with partners and users
Design a system that accommodates real world conditions
Deliver a solution that is optimally effective
Imaging Background
Overview
Two aspects of imaging are most important in understanding and optimizing 3VR facial surveillance:
Resolution
Field of View
Without understanding the impact of these two aspects, it will not be possible to master use of 3VR for either importing images or capturing video. As such, prior to analyzing facial surveillance or its application to 3VR, these aspects shall be explained and relevant introductory material shall be presented.
Key Elements in Optimizing Images for Facial Surveillance
This report will examine and explain the many elements that are critical to using 3VR for facial surveillance. While fundamental principles apply to images from photographic cameras as well as CCTV cameras, the application will differ. The following summarizes the key elements so that the reader may have a quick reference for future use.
Any image, whether video or photo, requires sufficient detail. Detail is determined by
the level of resolution in the image
the size of the person’s face relative to the size of the image
The person’s face must look directly towards the camera. This affects both the class of imported photos that are acceptable and how video cameras must be positioned to capture faces.
Imported images have to meet the requirements listed above. With the exception of mugshots and passport photos, most photos do not meet these requirements.
Video cameras must be positioned specifically to capture faces. The cost may be low, but the skill required is not trivial. Care must be taken to position cameras precisely so that they capture faces consistently.
Resolution
Resolution, as applied to images used in the 3VR system, is defined as the level of visual detail in the image.
Resolution is commonly defined in two dimensions: horizontal and vertical. For instance, a frequent resolution level cited is 640 x 480 pixels. This means that there are 640 unique pixels across the image horizontally and 480 pixels down the image vertically.
The number of horizontal pixels is important for performing 3VR facial surveillance because it determines the amount of detail available for performing facial analysis. Sufficient horizontal pixels are required to perform facial recognition.
One standard metric used in describing video surveillance images is the Common Intermediate Format (CIF). It provides a quick way to cite resolution levels that are commonly used in digital video. Each CIF level specifies a horizontal and a vertical resolution.
The following table provides examples of different CIF levels.
CIF Level | Resolution | Example |
---|---|---|
Quarter CIF | 176 x 144 | |
CIF | 352 x 288 | Typical Internet streaming |
2CIF | 704 x 240 | |
4CIF | 704 x 576 | NTSC camera max resolution |
16CIF | 1408 x 1152 | 1.5 Megapixel camera |
Industry participants commonly cite different CIF levels. These CIF levels have a significant impact on whether imported images or captured video can be used for facial analysis.
For example, an image of a person recorded at 4CIF resolution may be usable by the 3VR system for facial analysis. The same image recorded at CIF resolution may not contain enough detail unless the face is extremely large (at least a quarter of the width of the field of view). See “Camera Placement for Facial Analysis” below for more information and examples on field of view for facial analysis.
Field of View
Field of View is the size of an area that can be seen while looking through an imager such as a photographic or CCTV camera. Below are two examples of images with different horizontal fields of view.
The difference in the field of view is not very large: the one on the left is only 2 feet wider than the one on the right. However, as shown later, this seemingly minor difference has a significant effect on the suitability of the images for facial analysis.
An image does not have a singular width for the field of view. The width of the field of view increases as the subject moves from the foreground of the image to the background.
The two major visual differences are:
A subject in the foreground is in significantly clearer focus than the background.
The width of the field of view is significantly smaller in the foreground than in the background.
The variation in the width of the field of view is a product of projecting a three-dimensional world onto a two-dimensional medium (the image). All images exhibit this property.
This is very important because good facial analysis requires a narrow field of view (less than 4.5 feet at the point at which the subject is in best focus). A subject in the foreground of an image may meet the field of view requirements for facial analysis, yet the same subject will be too small a step or two further from the camera, where the field of view for the same camera placement is significantly wider. See “Camera Placement for Facial Analysis” below for more information and examples on field of view for facial analysis.
Camera Placement for Facial Analysis - 4 Factors
Field of View Determines Size and Resolution of Face
Facial analysis requires a certain minimum resolution level to be effective. This resolution level is measured in pixels.
3VR requires a minimum of 35 horizontal pixels between the eyes (or about 80 - 100 horizontal pixels across the head) to perform facial analysis
3VR performs analysis of all analog NTSC video at 4CIF (704 x 576 pixels)
Given these facts, and given that the average width between the eyes is 3” and the width of a head is approximately 6 - 7”, an NTSC camera at 4CIF resolution can capture faces in a field of view (FOV) no wider than about 4.5 feet.
Place a person standing in the foreground in optimal focus. When they hold their arms stretched out from side-to-side, you should not be able to see their hands (the image should be cut off at their wrists). If you can see their hands in the image when they are standing in focus in the foreground, the field of view is too wide for facial analysis.
If you are looking at prerecorded images from a camera already placed, measure the width of the head, and if it is smaller than 1/7 (about 15%) of the field of view, the field of view is too wide for facial analysis.
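The arithmetic behind the 4.5-foot limit can be sketched in a few lines of Python. This is an illustrative calculation only, not part of the 3VR product: the function name is hypothetical, and it assumes pixels are spread evenly across the field of view at the subject's distance. The constants are the figures given above (704 horizontal pixels at 4CIF, 3 inches between the eyes, a 35-pixel minimum).

```python
EYE_SPACING_IN = 3.0   # average distance between the eyes, from the text
MIN_EYE_PIXELS = 35    # minimum pixels between the eyes, from the text


def pixels_between_eyes(horizontal_pixels: int, fov_width_ft: float) -> float:
    """Estimate pixels between the eyes for a subject standing where the
    field of view is fov_width_ft wide (assumes evenly spread pixels)."""
    pixels_per_inch = horizontal_pixels / (fov_width_ft * 12.0)
    return pixels_per_inch * EYE_SPACING_IN


# Compare three field-of-view widths at 4CIF (704 horizontal pixels).
for fov_ft in (3.5, 4.5, 5.5):
    px = pixels_between_eyes(704, fov_ft)
    verdict = "ok" if px >= MIN_EYE_PIXELS else "too wide"
    print(f"{fov_ft} ft FOV at 4CIF: {px:.1f} px between eyes ({verdict})")
```

Under these assumptions a 4.5-foot field of view yields roughly 39 pixels between the eyes, a little above the minimum, while a 5.5-foot field of view falls to about 32 pixels and fails.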
FOV Rating | Field of View Width | Face Size |
---|---|---|
Poor FOV | 5.5 feet | 1/9 of the image |
Max Acceptable FOV | 4.5 feet | 1/7 of the image |
Excellent FOV | 3.5 feet | 1/6 of the image |
Horizontal Angle
Facial analysis requires a clear image of the full face, directly facing the camera - with minimal turning of the head to the left or right (horizontal angle) relative to the camera.
Vertical Angle
Facial analysis requires a clear image of the full face, directly facing the camera - with minimal tilting of the head up or down (vertical angle) relative to the camera.
Cameras need to be mounted low enough or far enough away that the vertical angle, or slope, does not exceed 20% above eye level when subjects are in focus in the foreground. Given an average eye height of 5 feet, a camera 10 feet away cannot be mounted more than 20% of 10 feet (2 feet) above eye height, so no higher than 5 + 2 = 7 feet. A camera 20 feet away can be mounted as high as 9 feet (20% of 20 = 4 feet, and 4 + 5 = 9). See “Determining Camera Mounting Height” later in this section for more details.
Lighting Level
Facial analysis requires even lighting that clearly shows the detail in a face. Lighting conditions must not produce shadows or dark areas in the face (underexposure), nor glare or washed-out areas (overexposure). A face with lots of visible detail and a wide range of dark and light pixels (referred to as “dynamic range”) is required for facial analysis.
Photo | Description |
---|---|
Photo A – Good Lighting | There are no areas of the face in shadow or glare. There is wide dynamic range - lots of both light and dark areas within the face, and lots of detail is visible. |
Photo B – Overexposed | There are no areas of shadow, but there are significant areas with glare (notice the cheeks, nose and |
Photo C – Marginally acceptable | There are some areas of the face in mild shadow and the face appears somewhat darker than desired. There is marginally acceptable dynamic range - moderate amounts of both light and dark areas within the face, and moderate amounts of detail are visible. |
Photo D – Underexposed | There are substantial areas of shadow and/or not enough light. There is a narrower dynamic range - |
Additional Considerations for Megapixel Cameras
Field of View
With the addition of megapixel cameras to your security solution, you can use the higher resolutions available to provide a wider field of view. In essence, this allows the use of fewer cameras with more coverage while still capturing face profiles. It is extremely important to understand that the same principles still apply for facial recognition, including pixels between the eyes, horizontal and vertical angles, and lighting.
The number of horizontal pixels is still the key factor in performing 3VR facial surveillance; the advantage can be seen in the table below.
Resolution | Megapixels | Width for Face (Feet) |
---|---|---|
1024 X 768 | 0.7 | 6.5 |
1280 X 1024 | 1.3 | 8.0 |
1600 X 1200 | 2 | 10.5 |
2048 X 1536 | 3 | 13.5 |
Pixels Between the Eyes
The same principles that apply to lower resolution cameras apply to megapixel cameras; 3VR still requires 35 pixels between the eyes. However, analysis is conducted on the full megapixel frame of the camera’s output. For example, with a resolution of 1280 x 1024, 3VR conducts facial analysis at 1280 x 1024 versus an analog camera at 4CIF (704 x 576).
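The width table in the Field of View section above can be approximated by scaling the 4.5-foot limit at 4CIF in proportion to horizontal resolution. The sketch below assumes simple linear scaling from the 704-pixel baseline; the function name is hypothetical, and the published table values appear to be rounded differently, so the results agree only approximately.

```python
BASELINE_PIXELS = 704   # 4CIF horizontal resolution, from the text
BASELINE_FOV_FT = 4.5   # maximum field of view width at 4CIF, from the text


def max_fov_width_ft(horizontal_pixels: int) -> float:
    """Scale the 4CIF field-of-view limit linearly with horizontal pixels
    (an assumption; the guide does not state how its table was derived)."""
    return BASELINE_FOV_FT * horizontal_pixels / BASELINE_PIXELS


# CIF, 4CIF, and the four megapixel resolutions from the table.
for width in (352, 704, 1024, 1280, 1600, 2048):
    print(f"{width} px wide: about {max_fov_width_ft(width):.2f} ft")
```

The same scaling gives 2.25 feet at CIF (352 horizontal pixels), which matches the CIF limit quoted for DVR images later in this guide.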
Image Use and Optimization for Facial Analysis
Overview
3VR can analyze faces from two primary sources:
Digital images such as photographs and mugshots can be imported
CCTV cameras can be connected to a 3VR and video can be continuously analyzed
Imported Images
Any image that is in a supported digital format may be imported into the 3VR system. Supported formats include: BMP, JPEG and GIF. Only faces in images that meet the requirements enumerated in the previous two sections (pixels between the eyes, angles, lighting) may be successfully analyzed by the 3VR system.
Passport Photos/Mugshots
Generally, passport photos and mugshots provide high quality images for facial analysis. The angles and the lighting on these photos are generally within guidelines. Users should check that the pixels between the eyes are sufficient. 3VR uses an import photo process that provides immediate feedback if the number of pixels between the eyes is insufficient.
Images from DVRs
Images from DVRs may not provide sufficient quality for facial analysis. Users should qualify and set appropriate expectations. Video from DVRs is often recorded at resolutions lower than 4CIF, such as 2CIF or CIF. Because of the reduced resolution, the maximum field of view that still provides sufficient detail narrows significantly. For example, images at CIF resolution would need to have the face in focus in a field of view no wider than 2.25 feet.
Furthermore, the field of view of cameras on traditional DVRs is often 7’ to 12’, which is significantly wider than the 4.5-foot maximum that would apply even if the video were recorded at 4CIF.
General Photos from Digital Cameras
Images from digital cameras may provide sufficient quality for facial analysis. To the extent that these photos exhibit the same properties as passport photos, suitability will be more likely.
Photos of people in a group or performing an activity may not be suitable if the person’s head is significantly turned horizontally or tilted vertically. This is quite common and will affect performance.
Digital cameras provide megapixel images. Given the very high resolution of these images (compared to analog photos), the width of the field of view may be very wide (as much as 20’ wide). Users should verify that the image has not been reduced in resolution, a common technique for reducing file size. Reducing the resolution proportionally reduces the number of pixels between the eyes of every face in the image.
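To illustrate the effect of downsizing, the following sketch estimates the pixels between the eyes after a photo is resized. The function name and the sample numbers are hypothetical, and the calculation assumes the image is scaled uniformly (aspect ratio preserved).

```python
MIN_EYE_PIXELS = 35  # minimum pixels between the eyes, from the text


def eye_pixels_after_resize(eye_pixels: float, original_width: int,
                            new_width: int) -> float:
    """Pixels between the eyes after a uniform resize of the image."""
    return eye_pixels * new_width / original_width


# Hypothetical photo: 2048 pixels wide with 60 pixels between the eyes,
# later reduced to 800 pixels wide to save file size.
before = 60.0
after = eye_pixels_after_resize(before, 2048, 800)
print(f"before: {before:.1f} px, after: {after:.1f} px")
print("still usable" if after >= MIN_EYE_PIXELS else "below the 35-pixel minimum")
```

In this made-up case a photo that comfortably met the requirement drops to about 23 pixels between the eyes after resizing and no longer qualifies.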
CCTV Video
Many existing camera positions are designed for alarm assessment and activity monitoring. Those cameras must be adapted for use in facial analysis. Existing infrastructure can be used for facial analysis, but it must be optimized to meet the imaging guidelines enumerated in this document.
The remainder of this section is a lengthy but important treatment of design and deployment options for CCTV video use.
Environment
First, examine the physical layout of the facility to identify natural “choke points”: areas where subjects will appear within a limited field of view and are most likely to look “straight-on” toward the camera(s). Ideal choke points include entrances and hallways. This area should be free from any obstructions that might come between the camera and subject (including transparent barriers that can create glare or reflection problems). These areas should also be devoid of distractions that may entice subjects to look away from the camera while passing through. Ideally, subjects will spend 3 seconds passing through a choke-point area.
The recommended width of a single-camera choke point is 4 feet, with a maximum width of 4½ feet. Areas wider than 4½ feet require multiple cameras positioned so the lines of sight slightly overlap. The degree of overlap should enable each camera to capture the full face of a subject who passes halfway between them, instead of each camera capturing only half the subject’s face.
Scene lighting must be sufficient to produce a clear, sharp image. Excessive background lighting, blooming or shadowing conditions must be avoided.
Camera Positioning
Cameras should be centered to increase the opportunities to obtain straight-on (perpendicular) face images. In addition, the distance from the subject should be the greatest allowable (subject to lens specifications that provide a proper Field Of View - less than 4.5 feet, as explained in Section 3). The greater the distance between camera and subject, the longer the Depth of Field (the area within which the subject will remain in focus), and the more usable ‘face frames’ are delivered.
Optimal full-facial recognition occurs when a camera is mounted level with, and horizontally perpendicular to, the subject’s face. Harsh camera angles are detrimental and will seriously degrade the results.
While cameras are often mounted higher than the optimal face-level height, adequate Face Capture can occur provided the maximum Vertical Angle of Incidence is less than a 20% slope (see illustration below). The greater the distance between camera and subject, the higher the camera can be mounted while maintaining this threshold.
Determining Camera Mounting Height - No more than 20% slope to eyes in the face
The first step is to determine an average “face-height”. This value is application-specific (e.g., an environment with young children will require a lower value). For this example, we will assume an average face height of 5 feet.
The second step is to measure the “subject distance” (from lens to subject) and the mounting height for the camera. To maintain a vertical slope of no more than 20%, the camera can be mounted approximately 0.2 feet above face height for every foot of subject distance, as shown below.
Maximum Camera Mounting Height (based on average face height of 5' and a max vertical slope of 20%)
D - Distance to Subject | 10' | 15' | 20' | 25' | 30' |
---|---|---|---|---|---|
0.2 x D (20% of Distance to Subject) | 2' | 3' | 4' | 5' | 6' |
E - Eye Height | 5' | 5' | 5' | 5' | 5' |
H - Max Mounting Height | 7' | 8' | 9' | 10' | 11' |
Simple Formula: (D x 0.2) + E = H
Multiply the subject distance (D) by 0.2 (for a 20% slope). Add the result to the eye height (E). This equals the maximum mounting height (H).
Example: assuming a 5’ eye height and 15’ subject distance:
15’ (subject distance) x 0.2 (slope) = 3’
3’ + 5’ (eye height) = 8’ mounting height
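The formula is simple enough to check in code. The sketch below is illustrative only; the function names are hypothetical, and the 5-foot default eye height is the assumption used in the example above.

```python
MAX_SLOPE = 0.20  # maximum vertical slope (20%), from the text


def max_mounting_height_ft(distance_ft: float, eye_height_ft: float = 5.0) -> float:
    """H = (D x 0.2) + E: highest mounting point that keeps a 20% slope."""
    return distance_ft * MAX_SLOPE + eye_height_ft


def vertical_slope(mount_height_ft: float, distance_ft: float,
                   eye_height_ft: float = 5.0) -> float:
    """Slope from eye level up to the camera, as a fraction (0.20 = 20%)."""
    return (mount_height_ft - eye_height_ft) / distance_ft


# Reproduce the distances used in the mounting height table.
for d in (10, 15, 20, 25, 30):
    print(f"{d} ft away: mount no higher than {max_mounting_height_ft(d):.0f} ft")

# Check an existing installation: camera at 9 ft, subject 12 ft away.
slope = vertical_slope(9.0, 12.0)
print(f"slope = {slope:.0%}", "(ok)" if slope <= MAX_SLOPE else "(too steep)")
```

The second function is useful when auditing cameras that are already installed: measure the mounting height and subject distance, then confirm the slope stays at or below 20%.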
Remember, a 20% slope is a maximum. A smaller slope will provide better results, and mounting the camera with no slope, exactly at the average face height, will provide the best results.
Example: A potential entry point offers 2 appropriate mounting positions, one 12 feet away and the other 24 feet away. In this case, the farther mounting location is more appropriate, because the same mounting height produces a smaller slope at the greater distance.
Camera/Lens Specifications
Cameras designated for face-based recording must comply with a wide range of specifications. Primary among them is the use of high-resolution cameras (480 TVL or greater). The camera and lens must be appropriate to the scene lighting conditions: for example, a black & white low-lux or day/night camera for poorly lit areas, and a super-dynamic camera where severe back-lighting can occur.
An appropriate 4’ wide field of view (FoV) is the result of the focal length of the lens in relation to the object distance from the camera. While an appropriate “fixed focal length” lens can be utilized, a VariFocal Lens is strongly recommended because its selectable focal length range allows the field of view to be fine-tuned.
VariFocal Lenses are available with manual or automatic iris features. The automatic iris format is used in applications where lighting conditions may vary, for example, areas exposed to daytime sunlight.
Lens Selection
Typical VariFocal Lenses fall within the following approximate ranges: 3 to 8 millimeters (mm), 3 to 12mm, 5 to 50mm, and 20 to 100mm. The most common and least expensive are the 3 to 8mm versions, although most Face Capture applications will require VariFocal Lenses with longer focal lengths.
Guidelines for Determining Lens Focal Length
A 4’ wide FoV requires 1.2mm per foot of distance from subject; a 4½’ wide FoV requires 1.1mm per foot.
Distance from Object/Focal Length Needed for a Camera with 1/3” CCD
FoV | 10' | 20' | 30' | 40' | 50' | 60' |
---|---|---|---|---|---|---|
4.0 ft. wide | 12mm | 24mm | 36mm | 48mm | 60mm | 72mm |
4.5 ft. wide | 11mm | 22mm | 33mm | 44mm | 55mm | 66mm |
The estimates above assume a camera with a 1/3” CCD (the most common size). Results will vary in applications using different sensor formats such as ½” or ¼”. Lens calculation tools are available from most camera manufacturers (online and pocket ‘slide-rule’ versions).
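The per-foot figures above follow from the usual thin-lens approximation: focal length = sensor width x distance / field of view width. The sketch below assumes a nominal imaging width of 4.8mm for a 1/3” sensor, which is a common nominal value and not a figure taken from this guide; the function name is hypothetical.

```python
SENSOR_WIDTH_MM = 4.8  # nominal width of a 1/3" sensor (assumption)


def focal_length_mm(distance_ft: float, fov_width_ft: float,
                    sensor_width_mm: float = SENSOR_WIDTH_MM) -> float:
    """Approximate focal length needed for a given field of view width.

    Distance and field of view width must use the same unit (feet here),
    so their ratio is unitless and the result is in millimeters.
    """
    return sensor_width_mm * distance_ft / fov_width_ft


# Tabulate focal lengths for the two field-of-view widths in the guide.
for fov in (4.0, 4.5):
    row = ", ".join(f"{d}': {focal_length_mm(d, fov):.0f}mm"
                    for d in (10, 20, 30, 40, 50, 60))
    print(f"{fov} ft wide -> {row}")
```

With these assumptions a 4’ field of view works out to 1.2mm per foot of distance, matching the guideline. A 4½’ field of view works out to about 1.07mm per foot, which the guideline rounds up to 1.1mm, so the table values for 4.5 feet run slightly higher than this calculation.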
With megapixel cameras, deployment no longer requires such a long telephoto range as with analog and non-megapixel IP cameras, although this will depend on the particular application. Please refer to the table in the Field of View section under Additional Considerations for Megapixel Cameras to determine the maximum width for each application.
Camera Face Requirement Chart in 3VR
Requirement | Details |
---|---|
Face Size | 35 pixels between the eyes required. Rules of thumb: the field of view when the subject is in focus must be less than 4.5 feet at the face, or the face width must be at least 1/7 of the field of view. |
Zoom | Zoom lens recommended: 5-50 mm (5%-15%) angle of view. |
Vertical Slope | Recommended: no more than a 20% slope from eye level (see example). Rule of thumb: the middle of the nose should be higher than, or at least level with, the bottom of the earlobes. |
Horizontal Angle | Rule of thumb: the image must simultaneously show both ears of the subject. |
Lighting | High quality required with wide dynamic range. Well lit, no harsh shadows or glare. |
Event Length | There must be a period of time during which the subject is in focus, large enough (face 1/7 of the field of view), well lit, and facing the camera so that both ears are visible and the middle of the nose is above the bottom of the earlobes. Recommended: 3 seconds (minimum 2 seconds). |
Camera Features | B/W or color (appropriate to lighting conditions). Back light control is sometimes helpful. Manual or auto-iris lens (as appropriate to the application). |
Examples | Doorways, hallways, entrances, exits, queue lines, teller lines |
Result
Quality Image
With a better understanding of how to optimize image quality, you can now:
Better qualify which opportunities are strong fits for 3VR
Set the appropriate level of expectations with partners and users
Design a system that accommodates real world conditions
Deliver a solution that is optimally effective
Appreciate the significant advantages of utilizing megapixel cameras
Remember the following principles as you install cameras for face recognition:
Any image, whether video or photo, requires sufficient detail. Detail is determined by (1) the level of resolution in the image and (2) the size of the person’s face relative to the size of the image.
The person’s face must look directly towards the camera. This affects both the class of imported photos that are acceptable and how video cameras must be positioned to capture faces.
Imported images have to meet the requirements listed above. With the exception of mugshots and passport photos, most photos do not meet these requirements.
Video cameras must be positioned specifically to capture faces. The cost may be low, but the skill required is not trivial. Care must be taken to position cameras precisely so that they capture faces consistently.
Megapixel cameras in your security solution allow a significantly wider field of view, but they also require more storage and can degrade system performance if a proper balance is not struck.
It should be understood that determining the maximum mounting height is not affected by the increase in resolution of megapixel cameras. The same formula must be used to stay within the 20% slope specification.