Pepper Facial Recognition
Written by Richard
Part of the Pepper series
In my last blog about Pepper I mentioned that I had tried to get Pepper to recognise some of my colleagues and greet them by name. The results were, to put it nicely, less than impressive.
Pepper has two documented ways of recognising people: face detection and face recognition. Using face recognition entails using Choregraphe to wire the face recognition modules together in an application and then triggering that application. I was never able to get Pepper's face recognition to work, no matter the lighting conditions.
Face detection works automatically: when a face is detected, an ID value for that face is generated if everything goes well. It takes a little bit of time for the ID value to be assigned. The value can be -1, 0, or some other integer. -1 means no face can be recognised, perhaps because the lighting is poor or the face is too far away. 0 means a face was seen but not enough data is available to recognise it again. Any other integer value means the face has been seen and the data has been collected. This user data can be stored internally and recalled later.
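If you want to see those values for yourself, here is a minimal sketch of how one might watch the raw FaceDetected key over the qi framework (the IP is a placeholder; the exact layout of the returned value is described in the NAOqi documentation):
import time
import qi

PEPPER_IP = "xxx.xxx.xxx.xxx"  # placeholder: your Pepper's address

session = qi.Session()
session.connect("tcp://" + PEPPER_IP + ":9559")

memory = session.service("ALMemory")
faceDetection = session.service("ALFaceDetection")
faceDetection.subscribe("face_id_test")  # start the face detection engine

try:
    for _ in range(20):
        # FaceDetected is empty when no face is seen; otherwise it holds a
        # timestamp plus per-face information, including the recognition data
        print(memory.getData("FaceDetected"))
        time.sleep(0.5)
finally:
    faceDetection.unsubscribe("face_id_test")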
I worked with face detection for a while but soon found that, if we were lucky, Pepper would assign the same ID value to a face he had seen before maybe 40% of the time. And even then, he would sometimes mistake me for somebody else, or vice-versa.
I contacted Softbank support to ask about the issues I was having. Essentially, they told me I would need to develop my own face detection algorithm. That's just as well, I guess, because Pepper uses some very old technology for this, as it does for NLP.
I prefer to have applications installed on Pepper rather than having a laptop control Pepper. The issue with that is that we cannot install packages on Pepper; we do not have root access. The best we can hope for is to find a 100% pure Python solution so we can include the needed packages in a library folder inside a Choregraphe application. If any part of a package needs to be compiled, it cannot be used on Pepper: Pepper runs an Aldebaran flavour of Linux, so it would have to be compiled on Pepper's OS, which is not easily possible. If you are lucky enough to have only .py files, just create a lib directory in your application containing the necessary Python files and modify Pepper's PYTHONPATH in your application to point to that directory, as in the sketch below.
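As a rough sketch, that path tweak could be done in a Choregraphe Python box like this, assuming the usual box API where behaviorAbsolutePath() returns the installed behavior's folder and "lib" is the folder you bundled your pure Python packages into (the imported package name is hypothetical):
import os
import sys

class MyClass(GeneratedClass):
    def __init__(self):
        GeneratedClass.__init__(self)

    def onLoad(self):
        # Point Python at the pure Python packages bundled with this application
        lib_path = os.path.join(self.behaviorAbsolutePath(), "lib")
        if lib_path not in sys.path:
            sys.path.insert(0, lib_path)

    def onInput_onStart(self):
        import some_pure_python_package  # hypothetical bundled dependency
        self.onStopped()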
I have not yet taken a look at all the packages used to determine whether everything is pure Python. I know for sure that OpenCV is not 100% pure Python, but I also know that OpenCV is already available on Pepper. If I ever get funding for this project, I will see whether I can get everything installed on Pepper. So for now, the application is relegated to calling remote Naoqi methods via the qi framework.
This implementation uses the face_recognition library, which is available from a number of places, including PyPI. It is a one-shot facial recognition algorithm, meaning you only need one image of a person to be able to recognise them. There is a lot of information on the web on how to use this package, so I will not discuss how to configure and train face_recognition. The code and information for face_recognition can be found on GitHub and PyImageSearch, as well as other sites and blogs. These are just the two sources I used.
Lucky for me, a version of face recognition was already made for another project we called Doorman (perhaps a blog about this will come later). It had already been trained on many company employees along with their name tags. All I needed to do was adapt it from using a webcam to using Pepper's camera, and have Pepper greet people by name.
This is just a proof of concept for now. Due to GDPR concerns, it may never see the light of day. Still, it was a good exercise, and it does indeed perform much better than what is built into Pepper.
A Demo:
Here is a short demo of Pepper recognising me for the first time. It is hard to film and not hide my face at the same time, which is why it took a while for Pepper to recognise me. On the right you can see my MacBook screen. Although it is small, you can make out the video stream used for the recognition.
Code:
I really dislike Medium's way of doing code blocks. Formatting is important to code so that it is readable, and pasting code into these blocks removes the indentation and spaces the lines too far apart. I will present the code in blocks, with comments showing where the larger blocks start and end.
As you will notice, in some places camelCase is used and in other places "_" is used to separate the words. Some Python coding tutorials suggest using "_". I prefer not to do that because, in essence, I am lazy and adding a "_" feels like too much work. There are also places in Python where "_" is used for other reasons. Both camelCase and "_" take two fingers to type, but at least camelCase is one less character. Either way, it is easy to see what I wrote and what I borrowed.
The original code for the other project included another tag for flight numbers at the request of the client. I have stripped out the flight numbers for this demonstration.
main.py
from naoqi import qi
from naoqi import ALBroker
from naoqi import ALProxy
import face_recognition
import cv2
import time
import os
import sys
import numpy as np
import pickle
import argparse
import vision_definitions
import traceback
from PIL import Image
These are the packages needed to implement the application. Naturally, you will need to install these packages into your Python environment. This will require that the Naoqi Python library is included in your Python path.
IP = "xxx.xxx.xxx.xxx"
PORT = 9559
Define the IP address assigned to Pepper. Both the laptop and Pepper need to be on the same WiFi network.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", type=str, default=IP,
                        help="Robot IP address. On robot or Local Naoqi: use '127.0.0.1'.")
    parser.add_argument("--port", type=int, default=PORT,
                        help="Naoqi port number: use default '9559'")
    args = parser.parse_args()

    session = qi.Session()
    try:
        session.connect("tcp://" + args.ip + ":" + str(args.port))
    except RuntimeError:
        print("Can't connect to Naoqi at ip \"" + args.ip + "\" on port " + str(args.port) + ".\n" +
              "Please check your script arguments. Run with -h option for help.")
        sys.exit(1)

    idPersons(session, args.ip, args.port)
This is the main block. It actually sits at the bottom of the main.py file, but I cover it first before going on to the idPersons() method.
argparse is used to parse any command line arguments.
The command line can contain two parameters, the IP address of Pepper and the port to connect to. The constants IP and PORT are the default values if no command line parameters are given.
A qi session is obtained and a connection is attempted. If the connection fails, a failure message is printed on the console and the application ends.
If the session connection is successful, the idPersons() method is called with the configured values for session, IP and PORT.
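If you want something other than the defaults defined at the top of the file, the script can be launched from a laptop on the same network roughly like this (the IP address below is just a placeholder for your Pepper's address):
python main.py --ip 192.168.1.42 --port 9559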
def idPersons(session, ip=IP, port=PORT):
This is the definition and signature for the idPersons() method. Everything that follows belongs to this block of code.
    # begin idPersons block
    # Obtain the ALVideoDevice service
    videoService = session.service('ALVideoDevice')

    # Subscribe to the top camera
    SID = "pepper_face_recognition"
    resolution = vision_definitions.kQVGA
    colorSpace = vision_definitions.kRGBColorSpace
    nameId = videoService.subscribe(SID, resolution, colorSpace, 10)
The code does the following:
- Obtain the Naoqi video service.
- Subscribe to the video service for the top camera with the given SID. vision_definitions is a module in the Naoqi SDK; kQVGA sets the camera to QVGA resolution (320x240) and kRGBColorSpace to RGB colors (3 channels).
    # Obtain the ALTextToSpeech service
    tts = session.service('ALTextToSpeech')
Obtain the ALTextToSpeech service for later use. This is used to make Pepper greet the person.
Now we set up some variables and read in the encodings. You can find everything you need to generate the encodings on the web. There are some good tutorials and code on Github and Pyimagesearch. This code just uses the pickled encodings generated by pretty much the code found at these sites.
    known_face_encodings = []
    known_face_names = []

    # Load known face names
    with open('encodings_names', 'rb') as fp:
        known_face_names = pickle.load(fp)

    # Load known face encodings
    with open('encodings', 'rb') as fp:
        known_face_encodings = pickle.load(fp)
We read the serialized files that we named 'encodings_names' as known_face_names and 'encodings' as known_face_encodings. These are the trained known faces with associated name tags.
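For reference, generating those two pickle files can be as simple as the rough sketch below, assuming a folder of images named after each person (this is not the exact Doorman training script, just the general idea):
import os
import pickle

import face_recognition

known_face_encodings = []
known_face_names = []

# Assumption: a folder called known_faces containing one image per person,
# e.g. known_faces/Richard.jpg
for filename in os.listdir("known_faces"):
    name = os.path.splitext(filename)[0]
    image = face_recognition.load_image_file(os.path.join("known_faces", filename))
    encodings = face_recognition.face_encodings(image)
    if encodings:  # skip images in which no face was found
        known_face_encodings.append(encodings[0])
        known_face_names.append(name)

# Serialize the data the way main.py expects to read it back
with open("encodings", "wb") as fp:
    pickle.dump(known_face_encodings, fp)
with open("encodings_names", "wb") as fp:
    pickle.dump(known_face_names, fp)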
    # Initialize some variables
    face_locations = []
    face_encodings = []
    face_names = []
    process_this_frame = 0
    scale = 0.5
    revscale = 1 / scale
    width = 320
    height = 240
    blank_image = np.zeros((width, height, 3), np.uint8)
    image = np.zeros((width, height, 3), np.uint8)
    greeted = []
Now we initialize some variables we need to process the face detection. Notice we use an image size of 320 x 240 with a color depth of 3. In OpenCV, colors are in BGR order, not the customary RGB found everywhere else and in face_recognition itself. The blank_image is used to display a red (access not allowed) or green (access granted) box on the screen.
    try:
This is the beginning of a large try/except block.
        while True:
Start an infinite loop.
            # Grab a single frame of video
            try:
The try/except block here is to handle getting an image from Pepper.
                result = videoService.getImageRemote(nameId)
                image = None
                if result is None:
                    print('cannot capture.')
                elif result[6] is None:
                    print('no image data string.')
                else:
                    image_string = str(result[6])
                    im = Image.frombytes("RGB", (width, height), image_string)
                    image = np.asarray(im)
- try to obtain an image from Pepper
- initialize the final image variable to None. The None value will be tested later to determine if we have a valid image.
- the attempt to obtain an image from Pepper can return None, or a result with no image string; in either case, just print a message saying so.
- if an image was obtained, convert the raw image string to a PIL image and then to a numpy array.
            except Exception as e:
                print(str(e))
                traceback.print_exc()
If an exception happens, print out what that exception is so we can later debug it.
            if image is not None:
This is an if block that covers the rest of the processing. If we got no image, we don't want to do anything and let the while loop begin again.
                # Only process every other frame of video to save time
                if process_this_frame == 0:
- test to see if we should process this frame. We only process half of the captured frames.
                    # begin processing-the-frame block
                    # Resize frame of video to 1/4 size for faster face recognition processing
                    small_frame = cv2.resize(image, (0, 0), fx=scale, fy=scale)

                    # Find all the faces and face encodings in the current frame of video
                    face_locations = face_recognition.face_locations(small_frame)
                    face_encodings = face_recognition.face_encodings(small_frame, face_locations)

                    face_names = []
                    name = "none"
                    for face_encoding in face_encodings:
                        # See if the face is a match for the known face(s)
                        distances = face_recognition.api.face_distance(known_face_encodings, face_encoding)
                        name = "Unknown"
                        blank_image[:, :] = (0, 0, 255)
                        distance_min_index = np.argmin(distances)
                        distance_min = np.amin(distances)
                        if distance_min < 0.53:
                            name = known_face_names[distance_min_index]
                            blank_image[:, :] = (0, 255, 0)
                            if name not in greeted:
                                tts.say("Hi " + name + " ! Nice to see you.")
                                greeted.append(name)
                        face_names.append(name)
                    # end of the for loop over the found face encodings
There is a lot going on here.
- resize the image to 1/4 size for faster processing
- get all the locations and encodings for the face(s) in the grabbed image.
- initialize an empty face_names list, and set name to "none" as opposed to "Unknown". If name remains "none" later on, then no faces were found in the image.
- loop through all the found face encodings
- now set name to "Unknown" since we found at least one face.
- initialize a red blank image (OpenCV is in BGR)
- get the face_recognition distances. A distance is a measure of how far a detected face is from each known face.
- get the index of the minimum value in distances
- get the minimum value in distances
- if the minimum distance is less than a threshold value, in this case 0.53, then do the following:
- 1. Obtain the name associated with the face
- 2. Set the blank image to green
- 3. If the name found does not already exist in the greeted list then greet the user by name and add the name to the greeted list.
- add the name to the face names list
if name == "none":
blank_image[:,:] = (0,0,255)process_this_frame += 1
process_this_frame = process_this_frame % 2
- if name is "none" then make the blank image red
- increment the process_this_frame by one and mod it with 2
                # Display the results
                frame = np.array(image)  # writable copy of the captured frame to draw on
                for (top, right, bottom, left), name in zip(face_locations, face_names):
                    # Scale the face locations back up since the frame we detected in was scaled down
                    top = int(top * revscale)
                    right = int(right * revscale)
                    bottom = int(bottom * revscale)
                    left = int(left * revscale)

                    # Draw a box around the face
                    if name == "Unknown":
                        color = (255, 0, 0)
                        write_color = (255, 255, 255)
                    else:
                        color = (0, 255, 0)
                        write_color = (0, 0, 0)
                    cv2.rectangle(frame, (left, top), (right, bottom), color, 2)

                    # Draw a label with a name below the face
                    cv2.rectangle(frame, (left, bottom + 70), (right, bottom), color, cv2.FILLED)
                    font = cv2.FONT_HERSHEY_DUPLEX
                    cv2.putText(frame, name, (left + 6, bottom + 29), font, 1.0, write_color, 1)

                # Convert the image to BGR color (which OpenCV uses) from RGB color (which face_recognition uses)
                bgr_image = frame[:, :, ::-1]
                frame_resized = cv2.resize(bgr_image, (0, 0), fx=0.75, fy=0.75)

                # Display the resulting image
                cv2.imshow('Video', frame_resized)
                cv2.imshow('Access', blank_image)

                # Hit 'q' on the keyboard to quit, 'r' to reload the encodings
                key = cv2.waitKey(1) & 0xFF
                if key == ord('q'):
                    break
                if key == ord('r'):
                    with open('encodings_names', 'rb') as fp:
                        known_face_names = pickle.load(fp)
                    with open('encodings', 'rb') as fp:
                        known_face_encodings = pickle.load(fp)
                    print('reread encodings')
            # end of if image is not None
        # end of the while loop block
Displaying the images should be pretty straightforward. For all the bounding boxes and names found:
- scale the face locations back up to the full image size
- draw a bounding box around the face (red for unknown, green for known)
- draw the label (red with white text for unknown, green with black text for known)
- display the resulting image reduced by 1/4
- if the 'q' key is pressed, break out of the while loop
- if the 'r' key is pressed, reload the initial encoding files.
        cv2.destroyAllWindows()
Once the q key is pressed, close all OpenCV windows.
    # end of the top-most try block
    except Exception as e:
        print(str(e))
        traceback.print_exc()
End the top most try block with an except block.
Conclusion:
I don't give all the answers in this blog. You will need to read and experiment with the face_recognition package to train it with your own images.
This blog shows how to use face_recognition, how to capture video frames from Pepper's camera, and how to make him say something.
In practice, one might show the resulting video frames on Pepper's tablet, although that is not really necessary. One could also make Pepper perform various greetings using the detected person's name. Maybe Pepper could even greet strangers and invite them to identify themselves so that he can recognise them later, but that operation has issues of its own.
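As a small illustration of the varied-greetings idea, the fixed tts.say() call in the loop could be swapped for something like this (just a sketch, and the phrases are made up):
import random

greeting_templates = [
    "Hi {name}! Nice to see you.",
    "Hello {name}, welcome back!",
    "Good to see you again, {name}.",
]

def greet(tts, name):
    # Pick a random phrase and have Pepper speak it via ALTextToSpeech
    tts.say(random.choice(greeting_templates).format(name=name))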