Vision-based automatic recognition of hand gestures has been a very active research topic in recent years, with motivating applications such as human-computer interaction (HCI), robot control and sign-language interpretation.

The general problem is quite challenging due to the complicated nature of static and dynamic hand gestures, complex backgrounds and occlusions.

Solving this problem in full generality requires elaborate algorithms and substantial computing resources.

Our motivation for this exercise is to build a system that performs arithmetic simply by using hand gestures as numbers (counters). In this way, even the youngest students will be able to work on real-time addition, subtraction and multiplication using gestures.

The first ideas for recognizing hand gestures in a camera-based control context involved placing markers at the fingertips. An associated algorithm detects the presence and color of the markers, from which it can identify which fingers are active in the gesture. This first exercise can even be done with Arduino robotics technology, using its own sensors.

The inconvenience of placing markers on the hands of students or teachers makes this approach impractical, both for hardware deployment and for usability with young students.

Today, more advanced computer-vision techniques that do not require markers are used. Hand-gesture recognition can be done with a curvature scale space method that involves finding the contours of the hand's boundary. This is a robust approach, invariant to scale, translation and rotation of the hand position; however, it is computationally very expensive and impractical to bring into the classroom.

Therefore, after studying different options, we propose a vision-based hand-posture recognition technique that uses skeleton images. A camera (webcam) is used to locate the center of gravity of the hand and the points farthest from that center, which give the locations of the fingertips; these are then used to obtain a skeleton image and, finally, to recognize the gesture.

We want an algorithm that identifies the hand pose in the input image as one of five possible commands (counts). The identified command is then used as a control input to perform an addition or another operation; the recognized number is stored in a numeric variable.

We will now explain the stages by which we recognize the hand gesture and identify the number.

  • Locate regions similar to hands according to learned skin-color statistics, producing a black-and-white output image.
  • Segment the hand region, eliminating the small "false alarm" regions that were declared "hand-like" based on their color statistics.
  • Compute the center of gravity (COG) of the hand region, as well as the farthest distance from the COG within the hand region.
  • Construct a circle, centered on the COG, that intersects all the fingers active in the count.
  • Extract a binary signal along the circle and classify the hand gesture according to the number of active regions (fingers) in the signal.

In the following sections we explain in more detail how we implemented the most complex steps of the exercise.

Suppose that the part of the scene around the hand has already been removed. Our first task, then, is to segment the hand in the image from the background. We achieve this goal in two steps.

First, we find the pixels in the scene that probably belong to the region of the hand, which we describe in this section. Then, we refine that result, as described below.

An important fact to know is that the red/green (R/G) ratio is a discriminative feature of human skin color.

Let's explain it in a practical way!

We show some images we captured, each with a hand gesture, along with scatter plots of the red and green components of the pixel intensities for the skin and non-skin regions in the images.

We see that the red/green ratio stays within a narrow band of values for skin pixels, while it is much more variable for non-skin pixels. Therefore, we can use this ratio to decide whether a pixel is likely to belong to the hand region or not.

So we apply a thresholding scheme: set all pixels with color intensities within the thresholds to one, and all the rest to zero, producing a black-and-white output image.
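As a sketch, this skin-detection step can be written in a few lines of NumPy. The threshold band below (1.05 to 2.0) is an illustrative assumption, not the exact values used in the practice:

```python
import numpy as np

def skin_mask(image, ratio_lo=1.05, ratio_hi=2.0):
    """Return a binary (0/1) mask where the red/green ratio falls
    inside the band typical of skin pixels.

    `image` is an RGB array of shape (H, W, 3); the thresholds are
    illustrative placeholders to be tuned on real images."""
    r = image[..., 0].astype(float)
    g = image[..., 1].astype(float) + 1e-6  # avoid division by zero
    ratio = r / g
    return ((ratio >= ratio_lo) & (ratio <= ratio_hi)).astype(np.uint8)
```

In practice the band would be learned from the scatter plots described above rather than hard-coded.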

Of course, this simple scheme could produce many wrong decisions, for example, many background pixels with skin-like colors could be classified as "hand-like."

We will try to fine-tune this output in the following steps:

2.2. Segmentation and elimination of false regions

To refine these results, we assume that the largest connected white region corresponds to the hand. We then use a relative region-size threshold to eliminate unwanted regions.

In particular, we eliminate regions that contain a number of pixels smaller than a threshold value.

The threshold value is chosen as 20% of the total number of white pixels.
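A minimal sketch of this clean-up step, using `scipy.ndimage.label` for connected components (the 20% fraction follows the text; the function name is our own):

```python
import numpy as np
from scipy import ndimage

def clean_mask(mask, frac=0.20):
    """Keep connected white regions whose pixel count is at least
    `frac` of all white pixels; smaller "false alarm" blobs are
    removed."""
    labels, n = ndimage.label(mask)       # label each white blob
    total_white = int(mask.sum())
    if total_white == 0:
        return mask
    keep = np.zeros_like(mask)
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() >= frac * total_white:
            keep[region] = 1
    return keep
```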

2.3. How do we find the center and the farthest distance?

Given the segmented hand region, we calculate its center of gravity (COG), (x̄, ȳ), as follows:

x̄ = (1/k) Σ xᵢ,   ȳ = (1/k) Σ yᵢ

where xᵢ and yᵢ are the x and y coordinates of the i-th pixel in the hand region, and k denotes the number of pixels in the region.

After obtaining the COG, we calculate the distance from the most extreme point of the hand to the center; normally, this farthest distance is the distance from the center of gravity to the tip of the longest active finger in the gesture.
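Both quantities can be computed directly from the binary mask; a minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def cog_and_reach(mask):
    """Center of gravity of the white region, plus the distance to
    the farthest white pixel (normally the tip of the longest
    active finger)."""
    ys, xs = np.nonzero(mask)             # coordinates of hand pixels
    cx, cy = xs.mean(), ys.mean()         # COG as the mean position
    dists = np.hypot(xs - cx, ys - cy)    # distance of each pixel to COG
    return (cx, cy), dists.max()
```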

2.4. Building a Circle

We draw a circle, centered on the COG, whose radius is 0.7 times the farthest distance.

This circle is likely to intersect all the fingers active in the gesture.
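The circle can be sketched by sampling points at radius 0.7 times the farthest distance; the factor 0.7 follows the text, while the number of samples (360) is an arbitrary choice of ours:

```python
import numpy as np

def circle_points(center, max_dist, n=360, factor=0.7):
    """Sample n pixel positions on the circle of radius
    factor * max_dist around the COG."""
    cx, cy = center
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    r = factor * max_dist
    xs = np.round(cx + r * np.cos(theta)).astype(int)
    ys = np.round(cy + r * np.sin(theta)).astype(int)
    return xs, ys
```

Reading the binary mask at these positions yields the circular signal used in the next step.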

2.5. Extracting a signal to classify the gesture

Now we extract a binary signal following the circle built in the previous step.

Ideally, the uninterrupted "white" parts of this signal correspond to the fingers or the wrist. Counting the number of zero-to-one (black-to-white) transitions in this signal, and subtracting one (for the wrist), gives the estimated number of active fingers in the gesture.

Knowing the number of fingers leads to recognition of the gesture.
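The transition counting can be sketched as follows; the signal is treated as circular, and the helper name is our own:

```python
import numpy as np

def count_fingers(signal):
    """Count zero-to-one transitions in the circular binary signal
    and subtract one (for the wrist), giving the estimated number
    of active fingers."""
    s = np.asarray(signal).astype(int)
    # A rising edge is a 1 whose circular predecessor is a 0.
    rises = np.sum((s == 1) & (np.roll(s, 1) == 0))
    return max(int(rises) - 1, 0)
```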

An important property of this algorithm is that it counts only the number of active fingers, regardless of which particular fingers are active.

For example, let's explain it with these images:

The image shows three different ways in which our algorithm recognizes a three-finger count; any rotation, orientation or other combination of three fingers gives the same result.


Therefore, the program does not have to remember which three fingers must be used to express the number "three".

2.6. Scale or rotation is not a problem

Another important property of this algorithm is that its gesture recognition is scale invariant: hands of different sizes placed in the recognition box give the same result. We can even zoom the camera in or out to play with better angles inside the classroom.


This photo shows the result for a hand image showing two fingers. We show the output of several stages of our algorithm and obtain the zero-to-one transitions in the signal. The number of these transitions minus one (for the wrist) produces the estimated count.

2.7. Some limitations

Some tests, especially with the webcam, have led to incorrect results. In these cases, the failure is mainly due to some background parts being wrongly segmented as part of the hand region. Our algorithm seems to work well with somewhat complicated backgrounds, as long as there are not too many background pixels with colors similar to skin.


References

  • [1] J. Davis and M. Shah, "Visual Gesture Recognition", IEE Proc.-Vis. Image Signal Process., Vol. 141, No. 2, April 1994.
  • [2] C.-C. Chang, I.-Y. Chen, and Y.-S. Huang, "Hand Pose Recognition Using Curvature Scale Space", IEEE International Conference on Pattern Recognition, 2002.
  • [3] A. Utsumi, T. Miyasato, and F. Kishino, "Multi-Camera Hand Pose Recognition System Using Skeleton Image", IEEE International Workshop on Robot and Human Communication, pp. 219-224, 1995.
  • [4] Y. Aoki, S. Tanahashi, and J. Xu, "Sign Language Image Processing for Intelligent Communication by Communication Satellite", IEEE International Conf. on Acoustics, Speech, and Signal Processing, 1994.
  • [5] R. Rosales, V. Athitsos, L. Sigal, and S. Sclaroff, "3D Hand Pose Reconstruction Using Specialized Mappings", IEEE International Conf. on Computer Vision, pp. 378-385, 2001.
  • [6] C. Tomasi, S. Petrov, and A. Sastry, "3D = Classification + Interpolation", IEEE International Conf. on Computer Vision, 2003.
  • [7] W. T. Freeman and M. Roth, "Orientation Histograms for Hand Gesture Recognition", IEEE International Conf. on Automatic Face and Gesture Recognition, 1995.
  • [8] L. Bretzner, I. Laptev, and T. Lindeberg, "Hand Gesture Recognition using Multi-Scale Color Features, Hierarchical Models and Particle Filtering", IEEE International Conf. on Automatic Face and Gesture Recognition, 2002.
  • [9] J. Brand and J. Mason, "A Comparative Assessment of Three Approaches to Pixel-level Human Skin Detection", IEEE International Conference on Pattern Recognition, 2000.
  • [10] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice-Hall, 2nd edition, 2002.