- \documentclass[twoside,openright]{uva-bachelor-thesis}
- \usepackage[english]{babel}
- \usepackage[utf8]{inputenc}
- \usepackage{hyperref,graphicx,tikz,subfigure,float}
- % Link colors
- \hypersetup{colorlinks=true,linkcolor=black,urlcolor=blue,citecolor=DarkGreen}
- % Title Page
- \title{A generic architecture for gesture-based interaction}
- \author{Taddeüs Kroes}
- \supervisors{Dr. Robert G. Belleman (UvA)}
- \signedby{Dr. Robert G. Belleman (UvA)}
- \begin{document}
- % Title page
- \maketitle
- \begin{abstract}
- Applications that use complex gesture-based interaction need to translate
- primitive messages from low-level device drivers to complex, high-level
- gestures, and map these gestures to elements in an application. This report
- presents a generic architecture for the detection of complex gestures in an
- application. The architecture translates device driver messages to a common
- set of ``events''. The events are then delegated to a tree of ``event
- areas'', which are used to separate groups of events and assign these
- groups to an element in the application. Gesture detection is performed on
- a group of events assigned to an event area, using detection units called
``gesture trackers''. An implementation of the architecture as a daemon
- process would be capable of serving gestures to multiple applications at
- the same time. A reference implementation and two test case applications
- have been created to test the effectiveness of the architecture design.
- \end{abstract}
- % Set paragraph indentation
- \parindent 0pt
- \parskip 1.5ex plus 0.5ex minus 0.2ex
- % Table of content on separate page
- \tableofcontents
- \chapter{Introduction}
- \label{chapter:introduction}
- Surface-touch devices have evolved from pen-based tablets to single-touch
- trackpads, to multi-touch devices like smartphones and tablets. Multi-touch
- devices enable a user to interact with software using hand gestures, making the
- interaction more expressive and intuitive. These gestures are more complex than
- primitive ``click'' or ``tap'' events that are used by single-touch devices.
- Some examples of more complex gestures are ``pinch''\footnote{A ``pinch''
- gesture is formed by performing a pinching movement with multiple fingers on a
- multi-touch surface. Pinch gestures are often used to zoom in or out on an
- object.} and ``flick''\footnote{A ``flick'' gesture is the act of grabbing an
- object and throwing it in a direction on a touch surface, giving it momentum to
- move for some time after the hand releases the surface.} gestures.
- The complexity of gestures is not limited to navigation in smartphones. Some
- multi-touch devices are already capable of recognizing objects touching the
- screen \cite[Microsoft Surface]{mssurface}. In the near future, touch screens
- will possibly be extended or even replaced with in-air interaction (Microsoft's
- Kinect \cite{kinect} and the Leap \cite{leap}).
- The interaction devices mentioned above generate primitive events. In the case
- of surface-touch devices, these are \emph{down}, \emph{move} and \emph{up}
- events. Application programmers who want to incorporate complex, intuitive
- gestures in their application face the challenge of interpreting these
- primitive events as gestures. With the increasing complexity of gestures, the
- complexity of the logic required to detect these gestures increases as well.
This challenge limits, or even deters, application developers from using
complex gestures in their applications.
- The main question in this research project is whether a generic architecture
- for the detection of complex interaction gestures can be designed, with the
- capability of managing the complexity of gesture detection logic. The ultimate
- goal would be to create an implementation of this architecture that can be
- extended to support a wide range of complex gestures. With the existence of
- such an implementation, application developers do not need to reinvent gesture
- detection for every new gesture-based application.
- \section{Contents of this document}
- The scope of this thesis is limited to the detection of gestures on
- multi-touch surface devices. It presents a design for a generic gesture
- detection architecture for use in multi-touch based applications. A
- reference implementation of this design is used in some test case
- applications, whose purpose is to test the effectiveness of the design and
- detect its shortcomings.
- Chapter \ref{chapter:related} describes related work that inspired the
- design of the architecture. The design is described in chapter
- \ref{chapter:design}. Chapter \ref{chapter:implementation} presents a
- reference implementation of the architecture. Two test case applications
- show the practical use of the architecture components in chapter
- \ref{chapter:test-applications}. Chapter \ref{chapter:conclusions}
- formulates some conclusions about the architecture design and its
- practicality. Finally, some suggestions for future research on the subject
- are given in chapter \ref{chapter:futurework}.
- \chapter{Related work}
- \label{chapter:related}
- Applications that use gesture-based interaction need a graphical user
- interface (GUI) on which gestures can be performed. The creation of a GUI
- is a platform-specific task. For instance, Windows and Linux support
- different window managers. To create a window in a platform-independent
- application, the application would need to include separate functionalities
- for supported platforms. For this reason, GUI-based applications are often
- built on top of an application framework that abstracts platform-specific
- tasks. Frameworks often include a set of tools and events that help the
- developer to easily build advanced GUI widgets.
- % Existing frameworks (and why they're not good enough)
- Some frameworks, such as Nokia's Qt \cite{qt}, provide support for basic
- multi-touch gestures like tapping, rotation or pinching. However, the
- detection of gestures is embedded in the framework code in an inseparable
- way. Consequently, an application developer who wants to use multi-touch
- interaction in an application, is forced to use an application framework
- that includes support for those multi-touch gestures that are required by
- the application. Kivy \cite{kivy} is a GUI framework for Python
- applications, with support for multi-touch gestures. It uses a basic
- gesture detection algorithm that allows developers to define custom
- gestures to some degree \cite{kivygesture} using a set of touch point
- coordinates. However, these frameworks do not provide support for extension
- with custom complex gestures.
- Many frameworks are also device-specific, meaning that they are developed
- for use on either a tablet, smartphone, PC or other device. OpenNI
\cite{OpenNI2010}, for example, provides APIs only for natural interaction
- (NI) devices such as webcams and microphones. The concept of complex
- gesture-based interaction, however, is applicable to a much wider set of
- devices. VRPN \cite{VRPN} provides a software library that abstracts the
- output of devices, which enables it to support a wide set of devices used
- in Virtual Reality (VR) interaction. The framework makes the low-level
- events of these devices accessible in a client application using network
- communication. Gesture detection is not included in VRPN.
- % Methods of gesture detection
- The detection of high-level gestures from low-level events can be
- approached in several ways. GART \cite{GART} is a toolkit for the
- development of gesture-based applications, which states that the best way
- to classify gestures is to use machine learning. The programmer trains an
- application to recognize gestures using a machine learning library from the
- toolkit. Though multi-touch input is not directly supported by the toolkit,
- the level of abstraction does allow for it to be implemented in the form of
- a ``touch'' sensor. The reason to use machine learning is that gesture
- detection ``is likely to become increasingly complex and unmanageable''
- when using a predefined set of rules to detect whether some sensor input
- can be classified as a specific gesture.
The alternative to machine learning is to use a predefined set of rules
for each gesture. Manoj Kumar \cite{win7touch} presents a Windows 7
application, written in Microsoft's .NET, which detects a set of basic
- directional gestures based on the movement of a stylus. The complexity of
- the code is managed by the separation of different gesture types in
- different detection units called ``gesture trackers''. The application
- shows that predefined gesture detection rules do not necessarily produce
- unmanageable code.
- \section{Analysis of related work}
Implementations that support complex gesture-based interaction already
exist. However, gesture detection in these implementations is
- device-specific (Nokia Qt and OpenNI) or limited to use within an
- application framework (Kivy).
- An abstraction of device output allows VRPN and GART to support multiple
- devices. However, VRPN does not incorporate gesture detection. GART does,
- but only in the form of machine learning algorithms. Many applications for
- mobile phones and tablets only use simple gestures such as taps. For this
- category of applications, machine learning is an excessively complex method
- of gesture detection. Manoj Kumar shows that if managed well, a predefined
- set of gesture detection rules is sufficient to detect simple gestures.
This thesis explores the possibility of creating an architecture that
- combines support for multiple input devices with different methods of
- gesture detection.
- \chapter{Design}
- \label{chapter:design}
- % Diagrams are defined in a separate file
- \input{data/diagrams}
- \section{Introduction}
- Application frameworks are a necessity when it comes to fast,
- cross-platform development. A generic architecture design should aim to be
- compatible with existing frameworks, and provide a way to detect and extend
- gestures independent of the framework. Since an application framework is
- written in a specific programming language, the architecture should be
- accessible for applications using a language-independent method of
- communication. This intention leads towards the concept of a dedicated
- gesture detection application that serves gestures to multiple applications
- at the same time.
- This chapter describes a design for such an architecture. The architecture
components are shown in figure \ref{fig:fulldiagram}. Sections
- \ref{sec:multipledrivers} to \ref{sec:daemon} explain the use of all
- components in detail.
- \fulldiagram
- \newpage
- \section{Supporting multiple drivers}
- \label{sec:multipledrivers}
The TUIO protocol \cite{TUIO} is an example of a driver protocol that can be
used by multi-touch devices. TUIO uses ALIVE- and SET-messages to communicate
- low-level touch events (section \ref{sec:tuio} describes these in more
- detail). These messages are specific to the API of the TUIO protocol.
Other drivers may use different message types. To support more than one
- driver in the architecture, there must be some translation from
- device-specific messages to a common format for primitive touch events.
- After all, the gesture detection logic in a ``generic'' architecture should
- not be implemented based on device-specific messages. The event types in
- this format should be chosen so that multiple drivers can trigger the same
events. If each supported driver added its own set of event types to
the common format, the purpose of it being ``common'' would be defeated.
- A minimal expectation for a touch device driver is that it detects simple
- touch points, with a ``point'' being an object at an $(x, y)$ position on
- the touch surface. This yields a basic set of events: $\{point\_down,
- point\_move, point\_up\}$.
- The TUIO protocol supports fiducials\footnote{A fiducial is a pattern used
- by some touch devices to identify objects.}, which also have a rotational
- property. This results in a more extended set: $\{point\_down, point\_move,
- point\_up, object\_down, object\_move, object\_up,\\ object\_rotate\}$.
Due to their generic nature, the use of these events is not limited to the
TUIO protocol. Any other driver that can distinguish rotatable objects from
simple touch points could also trigger them.
The component that translates device-specific messages to common events is
- called the \emph{event driver}. The event driver runs in a loop, receiving
- and analyzing driver messages. When a sequence of messages is analyzed as
- an event, the event driver delegates the event to other components in the
- architecture for translation to gestures.
Support for a new touch driver can be added by writing a corresponding
event driver implementation. The choice of event driver implementation used
in an application depends on the driver support of the touch device being
used.
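To illustrate the role of the event driver, the following sketch shows what a minimal event driver interface could look like in Python. The class and method names are illustrative assumptions and do not necessarily match the reference implementation.

\begin{verbatim}
class Event(object):
    # A common low-level event, e.g. point_down, point_move or point_up.
    def __init__(self, event_type, x, y):
        self.type = event_type
        self.x = x
        self.y = y

class EventDriver(object):
    # Translates device-specific driver messages to common events.
    def __init__(self, delegate):
        # The delegate receives the common events (e.g. the root event area).
        self.delegate = delegate

    def receive_message(self, message):
        # Device-specific analysis is implemented by a subclass, which calls
        # self.trigger() when a sequence of messages forms an event.
        raise NotImplementedError

    def trigger(self, event):
        # Hand the translated event to the rest of the architecture.
        self.delegate.handle_event(event)
\end{verbatim}

A concrete driver, such as the TUIO event driver from chapter
\ref{chapter:implementation}, would subclass \texttt{EventDriver} and
implement \texttt{receive\_message}.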
- Because event driver implementations have a common output format in the
- form of events, multiple event drivers can be used at the same time (see
- figure \ref{fig:multipledrivers}). This design feature allows low-level
- events from multiple devices to be aggregated into high-level gestures.
- \multipledriversdiagram
- \section{Event areas: connecting gesture events to widgets}
- \label{sec:areas}
- Touch input devices are unaware of the graphical input
- widgets\footnote{``Widget'' is a name commonly used to identify an element
- of a graphical user interface (GUI).} rendered by an application, and
- therefore generate events that simply identify the screen location at which
- an event takes place. User interfaces of applications that do not run in
full screen mode are contained in a window. Events that occur outside the
- application window should not be handled by the application in most cases.
Moreover, a widget within the application window should be able to respond
to gestures of its own. For example, a button widget may respond to a
- ``tap'' gesture to be activated, whereas the application window responds to
- a ``pinch'' gesture to be resized. In order to restrict the occurrence of a
- gesture to a particular widget in an application, the events used for the
- gesture must be restricted to the area of the screen covered by that
widget. An important question is whether the architecture should offer a
- solution to this problem, or leave the task of assigning gestures to
- application widgets to the application developer.
- If the architecture does not provide a solution, the ``gesture detection''
- component in figure \ref{fig:fulldiagram} receives all events that occur on
- the screen surface. The gesture detection logic thus uses all events as
- input to detect a gesture. This leaves no possibility for a gesture to
- occur at multiple screen positions at the same time. The problem is
- illustrated by figure \ref{fig:ex1}, where two widgets on the screen can be
- rotated independently. The rotation detection component that detects
- rotation gestures receives events from all four fingers as input. If the
- two groups of events are not separated by clustering them based on the area
- in which they are placed, only one rotation event will occur.
- \examplefigureone
A gesture detection component could cluster events heuristically, based on
the distance between them. However, this method cannot guarantee
- that a cluster of events corresponds to a particular application widget.
In short, a gesture detection component is difficult to implement without
awareness of the location of application widgets. Moreover, the application
developer would still need to direct gestures to a particular widget
manually. This requires geometric calculations in the application logic,
- which is a tedious and error-prone task for the developer.
- The architecture described here groups events that occur inside the area
- covered by a widget, before passing them on to a gesture detection
- component. Different gesture detection components can then detect gestures
- simultaneously, based on different sets of input events. An area of the
- screen surface is represented by an \emph{event area}. An event area
- filters input events based on their location, and then delegates events to
- gesture detection components that are assigned to the event area. Events
- which are located outside the event area are not delegated to its gesture
- detection components.
- In the example of figure \ref{fig:ex1}, the two rotatable widgets can be
- represented by two event areas, each having a different rotation detection
- component. Each event area can consist of four corner locations of the
- square it represents. To detect whether an event is located inside a
- square, the event areas can use a point-in-polygon (PIP) test \cite{PIP}.
- It is the task of the client application to synchronize the corner
- locations of the event area with those of the widget.
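As an illustration, a polygon-shaped event area could implement the PIP test
with the classic ray casting algorithm. The sketch below is a generic version
of that test and is not taken from the reference implementation.

\begin{verbatim}
def point_in_polygon(x, y, corners):
    # Ray casting: cast a horizontal ray from (x, y) to the right and count
    # how many polygon edges it crosses; an odd count means "inside".
    inside = False
    n = len(corners)
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # The edge crosses the horizontal line through y; compute where.
            x_cross = x1 + (y - y1) * (x2 - x1) / float(y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Example: a unit square contains (0.5, 0.5) but not (1.5, 0.5).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
assert point_in_polygon(0.5, 0.5, square)
assert not point_in_polygon(1.5, 0.5, square)
\end{verbatim}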
- \subsection{Callback mechanism}
- When a gesture is detected by a gesture detection component, it must be
- handled by the client application. A common way to handle events in an
application is a ``callback'' mechanism: the application developer binds a
function to an event, and this function is called when the event occurs.
Because developers are familiar with this concept, the architecture uses a
callback mechanism to handle gestures in an application. Callback handlers
- are bound to event areas, since event areas control the grouping of events
- and thus the occurrence of gestures in an area of the screen.
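The following minimal sketch illustrates this callback mechanism; the class is
a stand-in for an event area and its method names are assumptions, not the
actual API of the reference implementation.

\begin{verbatim}
class EventArea(object):
    def __init__(self):
        self.handlers = {}   # gesture type -> list of callback functions

    def bind(self, gesture_type, handler):
        # Called by the application to register a gesture handler.
        self.handlers.setdefault(gesture_type, []).append(handler)

    def trigger(self, gesture_type, gesture):
        # Called by a gesture detection component when it detects a gesture.
        for handler in self.handlers.get(gesture_type, []):
            handler(gesture)

# Usage: the application binds a handler, a detection component triggers it.
area = EventArea()
area.bind("tap", lambda gesture: print("tap at", gesture))
area.trigger("tap", (100, 50))
\end{verbatim}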
- \subsection{Area tree}
- \label{sec:tree}
A basic data structure for event areas in the architecture would be a flat
list. When the event driver delegates an event, it is accepted by
- each event area that contains the event coordinates.
- If the architecture were to be used in combination with an application
- framework, each widget that responds to gestures should have a mirroring
- event area that synchronizes its location with that of the widget. Consider
- a panel with five buttons that all listen to a ``tap'' event. If the
- location of the panel changes as a result of movement of the application
- window, the positions of all buttons have to be updated too.
- This process is simplified by the arrangement of event areas in a tree
- structure. A root event area represents the panel, containing five other
- event areas which are positioned relative to the root area. The relative
- positions do not need to be updated when the panel area changes its
- position. GUI toolkits use this kind of tree structure to manage graphical
- widgets.
- If the GUI toolkit provides an API for requesting the position and size of
- a widget, a recommended first step when developing an application is to
- create a subclass of the area that automatically synchronizes with the
- position of a widget from the GUI framework. For example, the test
- application described in section \ref{sec:testapp} extends the GTK+
- \cite{GTK} application window widget with the functionality of a
- rectangular event area, to direct touch events to an application window.
- \subsection{Event propagation}
- \label{sec:eventpropagation}
- Another problem occurs when event areas overlap, as shown by figure
- \ref{fig:eventpropagation}. When the white square is dragged, the gray
- square should stay at its current position. This means that events that are
- used for dragging of the white square, should not be used for dragging of
- the gray square. The use of event areas alone does not provide a solution
- here, since both the gray and the white event area accept an event that
- occurs within the white square.
- The problem described above is a common problem in GUI applications, and
- there is a common solution (used by GTK+ \cite{gtkeventpropagation}, among
- others). An event is passed to an ``event handler''. If the handler returns
- \texttt{true}, the event is considered ``handled'' and is not
- ``propagated'' to other widgets. Applied to the example of the draggable
squares, the drag detection component of the white square should stop
- the propagation of events to the event area of the gray square.
In the example, dragging of the white square has priority over dragging of
the gray square because the white area is the widget actually being touched
- at the screen surface. In general, events should be delegated to event
- areas according to the order in which the event areas are positioned over
- each other. The tree structure in which event areas are arranged, is an
- ideal tool to determine the order in which an event is delegated. An
- object touching the screen is essentially touching the deepest event area
- in the tree that contains the triggered event, which must be the first to
- receive the event. When the gesture trackers of the event area are
- finished with the event, it is propagated to the parent and siblings in the
- event area tree. Optionally, a gesture tracker can stop the propagation of
- the event by its corresponding event area. Figure
- \ref{fig:eventpropagation} demonstrates event propagation in the example of
- the draggable squares.
- \eventpropagationfigure
- An additional type of event propagation is ``immediate propagation'', which
- indicates propagation of an event from one gesture tracker to another. This
- is applicable when an event area uses more than one gesture tracker. When
- regular propagation is stopped, the event is propagated to other gesture
- trackers first, before actually being stopped. One of the gesture trackers
- can also stop the immediate propagation of an event, so that the event is
- not passed to the next gesture tracker, nor to the ancestors of the event
- area.
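The delegation and propagation rules described above can be summarized by the
following sketch. It is a simplified model of the behaviour, with gesture
trackers represented as plain functions, and is not the literal code of the
reference implementation.

\begin{verbatim}
class EventArea(object):
    def __init__(self, contains):
        self.contains = contains   # function(event) -> bool
        self.children = []         # later children are assumed to lie on top
        self.trackers = []         # gesture trackers assigned to this area

    def delegate(self, event):
        # Returns False if propagation to parents and siblings must stop.
        # Deeper areas receive the event first, in right-to-left order.
        for child in reversed(self.children):
            if child.contains(event) and not child.delegate(event):
                return False       # a descendant stopped the propagation

        propagate = True
        for tracker in self.trackers:
            result = tracker(event)
            if result == "stop_immediate":
                return False       # skip remaining trackers and ancestors
            if result == "stop":
                propagate = False  # finish the trackers, then stop
        return propagate
\end{verbatim}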
- The concept of an event area is based on the assumption that the set of
originating events that form a particular gesture can be determined
- exclusively based on the location of the events. This is a reasonable
- assumption for simple touch objects whose only parameter is a position,
- such as a pen or a human finger. However, more complex touch objects can
- have additional parameters, such as rotational orientation or color. An
- even more generic concept is the \emph{event filter}, which detects whether
- an event should be assigned to a particular gesture detection component
- based on all available parameters. This level of abstraction provides
- additional methods of interaction. For example, a camera-based multi-touch
- surface could make a distinction between gestures performed with a blue
- gloved hand, and gestures performed with a green gloved hand.
- As mentioned in the introduction (chapter \ref{chapter:introduction}), the
- scope of this thesis is limited to multi-touch surface based devices, for
- which the \emph{event area} concept suffices. Section \ref{sec:eventfilter}
- explores the possibility of event areas to be replaced with event filters.
- \section{Detecting gestures from low-level events}
- \label{sec:gesture-detection}
- The low-level events that are grouped by an event area must be translated
- to high-level gestures in some way. Simple gestures, such as a tap or the
- dragging of an element using one finger, are easy to detect by comparing
- the positions of sequential $point\_down$ and $point\_move$ events. More
- complex gestures, like the writing of a character from the alphabet,
- require more advanced detection algorithms.
Sequences of events that are triggered by a multi-touch surface are often
of manageable complexity. An imperative programming style is
- sufficient to detect many common gestures, like rotation and dragging. The
- imperative programming style is also familiar and understandable for a wide
- range of application developers. Therefore, the architecture should support
- this style of gesture detection. A problem with an imperative programming
- style is that the explicit detection of different gestures requires
- different gesture detection components. If these components are not managed
- well, the detection logic is prone to become chaotic and over-complex.
- A way to detect more complex gestures based on a sequence of input events,
- is with the use of machine learning methods, such as the Hidden Markov
Models (HMM)\footnote{A Hidden Markov Model (HMM) is a statistical Markov
model with unobserved (hidden) states; each state transition depends only
on the current state, not on the full input history.} used for sign
language detection by Gerhard Rigoll et
- al. \cite{conf/gw/RigollKE97}. A sequence of input states can be mapped to
- a feature vector that is recognized as a particular gesture with a certain
- probability. An advantage of using machine learning with respect to an
- imperative programming style, is that complex gestures are described
- without the use of explicit detection logic, thus reducing code complexity.
- For example, the detection of the character `A' being written on the screen
- is difficult to implement using explicit detection code, whereas a trained
- machine learning system can produce a match with relative ease.
- To manage complexity and support multiple styles of gesture detection
- logic, the architecture has adopted the tracker-based design as described
- by Manoj Kumar \cite{win7touch}. Different detection components are wrapped
- in separate gesture tracking units called \emph{gesture trackers}. The
- input of a gesture tracker is provided by an event area in the form of
- events. Each gesture detection component is wrapped in a gesture tracker
- with a fixed type of input and output. Internally, the gesture tracker can
- adopt any programming style. A character recognition component can use an
- HMM, whereas a tap detection component defines a simple function that
- compares event coordinates.
- When a gesture tracker detects a gesture, this gesture is triggered in the
- corresponding event area. The event area then calls the callback functions
- that are bound to the gesture type by the application.
- The use of gesture trackers as small detection units allows extension of
- the architecture. A developer can write a custom gesture tracker and
- register it in the architecture. The tracker can use any type of detection
- logic internally, as long as it translates low-level events to high-level
- gestures.
- An example of a possible gesture tracker implementation is a
- ``transformation tracker'' that detects rotation, scaling and translation
- gestures.
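As an indication of what a gesture tracker can look like, the sketch below
implements a minimal drag detector. The event attributes and the
\texttt{trigger} call on the event area are assumptions about the surrounding
interfaces, chosen for illustration only.

\begin{verbatim}
class DragTracker(object):
    supported_gestures = ["drag"]

    def __init__(self, area):
        self.area = area           # event area that receives the gestures
        self.last_position = {}    # touch point id -> (x, y)

    def handle_event(self, event):
        if event.type == "point_down":
            self.last_position[event.point_id] = (event.x, event.y)
        elif event.type == "point_move":
            old_x, old_y = self.last_position[event.point_id]
            # Trigger a 'drag' gesture with the movement since the last event.
            self.area.trigger("drag", (event.x - old_x, event.y - old_y))
            self.last_position[event.point_id] = (event.x, event.y)
        elif event.type == "point_up":
            del self.last_position[event.point_id]
\end{verbatim}

Internally, a tracker like this is free to use any detection style, which is
exactly what makes the tracker-based design extensible.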
- \section{Serving multiple applications}
- \label{sec:daemon}
- The design of the architecture is essentially complete with the components
- specified in this chapter. However, one specification has not yet been
- discussed: the ability to address the architecture using a method of
- communication independent of the application's programming language.
- If the architecture and a gesture-based application are written in the same
- language, the main loop of the architecture can run in a separate thread of
- the application. If the application is written in a different language, the
- architecture has to run in a separate process. Since the application needs
- to respond to gestures that are triggered by the architecture, there must
- be a communication layer between the separate processes.
- A common and efficient way of communication between two separate processes
- is through the use of a network protocol. The architecture could run as a
- daemon\footnote{``daemon'' is a name Unix uses to indicate that a process
- runs as a background process.} process, listening to driver messages and
- triggering gestures in registered applications.
- \daemondiagram
- An advantage of a daemon setup is that it can serve multiple applications
- at the same time. Alternatively, each application that uses gesture
- interaction would start its own instance of the architecture in a separate
- process, which would be less efficient. The network communication layer
- also allows the architecture and a client application to run on separate
- machines, thus distributing computational load. The other machine may even
- use a different operating system.
- \chapter{Reference implementation}
- \label{chapter:implementation}
- A reference implementation of the design has been written in Python and is
- available at \cite{gitrepos}. The implementation does not include a network
- protocol to support the daemon setup as described in section \ref{sec:daemon}.
- Therefore, it is only usable in Python programs. The two test applications
- described in chapter \ref{chapter:test-applications} are also written in
- Python.
- To test multi-touch interaction properly, a multi-touch device is required. The
- University of Amsterdam (UvA) has provided access to a multi-touch table from
- PQlabs. The table uses the TUIO protocol \cite{TUIO} to communicate touch
- events.
The following components are included in the reference implementation:
- \textbf{Event drivers}
- \begin{itemize}
- \item TUIO driver, using only the support for simple touch points with an
- $(x, y)$ position.
- \end{itemize}
- \textbf{Event areas}
- \begin{itemize}
- \item Circular area
- \item Rectangular area
- \item Polygon area
- \item Full screen area
- \end{itemize}
- \textbf{Gesture trackers}
- \begin{itemize}
- \item Basic tracker, supports $point\_down,~point\_move,~point\_up$ gestures.
- \item Tap tracker, supports $tap,~single\_tap,~double\_tap$ gestures.
- \item Transformation tracker, supports $rotate,~pinch,~drag,~flick$ gestures.
- \end{itemize}
- The implementation of the TUIO event driver is described in section
- \ref{sec:tuio}.
- The reference implementation also contains some geometric functions that are
used by several event area implementations. The event area implementations
are straightforward and self-explanatory by name, so they are not discussed
further in this report.
- All gesture trackers have been implemented using an imperative programming
- style. Section \ref{sec:tracker-registration} shows how gesture trackers can be
- added to the architecture. Sections \ref{sec:basictracker} to
- \ref{sec:transformationtracker} describe the gesture tracker implementations in
- detail.
- \section{The TUIO event driver}
- \label{sec:tuio}
- The TUIO protocol \cite{TUIO} defines a way to geometrically describe tangible
- objects, such as fingers or objects on a multi-touch table. Object information
- is sent to the TUIO UDP port (3333 by default). For efficiency reasons, the
- TUIO protocol is encoded using the Open Sound Control \cite[OSC]{OSC} format.
- An OSC server/client implementation is available for Python: pyOSC
- \cite{pyOSC}.
- A Python implementation of the TUIO protocol also exists: pyTUIO \cite{pyTUIO}.
- However, a bug causes the execution of an example script to yield an error in
Python's built-in \texttt{socket} library. Therefore, the TUIO event driver
operates at a lower level, using the pyOSC package to receive TUIO messages
directly.
- The two most important message types of the protocol are ALIVE and SET
- messages. An ALIVE message contains the list of ``session'' id's that are
currently ``active'', which in the case of a multi-touch table means that they
- are touching the touch surface. A SET message provides geometric information of
- a session, such as position, velocity and acceleration. Each session represents
an object touching the touch surface. The only type of object on the
multi-touch table is what the TUIO protocol calls ``2DCur'': an $(x, y)$
position on the touch surface.
- ALIVE messages can be used to determine when an object touches and releases the
screen. For example, if a session id was present in the previous message but
not in the current one, the object it represents has been lifted from the
screen. SET messages provide information about movement. In the case of
simple $(x, y)$ positions,
- only the movement vector of the position itself can be calculated. For more
- complex objects such as fiducials, arguments like rotational position and
- acceleration are also included. ALIVE and SET messages are combined to create
- \emph{point\_down}, \emph{point\_move} and \emph{point\_up} events by the TUIO
- event driver.
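The essence of this translation is a comparison of two sets of session id's.
A sketch of that comparison (not the literal driver code) is shown below.

\begin{verbatim}
def diff_alive(previous_ids, current_ids):
    # Compare two ALIVE messages to find new and released touch points.
    previous_ids, current_ids = set(previous_ids), set(current_ids)
    down = current_ids - previous_ids   # appeared: trigger point_down
    up = previous_ids - current_ids     # disappeared: trigger point_up
    return down, up

# Example: session 2 was lifted from the surface, session 4 was added.
down, up = diff_alive([1, 2, 3], [1, 3, 4])
assert down == {4} and up == {2}
\end{verbatim}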
- TUIO coordinates range from $0.0$ to $1.0$, with $(0.0, 0.0)$ being the left
- top corner of the touch surface and $(1.0, 1.0)$ the right bottom corner. The
- TUIO event driver scales these to pixel coordinates so that event area
- implementations can use pixel coordinates to determine whether an event is
- located within them. This transformation is also mentioned by the online
- TUIO specification \cite{TUIO_specification}:
- \begin{quote}
- In order to compute the X and Y coordinates for the 2D profiles a TUIO
- tracker implementation needs to divide these values by the actual sensor
- dimension, while a TUIO client implementation consequently can scale these
- values back to the actual screen dimension.
- \end{quote}
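In the event driver, this scaling is a simple multiplication by the screen
dimensions. The resolution below is an arbitrary example value.

\begin{verbatim}
SCREEN_WIDTH, SCREEN_HEIGHT = 1920, 1080   # example resolution

def tuio_to_pixels(x, y):
    # Scale normalized TUIO coordinates (0.0 - 1.0) to pixel coordinates.
    return x * SCREEN_WIDTH, y * SCREEN_HEIGHT

assert tuio_to_pixels(0.5, 0.5) == (960.0, 540.0)
\end{verbatim}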
- \newpage
- \section{Gesture tracker registration}
- \label{sec:tracker-registration}
- When a gesture handler is added to an event area by an application, the event
- area must create a gesture tracker that detects the corresponding gesture. To
- do this, the architecture must be aware of the existing gesture trackers and
- the gestures they support. The architecture provides a registration system for
- gesture trackers. Each gesture tracker implementation contains a list of
- supported gesture types. These gesture types are mapped to the gesture tracker
- class by the registration system. When an event area needs to create a gesture
tracker for a gesture type that is not yet being detected, the gesture
tracker class to instantiate is looked up in this map. Registration of a
gesture tracker is straightforward, as shown by the following Python
code:
\begin{verbatim}
from trackers import register_tracker

# Create a gesture tracker implementation
class TapTracker(GestureTracker):
    supported_gestures = ["tap", "single_tap", "double_tap"]

    # Methods for gesture detection go here

# Register the gesture tracker with the architecture
register_tracker(TapTracker)
\end{verbatim}
- \section{Basic tracker}
- \label{sec:basictracker}
- The ``basic tracker'' implementation exists only to provide access to low-level
- events in an application. Low-level events are only handled by gesture
- trackers, not by the application itself. Therefore, the basic tracker maps
- \emph{point\_\{down,move,up\}} events to equally named gestures that can be
- handled by the application.
- \section{Tap tracker}
- \label{sec:taptracker}
- The ``tap tracker'' detects three types of tap gestures:
- \begin{enumerate}
- \item The basic \emph{tap} gesture is triggered when a touch point releases
- the touch surface within a certain time and distance of its initial
- position. When a \emph{point\_down} event is received, its location is
- saved along with the current timestamp. On the next \emph{point\_up}
- event of the touch point, the difference in time and position with its
- saved values are compared to predefined thresholds to determine whether
- a \emph{tap} gesture should be triggered.
- \item A \emph{double tap} gesture consists of two sequential \emph{tap}
- gestures that are located within a certain distance of each other, and
- occur within a certain time window. When a \emph{tap} gesture is
- triggered, the tracker saves it as the ``last tap'' along with the
- current timestamp. When another \emph{tap} gesture is triggered, its
- location and the current timestamp are compared to those of the ``last
- tap'' gesture to determine whether a \emph{double tap} gesture should
- be triggered. If so, the gesture is triggered at the location of the
- ``last tap'', because the second tap may be less accurate.
- \item A separate thread handles detection of \emph{single tap} gestures at
- a rate of thirty times per second. When the time since the ``last tap''
- exceeds the maximum time between two taps of a \emph{double tap}
- gesture, a \emph{single tap} gesture is triggered.
- \end{enumerate}
- The \emph{single tap} gesture exists to be able to make a distinction between
- single and double tap gestures. This distinction is not possible with the
- regular \emph{tap} gesture, since the first \emph{tap} gesture has already been
- handled by the application when the second \emph{tap} of a \emph{double tap}
- gesture is triggered.
- \section{Transformation tracker}
- \label{sec:transformationtracker}
- The transformation tracker triggers \emph{rotate}, \emph{pinch}, \emph{drag}
- and \emph{flick} gestures. These gestures use the centroid of all touch points.
- A \emph{rotate} gesture uses the difference in angle relative to the centroid
- of all touch points, and \emph{pinch} uses the difference in distance. Both
- values are normalized using division by the number of touch points $N$. A
- \emph{pinch} gesture contains a scale factor, and therefore uses a division of
- the current by the previous average distance to the centroid. Any movement of
- the centroid is used for \emph{drag} gestures. When a dragged touch point is
- released, a \emph{flick} gesture is triggered in the direction of the
- \emph{drag} gesture.
- Figure \ref{fig:transformationtracker} shows an example situation in which a
- touch point is moved, triggering a \emph{pinch} gesture, a \emph{rotate}
- gesture and a \emph{drag} gesture.
- \transformationtracker
- The \emph{pinch} gesture in figure \ref{fig:pinchrotate} uses the ratio
- $d_2:d_1$ to calculate its $scale$ parameter. Note that the difference in
- distance $d_2 - d_1$ and the difference in angle $\alpha$ both relate to a
- single touch point. The \emph{pinch} and \emph{rotate} gestures that are
- triggered relate to all touch points, using the average of distances and
- angles. Since all except one of the touch points have not moved, their
- differences in distance and angle are zero. Thus, the averages can be
- calculated by dividing the differences in distance and angle of the moved touch
- point by the number of touch points $N$. The $scale$ parameter represents the
- scale relative to the previous situation, which results in the following
- formula:
- $$pinch.scale = \frac{d_1 + \frac{d_2 - d_1}{N}}{d_1}$$
- The angle used for the \emph{rotate} gesture is only divided by the number of
- touch points to obtain an average rotation of all touch points:
- $$rotate.angle = \frac{\alpha}{N}$$
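In code, the general idea of dividing the current by the previous average
distance to the centroid translates to something like the sketch below. The
actual transformation tracker also handles angles, event bookkeeping and the
centroid movement used for \emph{drag} gestures, which are omitted here.

\begin{verbatim}
import math

def centroid(points):
    n = float(len(points))
    return (sum(x for x, y in points) / n, sum(y for x, y in points) / n)

def average_distance(points, center):
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)

def pinch_scale(old_points, new_points):
    # Scale factor relative to the previous situation.
    old_d = average_distance(old_points, centroid(old_points))
    new_d = average_distance(new_points, centroid(new_points))
    return new_d / old_d

# Example: one of two touch points moves away from the other.
print(pinch_scale([(0.0, 0.0), (2.0, 0.0)],
                  [(0.0, 0.0), (4.0, 0.0)]))   # 2.0
\end{verbatim}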
- \chapter{Test applications}
- \label{chapter:test-applications}
- Two test case applications have been created to test if the design ``works'' in
- a practical application, and to detect its flaws. One application is mainly
- used to test the gesture tracker implementations. The second application uses
- multiple event areas in a tree structure, demonstrating event delegation and
- propagation. The second application also defines a custom gesture tracker.
- \section{Full screen Pygame application}
- %The goal of this application was to experiment with the TUIO
- %protocol, and to discover requirements for the architecture that was to be
- %designed. When the architecture design was completed, the application was rewritten
- %using the new architecture components. The original variant is still available
- %in the ``experimental'' folder of the Git repository \cite{gitrepos}.
- An implementation of the detection of some simple multi-touch gestures (single
- tap, double tap, rotation, pinch and drag) using Processing\footnote{Processing
- is a Java-based programming environment with an export possibility for Android.
- See also \cite{processing}.} can be found in a forum on the Processing website
- \cite{processingMT}. The application has been ported to Python and adapted to
- receive input from the TUIO protocol. The implementation is fairly simple, but
- it yields some appealing results (see figure \ref{fig:draw}). In the original
- application, the detection logic of all gestures is combined in a single class
- file. As predicted by the GART article \cite{GART}, this leads to over-complex
- code that is difficult to read and debug.
- The original application code consists of two main classes. The ``multi-touch
- server'' starts a ``TUIO server'' that translates TUIO events to
- ``point\_\{down,move,up\}'' events. Detection of ``tap'' and ``double tap''
- gestures is performed immediately after an event is received. Other gesture
- detection runs in a separate thread, using the following loop:
\begin{verbatim}
60 times per second do:
    detect `single tap' based on the time since the latest `tap' gesture

    if points have been moved, added or removed since last iteration do:
        calculate the centroid of all points
        detect `drag' using centroid movement
        detect `rotation' using average orientation of all points to centroid
        detect `pinch' using average distance of all points to centroid
\end{verbatim}
- There are two problems with the implementation described above. In the first
- place, low-level events are not grouped before gesture detection. The gesture
- detection uses all events for a single gesture. Therefore, only one element at
- a time can be rotated/resized etc. (see also section \ref{sec:areas}).
- Secondly, all detection code is located in the same class file. To extend the
- application with new gestures, a programmer must extend the code in this class
- file and therefore understand its structure. Since the main loop calls specific
- gesture detection components explicitly in a certain order, the programmer must
- alter the main loop to call custom gesture detection code. This is a problem
- because this way of extending code is not scalable over time. The class file
- would become more and more complex when extended with new gestures. The two
- problems have been solved using event areas and gesture trackers from the
- reference implementation. The gesture detection code has been separated into
- two different gesture trackers, which are the ``tap'' and ``transformation''
- trackers mentioned in chapter \ref{chapter:implementation}.
- The positions of all touch objects and their centroid are drawn using the
- Pygame library. Since the Pygame library does not provide support to find the
- location of the display window, the root event area captures events in the
- entire screen surface. The application can be run either full screen or in
- windowed mode. If windowed, screen-wide gesture coordinates are mapped to the
size of the Pygame window. In other words, the Pygame window always represents
- the entire touch surface. The output of the application can be seen in figure
- \ref{fig:draw}.
- \begin{figure}[h!]
- \center
- \includegraphics[scale=0.4]{data/pygame_draw.png}
- \caption{
- Output of the experimental drawing program. It draws all touch points
- and their centroid on the screen (the centroid is used for rotation and
- pinch detection). It also draws a green rectangle which responds to
- rotation and pinch events.
- }
- \label{fig:draw}
- \end{figure}
- \section{GTK+/Cairo application}
- \label{sec:testapp}
- The second test application uses the GIMP toolkit (GTK+) \cite{GTK} to create
- its user interface. The PyGTK library \cite{PyGTK} is used to address GTK+
- functions in the Python application. Since GTK+ defines a main event loop that
- is started in order to use the interface, the architecture implementation runs
- in a separate thread.
- The application creates a main window, whose size and position are synchronized
- with the root event area of the architecture. The synchronization is handled
- automatically by a \texttt{GtkEventWindow} object, which is a subclass of
- \texttt{gtk.Window}. This object serves as a layer that connects the event area
- functionality of the architecture to GTK+ windows. The following Python code
- captures the essence of the synchronization layer:
\begin{verbatim}
class GtkEventWindow(Window):
    def __init__(self, width, height):
        Window.__init__(self)

        # Create an event area to represent the GTK window in the gesture
        # detection architecture
        self.area = RectangularArea(0, 0, width, height)

        # The "configure-event" signal is triggered by GTK when the position
        # or size of the window are updated
        self.connect("configure-event", self.sync_area)

    def sync_area(self, win, event):
        # Synchronize the position and size of the event area with that of
        # the GTK window
        self.area.width = event.width
        self.area.height = event.height
        self.area.set_position(*event.get_coords())
\end{verbatim}
- The application window contains a number of polygons which can be dragged,
- resized and rotated. Each polygon is represented by a separate event area to
- allow simultaneous interaction with different polygons. The main window also
- responds to transformation, by transforming all polygons. Additionally, tapping
- on a polygon changes its color. Double tapping on the application window
toggles its mode between full screen and windowed.
- An ``overlay'' event area is used to detect all fingers currently touching the
- screen. The application defines a custom gesture tracker, called the ``hand
- tracker'', which is used by the overlay. The hand tracker uses distances
- between detected fingers to detect which fingers belong to the same hand (see
- section \ref{sec:handtracker} for details). The application draws a line from
- each finger to the hand it belongs to, as visible in figure \ref{fig:testapp}.
- \begin{figure}[h!]
- \center
- \includegraphics[scale=0.35]{data/testapp.png}
- \caption{
- Screenshot of the second test application. Two polygons can be dragged,
- rotated and scaled. Separate groups of fingers are recognized as hands,
- each hand is drawn as a centroid with a line to each finger.
- }
- \label{fig:testapp}
- \end{figure}
- To manage the propagation of events used for transformations and tapping, the
- application arranges its event areas in a tree structure as described in
- section \ref{sec:tree}. Each transformable event area has its own
- ``transformation tracker'', which stops the propagation of events used for
- transformation gestures. Because the propagation of these events is stopped,
- overlapping polygons do not cause a problem. Figure \ref{fig:testappdiagram}
- shows the tree structure used by the application.
- Note that the overlay event area, though covering the entire screen surface, is
- not used as the root of the event area tree. Instead, the overlay is placed on
- top of the application window (being a rightmost sibling of the application
- window event area in the tree). This is necessary, because the transformation
- trackers in the application window stop the propagation of events. The hand
tracker needs to capture all events to be able to give an accurate
representation of all fingers touching the screen. Therefore, the overlay
should delegate events to the hand tracker before they are stopped by a
transformation tracker. Placing the overlay over the application window forces
the screen event area to delegate events to the overlay event area first. The
event area implementation delegates events to its children in right-to-left
order, because areas that are added to the tree later are assumed to be
- positioned over their previously added siblings.
- \testappdiagram
- \subsection{Hand tracker}
- \label{sec:handtracker}
- The hand tracker sees each touch point as a finger. Based on a predefined
- distance threshold, each finger is assigned to a hand. Each hand consists of a
- list of finger locations, and the centroid of those locations.
- When a new finger is detected on the touch surface (a \emph{point\_down} event),
- the distance from that finger to all hand centroids is calculated. The hand to
- which the distance is the shortest may be the hand that the finger belongs to.
If the distance is larger than the predefined distance threshold, the finger is
assumed to belong to a new hand and a \emph{hand\_down} gesture is triggered.
Otherwise,
- the finger is assigned to the closest hand. In both cases, a
- \emph{finger\_down} gesture is triggered.
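The core of this step is a nearest-centroid search with a distance threshold,
as in the sketch below. The threshold value and the representation of a hand
as a list of finger positions are illustrative choices, not those of the test
application.

\begin{verbatim}
import math

HAND_DISTANCE_THRESHOLD = 200   # illustrative value, in pixels

def assign_finger(finger, hands):
    # Assign a new finger (x, y) to the closest hand, or start a new hand.
    best_hand, best_distance = None, None
    for hand in hands:
        cx = sum(x for x, y in hand) / len(hand)
        cy = sum(y for x, y in hand) / len(hand)
        distance = math.hypot(finger[0] - cx, finger[1] - cy)
        if best_distance is None or distance < best_distance:
            best_hand, best_distance = hand, distance

    if best_hand is None or best_distance > HAND_DISTANCE_THRESHOLD:
        hands.append([finger])     # new hand: would trigger 'hand_down'
    else:
        best_hand.append(finger)   # existing hand: only 'finger_down'

hands = []
assign_finger((100, 100), hands)   # first hand
assign_finger((120, 110), hands)   # close enough: same hand
assign_finger((800, 600), hands)   # far away: a second hand
assert len(hands) == 2
\end{verbatim}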
- Each touch point is assigned an ID by the reference implementation. When the
- hand tracker assigns a finger to a hand after a \emph{point\_down} event, its
- touch point ID is saved in a hash map\footnote{In computer science, a hash
- table or hash map is a data structure that uses a hash function to map
- identifying values, known as keys (e.g., a person's name), to their associated
- values (e.g., their telephone number). Source: Wikipedia \cite{wikihashmap}.}
- with the \texttt{Hand} object. When a finger moves (a \emph{point\_move} event)
or releases the touch surface (\emph{point\_up}), the corresponding hand is
- loaded from the hash map and triggers a \emph{finger\_move} or
- \emph{finger\_up} gesture. If a released finger is the last of a hand, that
- hand is removed with a \emph{hand\_up} gesture.
- \section{Results}
- \label{sec:results}
The Pygame application is based on existing program code, which has been
broken up into the components of the architecture. The application incorporates
- the most common multi-touch gestures, such as tapping and transformation
- gestures. All features from the original application are still supported in the
revised application, so the component-based architecture design does not
impose a limiting factor. On the contrary, the program code has become more
maintainable and extensible due to the modular setup. The gesture tracker-based
- design has even allowed the detection of tap and transformation gestures to be
- moved to the reference implementation of the architecture, whereas it was
- originally part of the test application.
- The GTK+ application uses a more extended tree structure to arrange its event
- areas, so that it can use the powerful concept of event propagation. The
- application does show that the construction of such a tree is not always
straightforward: the ``overlay'' event area covers the entire touch surface,
- but is not the root of the tree. Designing the tree structure requires an
- understanding of event propagation by the application developer.
- Some work goes into the synchronization of application widgets with their event
- areas. The GTK+ application defines a class that acts as a synchronization
- layer between the application window and its event area in the architecture.
- This synchronization layer could be used in other applications that use GTK+.
The ``hand tracker'' used by the GTK+ application is not incorporated within
the architecture itself. Nevertheless, the gesture tracker registration system
allows the application to add this new set of gestures using a single line of
code (see section \ref{sec:tracker-registration}).
Apart from the synchronization of event areas with application widgets,
neither application has any trouble using the architecture implementation in
combination with its application framework. Thus, the architecture can be
used alongside existing application frameworks.
- \chapter{Conclusions}
- \label{chapter:conclusions}
- To support different devices, there must be an abstraction of device drivers so
- that gesture detection can be performed on a common set of low-level events.
- This abstraction is provided by the event driver.
- Gestures must be able to occur within a certain area of a touch surface that is
- covered by an application widget. Therefore, low-level events must be divided
- into separate groups before any gesture detection is performed. Event areas
- provide a way to accomplish this. Overlapping event areas are ordered in a tree
- structure that can be synchronized with the widget tree of the application.
- Some applications require the ability to handle an event exclusively for an
- event area. An event propagation mechanism provides a solution for this: the
- propagation of an event in the tree structure can be stopped after gesture
detection in an event area.
- Section \ref{sec:testapp} shows that the structure of the event area tree is
- not necessarily equal to that of the application widget tree. The design of the
- event area tree structure in complex situations requires an understanding of
- event propagation by the application programmer.
- The detection of complex gestures can be approached in several ways. If
explicit detection code for different gestures is not managed well, program code
- can become needlessly complex. A tracker-based design, in which the detection
- of different types of gesture is separated into different gesture trackers,
- reduces complexity and provides a way to extend a set of detection algorithms.
- The use of gesture trackers is flexible, e.g. complex detection algorithms such
- as machine learning can be used simultaneously with other gesture trackers that
- use explicit detection code. Also, the modularity of this design allows
- extension of the set of supported gestures. Section \ref{sec:testapp}
demonstrates this extensibility.
A truly generic architecture should provide a communication interface that
supports multiple programming languages. A daemon implementation as
described in section \ref{sec:daemon} is an example of such an interface. With
- this feature, the architecture can be used in combination with a wide range of
- application frameworks.
- \chapter{Suggestions for future work}
- \label{chapter:futurework}
- \section{A generic method for grouping events}
- \label{sec:eventfilter}
- As mentioned in section \ref{sec:areas}, the concept of an event area is based
- on the assumption that the set of originating events that form a particular
gesture can be determined exclusively based on the location of the events.
- Since this thesis focuses on multi-touch surface based devices, and every
- object on a multi-touch surface has a position, this assumption is valid.
However, the design of the architecture is meant to be more generic: to provide
a structured design for managing gesture detection.
- An in-air gesture detection device, such as the Microsoft Kinect \cite{kinect},
- provides 3D positions. Some multi-touch tables work with a camera that can also
- determine the shape and rotational orientation of objects touching the surface.
- For these devices, events delegated by the event driver have more parameters
- than a 2D position alone. The term ``event area'' is not suitable to describe a
- group of events that consist of these parameters.
- A more generic term for a component that groups similar events is an
- \emph{event filter}. The concept of an event filter is based on the same
- principle as event areas, which is the assumption that gestures are formed from
- a subset of all low-level events. However, an event filter takes all parameters
- of an event into account. An application on the camera-based multi-touch table
- could be to group all objects that are triangular into one filter, and all
- rectangular objects into another. Or, to separate small finger tips from large
- ones to be able to recognize whether a child or an adult touches the table.
- \section{Using a state machine for gesture detection}
- All gesture trackers in the reference implementation are based on the explicit
- analysis of events. Gesture detection is a widely researched subject, and the
- separation of detection logic into different trackers allows for multiple types
- of gesture detection in the same architecture. An interesting question is
- whether multi-touch gestures can be described in a formal, generic way so that
- explicit detection code can be avoided.
- \cite{GART} and \cite{conf/gw/RigollKE97} propose the use of machine learning
- to recognize gestures. To use machine learning, a set of input events forming a
- particular gesture must be represented as a feature vector. A learning set
containing a set of feature vectors that represent some gesture ``teaches'' the
machine what the feature vector of that gesture looks like.
- An advantage of using explicit gesture detection code is the fact that it
- provides a flexible way to specify the characteristics of a gesture, whereas
- the performance of feature vector-based machine learning is dependent on the
- quality of the learning set.
- A better method to describe a gesture might be to specify its features as a
``signature''. The parameters of such a signature must be based on low-level
- events. When a set of input events matches the signature of some gesture, the
- gesture can be triggered. A gesture signature should be a complete description
- of all requirements the set of events must meet to form the gesture.
A way to describe signatures on a multi-touch surface is to use a state
machine for its touch objects. The states of a simple touch point could be
$\{down, move, hold, up\}$ to indicate respectively that a point is put down, is
- being moved, is held on a position for some time, and is released. In this
- case, a ``drag'' gesture can be described by the sequence $down - move - up$
- and a ``select'' gesture by the sequence $down - hold$. If the set of states is
- not sufficient to describe a desired gesture, a developer can add additional
- states. For example, to be able to make a distinction between an element being
- ``dragged'' or ``thrown'' in some direction on the screen, two additional
states can be added: $\{start, stop\}$ to indicate that a point starts and stops
- moving. The resulting state transitions are sequences $down - start - move -
- stop - up$ and $down - start - move - up$ (the latter does not include a $stop$
to indicate that the element must keep moving after the gesture has been
performed). The two sequences describe a ``drag'' gesture and a ``flick''
gesture, respectively.
- An additional way to describe even more complex gestures is to use other
- gestures in a signature. An example is to combine $select - drag$ to specify
- that an element must be selected before it can be dragged.
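To make the idea of a gesture signature more concrete, the sketch below
matches the observed state sequence of a touch point against a set of
signatures. This is only an illustration of the proposal, not an existing
implementation.

\begin{verbatim}
# Hypothetical gesture signatures, expressed as state sequences.
SIGNATURES = {
    "drag":   ["down", "start", "move", "stop", "up"],
    "flick":  ["down", "start", "move", "up"],
    "select": ["down", "hold"],
}

def match_gesture(observed_states):
    # Return the gesture whose signature equals the observed sequence.
    for name, signature in SIGNATURES.items():
        if observed_states == signature:
            return name
    return None

assert match_gesture(["down", "start", "move", "up"]) == "flick"
assert match_gesture(["down", "hold"]) == "select"
\end{verbatim}

A real implementation would have to match signatures incrementally while
events arrive, and combine the state machines of multiple touch points, which
is part of the proposed future work.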
- The application of a state machine to describe multi-touch gestures is a
- subject well worth exploring in the future.
- \section{Daemon implementation}
- Section \ref{sec:daemon} proposes the use of a network protocol to communicate
- between an architecture implementation and (multiple) gesture-based
- applications, as illustrated in figure \ref{fig:daemon}. The reference
- implementation does not support network communication. If the architecture
- design is to become successful in the future, the implementation of network
- communication is a must. ZeroMQ (or $\emptyset$MQ) \cite{ZeroMQ} is a
- high-performance software library with support for a wide range of programming
- languages. A future implementation can use this library as the basis for its
- communication layer.
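As an indication of what such a communication layer could look like, the
sketch below publishes detected gestures over a ZeroMQ PUB socket to which
client applications can subscribe. The port number and message format are
arbitrary choices for this example.

\begin{verbatim}
import zmq

context = zmq.Context()
socket = context.socket(zmq.PUB)
socket.bind("tcp://*:5555")   # arbitrary port for this sketch

def publish_gesture(gesture_type, x, y):
    # Connected clients receive JSON-encoded gesture messages.
    socket.send_json({"type": gesture_type, "x": x, "y": y})

publish_gesture("tap", 120, 80)
\end{verbatim}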
- Ideally, a user can install a daemon process containing the architecture so
- that it is usable for any gesture-based application on the device. Applications
- that use the architecture can specify it as being a software dependency, or
- include it in a software distribution.
- If a final implementation of the architecture is ever released, a good idea
- would be to do so within a community of application developers. A community can
- contribute to a central database of gesture trackers, making the interaction
- from their applications available for use in other applications.
- \bibliographystyle{plain}
- \bibliography{report}{}
- \end{document}
|