- \documentclass[twoside,openright]{uva-bachelor-thesis}
- \usepackage[english]{babel}
- \usepackage[utf8]{inputenc}
- \usepackage{hyperref,graphicx,tikz,subfigure,float}
- % Link colors
- \hypersetup{colorlinks=true,linkcolor=black,urlcolor=blue,citecolor=DarkGreen}
- % Title Page
- \title{A generic architecture for gesture-based interaction}
- \author{Taddeüs Kroes}
- \supervisors{Dr. Robert G. Belleman (UvA)}
- \signedby{Dr. Robert G. Belleman (UvA)}
- \begin{document}
- % Title page
- \maketitle
- \begin{abstract}
- Applications that use complex gesture-based interaction need to translate
- primitive messages from low-level device drivers to complex, high-level
- gestures, and map these gestures to elements in an application. This report
- presents a generic architecture for the detection of complex gestures in an
- application. The architecture translates device driver messages to a common
- set of ``events''. The events are then delegated to a tree of ``event
- areas'', which are used to separate groups of events and assign these
- groups to an element in the application. Gesture detection is performed on
- a group of events assigned to an event area, using detection units called
- ``gesture trackers''. An implementation of the architecture as a daemon
- process would be capable of serving gestures to multiple applications at
- the same time. A reference implementation and two test case applications
- have been created to test the effectiveness of the architecture design.
- \end{abstract}
- % Set paragraph indentation
- \parindent 0pt
- \parskip 1.5ex plus 0.5ex minus 0.2ex
- % Table of content on separate page
- \tableofcontents
- \chapter{Introduction}
- \label{chapter:introduction}
- Surface-touch devices have evolved from pen-based tablets to single-touch
- trackpads, to multi-touch devices like smartphones and tablets. Multi-touch
- devices enable a user to interact with software using hand gestures, making the
- interaction more expressive and intuitive. These gestures are more complex than
- primitive ``click'' or ``tap'' events that are used by single-touch devices.
- Some examples of more complex gestures are ``pinch''\footnote{A ``pinch''
- gesture is formed by performing a pinching movement with multiple fingers on a
- multi-touch surface. Pinch gestures are often used to zoom in or out on an
- object.} and ``flick''\footnote{A ``flick'' gesture is the act of grabbing an
- object and throwing it in a direction on a touch surface, giving it momentum to
- move for some time after the hand releases the surface.} gestures.
- The complexity of gestures is not limited to navigation in smartphones. Some
- multi-touch devices are already capable of recognizing objects touching the
- screen \cite[Microsoft Surface]{mssurface}. In the near future, touch screens
- will possibly be extended or even replaced with in-air interaction (Microsoft's
- Kinect \cite{kinect} and the Leap \cite{leap}).
- The interaction devices mentioned above generate primitive events. In the case
- of surface-touch devices, these are \emph{down}, \emph{move} and \emph{up}
- events. Application programmers who want to incorporate complex, intuitive
- gestures in their application face the challenge of interpreting these
- primitive events as gestures. With the increasing complexity of gestures, the
- complexity of the logic required to detect these gestures increases as well.
- This challenge limits, or even deters, application developers from using complex
- gestures in their applications.
- The main question in this research project is whether a generic architecture
- for the detection of complex interaction gestures can be designed, with the
- capability of managing the complexity of gesture detection logic. The ultimate
- goal would be to create an implementation of this architecture that can be
- extended to support a wide range of complex gestures. With the existence of
- such an implementation, application developers do not need to reinvent gesture
- detection for every new gesture-based application.
- \section{Contents of this document}
- The scope of this thesis is limited to the detection of gestures on
- multi-touch surface devices. It presents a design for a generic gesture
- detection architecture for use in multi-touch based applications. A
- reference implementation of this design is used in some test case
- applications, whose purpose is to test the effectiveness of the design and
- detect its shortcomings.
- Chapter \ref{chapter:related} describes related work that inspired a design
- for the architecture. The design is presented in chapter
- \ref{chapter:design}. Chapter \ref{chapter:testapps} presents a reference
- implementation of the architecture, and two test case applications that
- show the practical use of its components as presented in chapter
- \ref{chapter:design}. Finally, some suggestions for future research on the
- subject are given in chapter \ref{chapter:futurework}.
- \chapter{Related work}
- \label{chapter:related}
- Applications that use gesture-based interaction need a graphical user
- interface (GUI) on which gestures can be performed. The creation of a GUI
- is a platform-specific task. For instance, Windows and Linux support
- different window managers. To create a window in a platform-independent
- application, the application would need to include separate functionalities
- for supported platforms. For this reason, GUI-based applications are often
- built on top of an application framework that abstracts platform-specific
- tasks. Frameworks often include a set of tools and events that help the
- developer to easily build advanced GUI widgets.
- % Existing frameworks (and why they're not good enough)
- Some frameworks, such as Nokia's Qt \cite{qt}, provide support for basic
- multi-touch gestures like tapping, rotation or pinching. However, the
- detection of gestures is embedded in the framework code in an inseparable
- way. Consequently, an application developer who wants to use multi-touch
- interaction in an application is forced to use an application framework
- that includes support for those multi-touch gestures that are required by
- the application. Kivy \cite{kivy} is a GUI framework for Python
- applications, with support for multi-touch gestures. It uses a basic
- gesture detection algorithm that allows developers to define custom
- gestures to some degree \cite{kivygesture} using a set of touch point
- coordinates. However, these frameworks do not provide support for extension
- with custom complex gestures.
- Many frameworks are also device-specific, meaning that they are developed
- for use on either a tablet, smartphone, PC or other device. OpenNI
- \cite{OpenNI2010}, for example, provides APIs only for natural interaction
- (NI) devices such as webcams and microphones. The concept of complex
- gesture-based interaction, however, is applicable to a much wider set of
- devices. VRPN \cite{VRPN} provides a software library that abstracts the
- output of devices, which enables it to support a wide set of devices used
- in Virtual Reality (VR) interaction. The framework makes the low-level
- events of these devices accessible in a client application using network
- communication. Gesture detection is not included in VRPN.
- % Methods of gesture detection
- The detection of high-level gestures from low-level events can be
- approached in several ways. GART \cite{GART} is a toolkit for the
- development of gesture-based applications, which states that the best way
- to classify gestures is to use machine learning. The programmer trains an
- application to recognize gestures using a machine learning library from the
- toolkit. Though multi-touch input is not directly supported by the toolkit,
- the level of abstraction does allow for it to be implemented in the form of
- a ``touch'' sensor. The reason to use machine learning is that gesture
- detection ``is likely to become increasingly complex and unmanageable''
- when using a predefined set of rules to detect whether some sensor input
- can be classified as a specific gesture.
- The alternative to machine learning is to use a predefined set of rules
- for each gesture. Manoj Kumar \cite{win7touch} presents a Windows 7
- application, written in Microsoft's .NET, which detects a set of basic
- directional gestures based on the movement of a stylus. The complexity of
- the code is managed by the separation of different gesture types in
- different detection units called ``gesture trackers''. The application
- shows that predefined gesture detection rules do not necessarily produce
- unmanageable code.
- \section{Analysis of related work}
- Implementations for the support of complex gesture based interaction do
- already exist. However, gesture detection in these implementations is
- device-specific (Nokia Qt and OpenNI) or limited to use within an
- application framework (Kivy).
- An abstraction of device output allows VRPN and GART to support multiple
- devices. However, VRPN does not incorporate gesture detection. GART does,
- but only in the form of machine learning algorithms. Many applications for
- mobile phones and tablets only use simple gestures such as taps. For this
- category of applications, machine learning is an excessively complex method
- of gesture detection. Manoj Kumar shows that when managed well, a
- predefined set of gesture detection rules is sufficient to detect simple
- gestures.
- This thesis explores the possibility of creating an architecture that
- combines support for multiple input devices with different methods of
- gesture detection.
- \chapter{Design}
- \label{chapter:design}
- % Diagrams are defined in a separate file
- \input{data/diagrams}
- \section{Introduction}
- Application frameworks are a necessity when it comes to fast,
- cross-platform development. A generic architecture design should aim to be
- compatible with existing frameworks, and provide a way to detect and extend
- gestures independent of the framework. Since an application framework is
- written in a specific programming language, the architecture should be
- accessible for applications using a language-independent method of
- communication. This intention leads towards the concept of a dedicated
- gesture detection application that serves gestures to multiple applications
- at the same time.
- This chapter describes a design for such an architecture. The architecture
- components are shown in figure \ref{fig:fulldiagram}. Sections
- \ref{sec:multipledrivers} to \ref{sec:daemon} explain the use of all
- components in detail.
- \fulldiagram
- \newpage
- \section{Supporting multiple drivers}
- \label{sec:multipledrivers}
- The TUIO protocol \cite{TUIO} is an example of a driver that can be used by
- multi-touch devices. TUIO uses ALIVE and SET messages to communicate
- low-level touch events (see appendix \ref{app:tuio} for more details).
- These messages are specific to the API of the TUIO protocol. Other drivers
- may use different message types. To support more than one driver in the
- architecture, there must be some translation from device-specific messages
- to a common format for primitive touch events. After all, the gesture
- detection logic in a ``generic'' architecture should not be implemented
- based on device-specific messages. The event types in this format should be
- chosen so that multiple drivers can trigger the same events. If each
- supported driver were to add its own set of event types to the common format,
- the purpose of it being ``common'' would be defeated.
- A minimal expectation for a touch device driver is that it detects simple
- touch points, with a ``point'' being an object at an $(x, y)$ position on
- the touch surface. This yields a basic set of events: $\{point\_down,
- point\_move, point\_up\}$.
- The TUIO protocol supports fiducials\footnote{A fiducial is a pattern used
- by some touch devices to identify objects.}, which also have a rotational
- property. This results in a more extended set: $\{point\_down, point\_move,
- point\_up, object\_down, object\_move, object\_up,\\ object\_rotate\}$.
- Due to their generic nature, the use of these events is not limited to the
- TUIO protocol. Another driver that can distinguish rotated objects from
- simple touch points could also trigger them.
- The component that translates device-specific messages to common events
- will be called the \emph{event driver}. The event driver runs in a loop,
- receiving and analyzing driver messages. When a sequence of messages is
- analyzed as an event, the event driver delegates the event to other
- components in the architecture for translation to gestures.
- Support for a new touch driver can be added by writing a corresponding event
- driver implementation. The choice of event driver used in an application
- depends on the driver support of the touch device being used.
- Because driver implementations have a common output format in the form of
- events, multiple event drivers can be used at the same time (see figure
- \ref{fig:multipledrivers}). This design feature allows low-level events
- from multiple devices to be aggregated into high-level gestures.
- \multipledriversdiagram
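- To illustrate the role of an event driver, the sketch below shows how a
- driver-specific implementation might translate incoming messages to the common
- event set. All class and attribute names are hypothetical assumptions and do
- not refer to the reference implementation.
- \begin{verbatim}
- # Minimal sketch of an event driver; all names are illustrative only.
- class Event(object):
-     def __init__(self, event_type, x, y):
-         self.type = event_type  # e.g. 'point_down', 'point_move', 'point_up'
-         self.x = x
-         self.y = y
- class EventDriver(object):
-     """Translates driver-specific messages to common events."""
-     def __init__(self, delegate):
-         # 'delegate' is a callable that passes events on to the rest of
-         # the architecture (the event area tree).
-         self.delegate = delegate
-     def receive(self, message):
-         raise NotImplementedError
- class ExampleTouchDriver(EventDriver):
-     """Driver for an imaginary device that reports (kind, x, y) tuples."""
-     def receive(self, message):
-         kind, x, y = message  # kind is 'down', 'move' or 'up'
-         self.delegate(Event('point_' + kind, x, y))
- \end{verbatim}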
- \section{Event areas: connecting gesture events to widgets}
- \label{sec:areas}
- Touch input devices are unaware of the graphical input
- widgets\footnote{``Widget'' is a name commonly used to identify an element
- of a graphical user interface (GUI).} rendered by an application, and
- therefore generate events that simply identify the screen location at which
- an event takes place. User interfaces of applications that do not run in
- full screen mode are contained in a window. Events which occur outside the
- application window should not be handled by the application in most cases.
- What's more, a widget within the application window itself should be able
- to respond to different gestures. For example, a button widget may respond to a
- ``tap'' gesture to be activated, whereas the application window responds to
- a ``pinch'' gesture to be resized. In order to be able to direct a gesture
- to a particular widget in an application, a gesture must be restricted to
- the area of the screen covered by that widget. An important question is whether
- the architecture should offer a solution to this problem, or leave the task
- of assigning gestures to application widgets to the application developer.
- If the architecture does not provide a solution, the ``gesture detection''
- component in figure \ref{fig:fulldiagram} receives all events that occur on
- the screen surface. The gesture detection logic thus uses all events as
- input to detect a gesture. This leaves no possibility for a gesture to
- occur at multiple screen positions at the same time. The problem is
- illustrated in figure \ref{fig:ex1}, where two widgets on the screen can be
- rotated independently. The component that detects rotation gestures
- receives all four fingers as input. If the two groups of
- finger events are not separated by cluster detection, only one rotation
- event will occur.
- \examplefigureone
- A gesture detection component could perform heuristic cluster
- detection based on the distance between events. However, this method cannot
- guarantee that a cluster of events corresponds with a particular
- application widget. In short, a gesture detection component is difficult to
- implement without awareness of the location of application widgets.
- Secondly, the application developer still needs to direct gestures to a
- particular widget manually. This requires geometric calculations in the
- application logic, which is a tedious and error-prone task for the
- developer.
- The architecture described here groups events that occur inside the area
- covered by a widget, before passing them on to a gesture detection
- component. Different gesture detection components can then detect gestures
- simultaneously, based on different sets of input events. An area of the
- screen surface is represented by an \emph{event area}. An event area
- filters input events based on their location, and then delegates events to
- gesture detection components that are assigned to the event area. Events
- which are located outside the event area are not delegated to its gesture
- detection components.
- In the example of figure \ref{fig:ex1}, the two rotatable widgets can be
- represented by two event areas, each having a different rotation detection
- component. Each event area can be defined by the four corner locations of the
- square it represents. To detect whether an event is located inside a
- square, the event areas use a point-in-polygon (PIP) test \cite{PIP}. It is
- the task of the client application to update the corner locations of the
- event area with those of the widget.
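- As an illustration, the sketch below outlines a polygonal event area that
- filters events with a ray-casting point-in-polygon test and delegates accepted
- events to its gesture detection components. The names are hypothetical; the
- reference implementation may differ.
- \begin{verbatim}
- # Sketch of a polygonal event area (illustrative names).
- class PolygonArea(object):
-     def __init__(self, corners):
-         self.corners = corners  # list of (x, y) corner coordinates
-         self.trackers = []      # assigned gesture detection components
-     def contains(self, x, y):
-         # Ray casting point-in-polygon test: count edge crossings of a
-         # horizontal ray starting at (x, y).
-         inside = False
-         n = len(self.corners)
-         for i in range(n):
-             x1, y1 = self.corners[i]
-             x2, y2 = self.corners[(i + 1) % n]
-             if (y1 > y) != (y2 > y):
-                 if x < (x2 - x1) * float(y - y1) / (y2 - y1) + x1:
-                     inside = not inside
-         return inside
-     def delegate(self, event):
-         # Only events inside the polygon reach the gesture trackers.
-         if self.contains(event.x, event.y):
-             for tracker in self.trackers:
-                 tracker.handle(event)
- \end{verbatim}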
- \subsection{Callback mechanism}
- When a gesture is detected by a gesture detection component, it must be
- handled by the client application. A common way to handle events in an
- application is a ``callback'' mechanism: the application developer binds a
- function to an event, that is called when the event occurs. Because of the
- familiarity of this concept with developers, the architecture uses a
- callback mechanism to handle gestures in an application. Callback handlers
- are bound to event areas, since event areas control the grouping of
- events and thus the occurrence of gestures in an area of the screen.
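- A minimal sketch of such a callback mechanism, with hypothetical names, could
- look as follows.
- \begin{verbatim}
- # Sketch of gesture callbacks bound to an event area (illustrative).
- class EventArea(object):
-     def __init__(self):
-         self.handlers = {}  # gesture type -> list of callback functions
-     def bind(self, gesture_type, callback):
-         self.handlers.setdefault(gesture_type, []).append(callback)
-     def trigger(self, gesture):
-         # Called by a gesture tracker when it has detected a gesture.
-         for callback in self.handlers.get(gesture.type, []):
-             callback(gesture)
- # In the application:
- def on_tap(gesture):
-     pass  # e.g. activate the button that the area represents
- # button_area.bind('tap', on_tap)
- \end{verbatim}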
- \subsection{Area tree}
- \label{sec:tree}
- A basic usage of event areas in the architecture would be a list of event
- areas. When the event driver delegates an event, it is accepted by each
- event area that contains the event coordinates.
- If the architecture were to be used in combination with an application
- framework, each widget that responds to gestures should have a mirroring
- event area that synchronizes its location with that of the widget. Consider
- a panel with five buttons that all listen to a ``tap'' event. If the
- location of the panel changes as a result of movement of the application
- window, the positions of all buttons have to be updated too.
- This process is simplified by the arrangement of event areas in a tree
- structure. A root event area represents the panel, containing five other
- event areas which are positioned relative to the root area. The relative
- positions do not need to be updated when the panel area changes its
- position. GUI frameworks use this kind of tree structure to manage
- graphical widgets.
- If the GUI toolkit provides an API for requesting the position and size of
- a widget, a recommended first step when developing an application is to
- create a subclass of the event area that automatically synchronizes with the
- position of a widget from the GUI framework. For example, the test
- application described in section \ref{sec:testapp} extends the GTK
- \cite{GTK} application window widget with the functionality of a
- rectangular event area, to direct touch events to an application window.
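- The sketch below illustrates how relative positioning in an event area tree
- avoids updating every child when a parent area moves; the names are
- hypothetical, not those of the reference implementation.
- \begin{verbatim}
- # Sketch of relative positioning in an event area tree (illustrative).
- class RectangleArea(object):
-     def __init__(self, x, y, width, height, parent=None):
-         self.x, self.y = x, y  # position relative to the parent area
-         self.width, self.height = width, height
-         self.parent = parent
-         self.children = []
-         if parent is not None:
-             parent.children.append(self)
-     def absolute_position(self):
-         if self.parent is None:
-             return self.x, self.y
-         px, py = self.parent.absolute_position()
-         return px + self.x, py + self.y
- # panel = RectangleArea(100, 100, 300, 60)
- # button = RectangleArea(10, 10, 50, 40, parent=panel)
- # Moving the panel implicitly moves the button:
- # panel.x, panel.y = 200, 150
- \end{verbatim}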
- \subsection{Event propagation}
- \label{sec:eventpropagation}
- Another problem occurs when event areas overlap, as shown by figure
- \ref{fig:eventpropagation}. When the white square is dragged, the gray
- square should stay at its current position. This means that events that are
- used for dragging of the white square, should not be used for dragging of
- the gray square. The use of event areas alone does not provide a solution
- here, since both the gray and the white event area accept an event that
- occurs within the white square.
- The problem described above is a common problem in GUI applications, and
- there is a common solution (used by GTK \cite{gtkeventpropagation}, among
- others). An event is passed to an ``event handler''. If the handler returns
- \texttt{true}, the event is considered ``handled'' and is not
- ``propagated'' to other widgets. Applied to the example of the draggable
- squares, the drag detection component of the white square should stop
- the propagation of events to the event area of the gray square.
- In the example, dragging of the white square has priority over dragging of
- the gray square because the white area is the widget actually being touched
- at the screen surface. In general, events should be delegated to event
- areas according to the order in which the event areas are positioned over
- each other. The tree structure in which event areas are arranged, is an
- ideal tool to determine the order in which an event is delegated. An
- object touching the screen is essentially touching the deepest event area
- in the tree that contains the triggered event, which must be the first to
- receive the event. When the gesture trackers of the event area are
- finished with the event, it is propagated to the siblings and parent in the
- event area tree. Optionally, a gesture tracker can stop the propagation of
- the event by its corresponding event area. Figure
- \ref{fig:eventpropagation} demonstrates event propagation in the example of
- the draggable squares.
- \eventpropagationfigure
- An additional type of event propagation is ``immediate propagation'', which
- indicates propagation of an event from one gesture detection component to
- another. This is applicable when an event area uses more than one gesture
- detection component. When regular propagation is stopped, the event is
- propagated to other gesture detection components first, before actually
- being stopped. One of the components can also stop the immediate
- propagation of an event, so that the event is not passed to the next
- gesture detection component, nor to the ancestors of the event area.
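- The following sketch outlines how delegation, propagation and immediate
- propagation could be combined in the area tree. The return values and names
- are illustrative assumptions, not the reference implementation.
- \begin{verbatim}
- # Sketch of event delegation with propagation control (illustrative).
- class Area(object):
-     def __init__(self):
-         self.children = []  # ordered topmost (deepest) area first
-         self.trackers = []
-     def contains(self, event):
-         raise NotImplementedError  # e.g. a point-in-polygon test
-     def delegate(self, event):
-         """Return True if propagation of the event must stop."""
-         # The deepest containing area receives the event first.
-         for child in self.children:
-             if child.contains(event) and child.delegate(event):
-                 return True
-         stop = False
-         for tracker in self.trackers:
-             result = tracker.handle(event)
-             if result == 'stop_immediate':
-                 # Skip the remaining trackers and the ancestor areas.
-                 return True
-             if result == 'stop':
-                 # Finish the other trackers, then stop upward propagation.
-                 stop = True
-         return stop
- \end{verbatim}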
- The concept of an event area is based on the assumption that the set of
- originating events that form a particular gesture can be determined based
- exclusively on the location of the events. This is a reasonable assumption
- for simple touch objects whose only parameter is a position, such as a pen
- or a human finger. However, more complex touch objects can have additional
- parameters, such as rotational orientation or color. An even more generic
- concept is the \emph{event filter}, which detects whether an event should
- be assigned to a particular gesture detection component based on all
- available parameters. This level of abstraction provides additional methods
- of interaction. For example, a camera-based multi-touch surface could make
- a distinction between gestures performed with a blue gloved hand, and
- gestures performed with a green gloved hand.
- As mentioned in the introduction (chapter \ref{chapter:introduction}), the
- scope of this thesis is limited to multi-touch surface based devices, for
- which the \emph{event area} concept suffices. Section \ref{sec:eventfilter}
- explores the possibility of event areas to be replaced with event filters.
- \section{Detecting gestures from low-level events}
- \label{sec:gesture-detection}
- The low-level events that are grouped by an event area must be translated
- to high-level gestures in some way. Simple gestures, such as a tap or the
- dragging of an element using one finger, are easy to detect by comparing
- the positions of sequential $point\_down$ and $point\_move$ events. More
- complex gestures, like the writing of a character from the alphabet,
- require more advanced detection algorithms.
- Sequences of events that are triggered by multi-touch surfaces are
- often of manageable complexity. An imperative programming style is
- sufficient to detect many common gestures, like rotation and dragging. The
- imperative programming style is also familiar and understandable for a wide
- range of application developers. Therefore, the architecture should support
- an imperative style of gesture detection. A problem with an imperative
- programming style is that the explicit detection of different gestures
- requires different gesture detection components. If these components are
- not managed well, the detection logic is prone to become chaotic and
- over-complex.
- A way to detect more complex gestures based on a sequence of input events,
- is with the use of machine learning methods, such as the Hidden Markov
- Models\footnote{A Hidden Markov Model (HMM) is a statistical model of a
- memoryless (Markov) process with hidden states; it can be used to recognize
- gestures from sequences of input states.} used for sign language detection by Gerhard Rigoll et al.
- \cite{conf/gw/RigollKE97}. A sequence of input states can be mapped to a
- feature vector that is recognized as a particular gesture with a certain
- probability. An advantage of using machine learning with respect to an
- imperative programming style is that complex gestures can be described
- without the use of explicit detection logic, thus reducing code complexity.
- For example, the detection of the character `A' being written on the screen
- is difficult to implement using an imperative programming style, while a
- trained machine learning system can produce a match with relative ease.
- To manage complexity and support multiple styles of gesture detection
- logic, the architecture has adopted the tracker-based design as described
- by Manoj Kumar \cite{win7touch}. Different detection components are wrapped
- in separate gesture tracking units called \emph{gesture trackers}. The
- input of a gesture tracker is provided by an event area in the form of
- events. Each gesture detection component is wrapped in a gesture tracker
- with a fixed type of input and output. Internally, the gesture tracker can
- adopt any programming style. A character recognition component can use an
- HMM, whereas a tap detection component defines a simple function that
- compares event coordinates.
- When a gesture tracker detects a gesture, this gesture is triggered in the
- corresponding event area. The event area then calls the callbacks which are
- bound to the gesture type by the application.
- The use of gesture trackers as small detection units makes the architecture
- extensible. A developer can write a custom gesture tracker and
- register it in the architecture. The tracker can use any type of detection
- logic internally, as long as it translates low-level events to high-level
- gestures.
- An example of a possible gesture tracker implementation is a
- ``transformation tracker'' that detects rotation, scaling and translation
- gestures.
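- As a smaller illustration of the concept, the sketch below outlines a tap
- tracker that uses imperative detection logic. The thresholds, the
- \texttt{point\_id} attribute and the \texttt{Gesture} container are
- illustrative assumptions, not the reference implementation.
- \begin{verbatim}
- # Sketch of a tap tracker (illustrative names and thresholds).
- import time
- class Gesture(object):
-     def __init__(self, gesture_type, x, y):
-         self.type, self.x, self.y = gesture_type, x, y
- class GestureTracker(object):
-     def __init__(self, area):
-         self.area = area  # event area in which gestures are triggered
-     def handle(self, event):
-         raise NotImplementedError
- class TapTracker(GestureTracker):
-     MAX_DISTANCE = 10   # pixels
-     MAX_DURATION = 0.3  # seconds
-     def __init__(self, area):
-         GestureTracker.__init__(self, area)
-         self.down = {}  # point id -> (x, y, time of the point_down event)
-     def handle(self, event):
-         if event.type == 'point_down':
-             self.down[event.point_id] = (event.x, event.y, time.time())
-         elif event.type == 'point_up' and event.point_id in self.down:
-             x, y, t = self.down.pop(event.point_id)
-             moved = ((event.x - x) ** 2 + (event.y - y) ** 2) ** 0.5
-             if moved < self.MAX_DISTANCE and time.time() - t < self.MAX_DURATION:
-                 self.area.trigger(Gesture('tap', event.x, event.y))
- \end{verbatim}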
- \section{Serving multiple applications}
- \label{sec:daemon}
- The design of the architecture is essentially complete with the components
- specified in this chapter. However, one specification has not yet been
- discussed: the ability to address the architecture using a method of
- communication independent of the application's programming language.
- If the architecture and a gesture-based application are written in the same
- language, the main loop of the architecture can run in a separate thread of
- the application. If the application is written in a different language, the
- architecture has to run in a separate process. Since the application needs
- to respond to gestures that are triggered by the architecture, there must
- be a communication layer between the separate processes.
- A common and efficient way of communication between two separate processes
- is through the use of a network protocol. In this particular case, the
- architecture can run as a daemon\footnote{A ``daemon'' is the Unix term for a
- process that runs in the background.} process, listening
- to driver messages and triggering gestures in registered applications.
- \vspace{-0.3em}
- \daemondiagram
- An advantage of a daemon setup is that it can serve multiple applications
- at the same time. Alternatively, each application that uses gesture
- interaction would start its own instance of the architecture in a separate
- process, which would be less efficient. The network communication layer
- also allows the architecture and a client application to run on separate
- machines, thus distributing computational load. The other machine may even
- use a different operating system.
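- A very small sketch of what such a communication layer could look like, using
- JSON messages over a TCP socket (an assumed protocol, not part of the
- reference implementation):
- \begin{verbatim}
- # Sketch of a daemon notifying a client of a detected gesture, using an
- # assumed protocol: one JSON object per line over a TCP socket.
- import json
- def send_gesture(connection, gesture):
-     message = {'type': gesture.type, 'x': gesture.x, 'y': gesture.y}
-     connection.sendall((json.dumps(message) + '\n').encode('utf-8'))
- # The client application reads lines from the socket in a separate
- # thread, decodes the JSON and calls the callbacks that are bound to
- # the corresponding event area.
- \end{verbatim}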
- \section{Example usage}
- \label{sec:example}
- This section describes an extended example to illustrate the data flow of
- the architecture. The example application listens to tap events on a button
- within an application window. The window also contains a draggable circle.
- The application window can be resized using \emph{pinch} gestures. Figure
- \ref{fig:examplediagram} shows the architecture created by the pseudo code
- below.
- \begin{verbatim}
- initialize GUI framework, creating a window and necessary GUI widgets
- create a root event area that synchronizes position and size with the application window
- define 'pinch' gesture handler and bind it to the root event area
- create an event area with the position and radius of the circle
- define 'drag' gesture handler and bind it to the circle event area
- create an event area with the position and size of the button
- define 'tap' gesture handler and bind it to the button event area
- create a new event server and assign the created root event area to it
- start the event server in a new thread
- start the GUI main loop in the current thread
- \end{verbatim}
- \examplediagram
- \chapter{Implementation and test applications}
- \label{chapter:testapps}
- A reference implementation of the design has been written in Python. Two test
- applications have been created to test if the design ``works'' in a practical
- application, and to detect its flaws. One application is mainly used to test
- the gesture tracker implementations. The other application uses multiple event
- areas in a tree structure, demonstrating event delegation and propagation. The
- second application also defines a custom gesture tracker.
- To test multi-touch interaction properly, a multi-touch device is required. The
- University of Amsterdam (UvA) has provided access to a multi-touch table from
- PQlabs. The table uses the TUIO protocol \cite{TUIO} to communicate touch
- events. See appendix \ref{app:tuio} for details regarding the TUIO protocol.
- %The reference implementation and its test applications are a Proof of Concept,
- %meant to show that the architecture design is effective.
- %that translates TUIO messages to some common multi-touch gestures.
- \section{Reference implementation}
- \label{sec:implementation}
- The reference implementation is written in Python and available at
- \cite{gitrepos}. The following component implementations are included:
- \textbf{Event drivers}
- \begin{itemize}
- \item TUIO driver, using only the support for simple touch points with an
- $(x, y)$ position.
- \end{itemize}
- \textbf{Event areas}
- \begin{itemize}
- \item Circular area
- \item Rectangular area
- \item Polygon area
- \item Full screen area
- \end{itemize}
- \textbf{Gesture trackers}
- \begin{itemize}
- \item Basic tracker, supports $point\_down,~point\_move,~point\_up$ gestures.
- \item Tap tracker, supports $tap,~single\_tap,~double\_tap$ gestures.
- \item Transformation tracker, supports $rotate,~pinch,~drag,~flick$ gestures.
- \end{itemize}
- The implementation does not include a network protocol to support the daemon
- setup as described in section \ref{sec:daemon}. Therefore, it is only usable in
- Python programs. The two test programs are also written in Python.
- The event area implementations contain some geometric functions to determine
- whether an event should be delegated to an event area. All gesture trackers
- have been implemented using an imperative programming style. Technical details
- about the implementation of gesture detection are described in appendix
- \ref{app:implementation-details}.
- \section{Full screen Pygame application}
- %The goal of this application was to experiment with the TUIO
- %protocol, and to discover requirements for the architecture that was to be
- %designed. When the architecture design was completed, the application was rewritten
- %using the new architecture components. The original variant is still available
- %in the ``experimental'' folder of the Git repository \cite{gitrepos}.
- An implementation of the detection of some simple multi-touch gestures (single
- tap, double tap, rotation, pinch and drag) using Processing\footnote{Processing
- is a Java-based programming environment with an export possibility for Android.
- See also \cite{processing}.} can be found in a forum on the Processing website
- \cite{processingMT}. The application has been ported to Python and adapted to
- receive input from the TUIO protocol. The implementation is fairly simple, but
- it yields some appealing results (see figure \ref{fig:draw}). In the original
- application, the detection logic of all gestures is combined in a single class
- file. As predicted by the GART article \cite{GART}, this leads to over-complex
- code that is difficult to read and debug.
- The application has been rewritten using the reference implementation of the
- architecture. The detection code is separated into two different gesture
- trackers, which are the ``tap'' and ``transformation'' trackers mentioned in
- section \ref{sec:implementation}.
- The positions of all touch objects and their centroid are drawn using the
- Pygame library. Since the Pygame library does not provide support to find the
- location of the display window, the root event area captures events in the
- entire screen surface. The application can be run either full screen or in
- windowed mode. If windowed, screen-wide gesture coordinates are mapped to the
- size of the Pygame window. In other words, the Pygame window always represents
- the entire touch surface. The output of the application can be seen in figure
- \ref{fig:draw}.
- \begin{figure}[h!]
- \center
- \includegraphics[scale=0.4]{data/pygame_draw.png}
- \caption{Output of the experimental drawing program. It draws all touch
- points and their centroid on the screen (the centroid is used for rotation
- and pinch detection). It also draws a green rectangle which responds to
- rotation and pinch events.}
- \label{fig:draw}
- \end{figure}
- \section{GTK+/Cairo application}
- \label{sec:testapp}
- The second test application uses the GIMP toolkit (GTK+) \cite{GTK} to create
- its user interface. Since GTK+ defines a main event loop that is started in
- order to use the interface, the architecture implementation runs in a separate
- thread.
- The application creates a main window, whose size and position are synchronized
- with the root event area of the architecture. The synchronization is handled
- automatically by a \texttt{GtkEventWindow} object, which is a subclass of
- \texttt{gtk.Window}. This object serves as a layer that connects the event area
- functionality of the architecture to GTK+ windows.
- The main window contains a number of polygons which can be dragged, resized and
- rotated. Each polygon is represented by a separate event area to allow
- simultaneous interaction with different polygons. The main window also responds
- to transformation, by transforming all polygons. Additionally, double tapping
- on a polygon changes its color.
- An ``overlay'' event area is used to detect all fingers currently touching the
- screen. The application defines a custom gesture tracker, called the ``hand
- tracker'', which is used by the overlay. The hand tracker uses distances
- between detected fingers to detect which fingers belong to the same hand. The
- application draws a line from each finger to the hand it belongs to, as visible
- in figure \ref{fig:testapp}.
- \begin{figure}[h!]
- \center
- \includegraphics[scale=0.35]{data/testapp.png}
- \caption{Screenshot of the second test application. Two polygons can be
- dragged, rotated and scaled. Separate groups of fingers are recognized as
- hands; each hand is drawn as a centroid with a line to each finger.}
- \label{fig:testapp}
- \end{figure}
- To manage the propagation of events used for transformations, the application
- arranges its event areas in a tree structure as described in section
- \ref{sec:tree}. Each transformable event area has its own ``transformation
- tracker'', which stops the propagation of events used for transformation
- gestures. Because the propagation of these events is stopped, overlapping
- polygons do not cause a problem. Figure \ref{fig:testappdiagram} shows the tree
- structure used by the application.
- Note that the overlay event area, though covering the whole screen surface, is
- not the root event area. The overlay event area is placed on top of the
- application window (being a rightmost sibling of the application window event
- area in the tree). This is necessary, because the transformation trackers stop
- event propagation. The hand tracker needs to capture all events to be able to
- give an accurate representation of all fingers touching the screen. Therefore,
- the overlay should delegate events to the hand tracker before they are stopped
- by a transformation tracker. Placing the overlay over the application window
- forces the screen event area to delegate events to the overlay event area
- first.
- \testappdiagram
- \section{Results}
- \emph{TODO: Point out the shortcomings of the design that emerge from the tests.}
- % Different devices/drivers emit different kinds of primitive events.
- % A translation of these device-specific events to a common event format is
- % needed to perform gesture detection in a generic way.
- % By passing the input of multiple drivers through the same kind of event
- % driver, multiple devices are supported at the same time.
- % The event driver delivers low-level events. Not every event belongs to every
- % gesture, so there has to be a filtering of which events belong to which
- % gesture. Areas provide this possibility on devices for which the filtering
- % is location-based.
- % Splitting gesture detection into gesture trackers is a way to be flexible in
- % the types of detection logic that are supported, and to keep complexity
- % manageable.
- \chapter{Suggestions for future work}
- \label{chapter:futurework}
- \section{A generic method for grouping events}
- \label{sec:eventfilter}
- As mentioned in section \ref{sec:areas}, the concept of an event area is based
- on the assumption that the set of originating events that form a particular
- gesture can be determined based exclusively on the location of the events.
- Since this thesis focuses on multi-touch surface based devices, and every
- object on a multi-touch surface has a position, this assumption is valid.
- However, the design of the architecture is meant to be more generic; to provide
- a structured design for managing gesture detection.
- An in-air gesture detection device, such as the Microsoft Kinect \cite{kinect},
- provides 3D positions. Some multi-touch tables work with a camera that can also
- determine the shape and rotational orientation of objects touching the surface.
- For these devices, events delegated by the event driver have more parameters
- than a 2D position alone. The term ``area'' is not suitable to describe a group
- of events that consist of these parameters.
- A more generic term for a component that groups similar events is the
- \emph{event filter}. The concept of an event filter is based on the same
- principle as event areas, which is the assumption that gestures are formed from
- a subset of all events. However, an event filter takes all parameters of an
- event into account. One application on a camera-based multi-touch table could
- be to group all triangular objects into one filter and all rectangular objects
- into another, or to separate small finger tips from large ones so as to
- recognize whether a child or an adult touches the table.
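- A sketch of the event filter concept is given below; the names are
- hypothetical, and events are assumed to carry a \texttt{shape} parameter for
- the camera-based example.
- \begin{verbatim}
- # Sketch of an event filter (illustrative); the predicate may use any
- # parameter of an event, not only its location.
- class EventFilter(object):
-     def __init__(self, predicate):
-         self.predicate = predicate
-         self.trackers = []
-     def delegate(self, event):
-         if self.predicate(event):
-             for tracker in self.trackers:
-                 tracker.handle(event)
- # Camera-based table example, assuming events carry a 'shape' parameter:
- # triangles = EventFilter(lambda e: e.shape == 'triangle')
- # rectangles = EventFilter(lambda e: e.shape == 'rectangle')
- \end{verbatim}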
- \section{Using a state machine for gesture detection}
- All gesture trackers in the reference implementation are based on the explicit
- analysis of events. Gesture detection is a widely researched subject, and the
- separation of detection logic into different trackers allows for multiple types
- of gesture detection in the same architecture. An interesting question is
- whether multi-touch gestures can be described in a formal way so that explicit
- detection code can be avoided.
- \cite{GART} and \cite{conf/gw/RigollKE97} propose the use of machine learning
- to recognize gestures. To use machine learning, a set of input events forming a
- particular gesture must be represented as a feature vector. A learning set
- containing a set of feature vectors that represent some gesture ``teaches'' the
- machine what the feature vector of the gesture looks like.
- An advantage of using explicit gesture detection code is the fact that it
- provides a flexible way to specify the characteristics of a gesture, whereas
- the performance of feature vector-based machine learning is dependent on the
- quality of the learning set.
- A better method to describe a gesture might be to specify its features as a
- ``signature''. The parameters of such a signature must be based on input
- events. When a set of input events matches the signature of some gesture, the
- gesture is triggered. A gesture signature should be a complete description
- of all requirements the set of events must meet to form the gesture.
- A way to describe signatures on a multi-touch surface can be by the use of a
- state machine of its touch objects. The states of a simple touch point could be
- $\{down, move, up, hold\}$, indicating respectively that a point is put down, is
- being moved, is released, and is held at a position for some time. In this
- case, a ``drag'' gesture can be described by the sequence $down - move - up$
- and a ``select'' gesture by the sequence $down - hold$. If the set of states is
- not sufficient to describe a desired gesture, a developer can add additional
- states. For example, to be able to make a distinction between an element being
- ``dragged'' or ``thrown'' in some direction on the screen, two additional
- states can be added: $\{start, stop\}$ to indicate that a point starts and stops
- moving. The resulting state transitions are sequences $down - start - move -
- stop - up$ and $down - start - move - up$ (the latter does not include a $stop$
- to indicate that the element must keep moving after the gesture has been
- performed).
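- Assuming that each touch point produces such a sequence of states, a signature
- could be matched against the observed states as in the sketch below
- (illustrative only, not part of the reference implementation).
- \begin{verbatim}
- # Sketch of matching gesture signatures against the observed states of
- # a touch point (illustrative only).
- DRAG = ('down', 'move', 'up')
- SELECT = ('down', 'hold')
- def normalize(states):
-     # Collapse repeated states, e.g. many successive 'move' states.
-     result = []
-     for state in states:
-         if not result or result[-1] != state:
-             result.append(state)
-     return tuple(result)
- def matches(signature, observed_states):
-     return normalize(observed_states) == signature
- # matches(DRAG, ['down', 'move', 'move', 'up'])  -> True
- # matches(SELECT, ['down', 'move', 'up'])        -> False
- \end{verbatim}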
- An additional way to describe even more complex gestures is to use other
- gestures in a signature. An example is to combine $select - drag$ to specify
- that an element must be selected before it can be dragged.
- The application of a state machine to describe multi-touch gestures is a
- subject well worth exploring in the future.
- \section{Daemon implementation}
- Section \ref{sec:daemon} proposes the use of a network protocol to communicate
- between an architecture implementation and (multiple) gesture-based
- applications, as illustrated in figure \ref{fig:daemon}. The reference
- implementation does not support network communication. If the architecture
- design is to become successful in the future, the implementation of network
- communication is a must. ZeroMQ (or $\emptyset$MQ) \cite{ZeroMQ} is a
- high-performance software library with support for a wide range of programming
- languages. A future implementation could use this library as the basis for
- its communication layer.
- If an implementation of the architecture is released, a good idea would be
- to do so within a community of application developers. A community can
- contribute to a central database of gesture trackers, making the interaction
- from their applications available for use in other applications.
- Ideally, a user can install a daemon process containing the architecture so
- that it is usable for any gesture-based application on the device. Applications
- that use the architecture can specify it as being a software dependency, or
- include it in a software distribution.
- \bibliographystyle{plain}
- \bibliography{report}{}
- \appendix
- \chapter{The TUIO protocol}
- \label{app:tuio}
- The TUIO protocol \cite{TUIO} defines a way to geometrically describe tangible
- objects, such as fingers or objects on a multi-touch table. Object information
- is sent to the TUIO UDP port (3333 by default).
- For efficiency reasons, the TUIO protocol is encoded using the Open Sound
- Control \cite[OSC]{OSC} format. An OSC server/client implementation is
- available for Python: pyOSC \cite{pyOSC}.
- A Python implementation of the TUIO protocol also exists: pyTUIO \cite{pyTUIO}.
- However, the execution of an example script yields an error regarding Python's
- built-in \texttt{socket} library. Therefore, the reference implementation uses
- the pyOSC package to receive TUIO messages.
- The two most important message types of the protocol are ALIVE and SET
- messages. An ALIVE message contains the list of session ids that are currently
- ``active'', which in the case of a multi-touch table means that they are
- touching the screen. A SET message provides geometric information of a session
- id, such as position, velocity and acceleration.
- Each session id represents an object. The only type of object on the
- multi-touch table is what the TUIO protocol calls ``2DCur'', which is an $(x, y)$
- position on the screen.
- ALIVE messages can be used to determine when an object touches and releases the
- screen. For example, if a session id was in the previous message but not in the
- current one, the object it represents has been lifted from the screen.
- SET messages provide information about movement. In the case of simple $(x, y)$ positions,
- only the movement vector of the position itself can be calculated. For more
- complex objects such as fiducials, arguments like rotational position and
- acceleration are also included.
- ALIVE and SET messages can be combined to create ``point down'', ``point move''
- and ``point up'' events.
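- For example, comparing the session ids of two successive ALIVE messages yields
- the ``point down'' and ``point up'' events, as in the simplified sketch below
- (not the reference implementation).
- \begin{verbatim}
- # Sketch: derive down/up events from two successive ALIVE id lists.
- def alive_diff(previous_alive, current_alive):
-     previous, current = set(previous_alive), set(current_alive)
-     down_ids = current - previous  # objects that touched the surface
-     up_ids = previous - current    # objects lifted from the surface
-     return down_ids, up_ids
- # alive_diff([1, 2], [2, 3])  ->  ({3}, {1})
- \end{verbatim}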
- TUIO coordinates range from $0.0$ to $1.0$, with $(0.0, 0.0)$ being the top
- left corner of the screen and $(1.0, 1.0)$ the bottom right corner. To focus
- events within a window, a translation to window coordinates is required in the
- client application, as stated by the online specification
- \cite{TUIO_specification}:
- \begin{quote}
- In order to compute the X and Y coordinates for the 2D profiles a TUIO
- tracker implementation needs to divide these values by the actual sensor
- dimension, while a TUIO client implementation consequently can scale these
- values back to the actual screen dimension.
- \end{quote}
- \chapter{Gesture detection in the reference implementation}
- \label{app:implementation-details}
- Both rotation and pinch use the centroid of all touch points. A \emph{rotation}
- gesture uses the difference in angle relative to the centroid of all touch
- points, and \emph{pinch} uses the difference in distance. Both values are
- normalized using division by the number of touch points. A pinch event contains
- a scale factor, and therefore uses a division of the current by the previous
- average distance to the centroid.
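- A simplified sketch of this computation is given below; it is illustrative
- only and omits details such as the bookkeeping of point positions and the
- wrap-around of angles.
- \begin{verbatim}
- # Sketch of centroid-based rotation and pinch detection (illustrative).
- from math import atan2, hypot
- def centroid(points):
-     n = float(len(points))
-     return sum(x for x, y in points) / n, sum(y for x, y in points) / n
- def rotation_and_scale(previous, current):
-     # 'previous' and 'current' contain the same touch points, in the
-     # same order, at two successive moments in time.
-     cx, cy = centroid(current)
-     angle = prev_dist = curr_dist = 0.0
-     for (px, py), (x, y) in zip(previous, current):
-         angle += atan2(y - cy, x - cx) - atan2(py - cy, px - cx)
-         prev_dist += hypot(px - cx, py - cy)
-         curr_dist += hypot(x - cx, y - cy)
-     n = len(current)
-     # Rotation: average angle difference (wrap-around at pi is ignored
-     # here); pinch: ratio of current to previous average distance.
-     return angle / n, curr_dist / prev_dist
- \end{verbatim}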
- % TODO
- \emph{TODO: rotation and pinch will be described somewhat differently and in more detail.}
- \end{document}