
HCI - Human Computer Interaction

May 30, 2009 - 1759 hrs

'Clairvoyant' - Information Retrieval via Common Multimedia Devices

Introduction

Man is a social animal with a curious mind. Sometimes we want to know almost everything that comes our way, but do we really find out? How many times have we visited a new place with monuments or historic importance without learning about it? How many times have we visited a place without knowing what it is called, how old it is and why it is there? An example could be a weird-looking rock whose importance is hard to understand; see, for example, the pictures below.

                               

(Pictures: "What is this rock?", "What building is it?" and "This is on TV. Who is he?")

Many times we miss important and interesting details about certain places, voices and atmospheres. This paper is about unleashing the potential to keep ourselves and our young kin up to date about almost anything we can see, hear or observe, using the basic multimedia devices we use in daily life, such as the camera used to take these pictures, a video camera or a voice recorder. We'll discuss enhancements to these devices that let us reach this goal by locating information in the behemoth of information available on the World Wide Web and delivering it to us as text, audio and video.

Hardware

This idea introduces a multimedia device that acts as a point-and-shoot camera and a video and audio recorder, with a touch LCD screen and a speaker. Hardware-wise, then, there is nothing more in it than in a digital camera; its capabilities, however, go much further. The device will additionally include some core components:

1.       A state-of-the-art ASIC (Application-Specific Integrated Circuit) to create a map of the captured pixels,

2.       A CPU to compute pixel matches (the two could also be combined),

3.       A wireless WAN (WWAN) module with a SIM (Subscriber Identity Module) card for GPRS- or 3G-based Internet connectivity; if a voice module is also added, the integrated device will work as a phone as well.

The device might as well be a modern-day cell phone!

Functionality

The device, which we'll call "clairvoyant" for simplicity in this article, will capture an image and try to match its pixel map against an encyclopedia-style inventory of images on the web. We could use data from Encarta plus user-submitted images accompanied by their meta-information text. Without diverting much from the main topic, we can expose an Internet application that accepts user submissions about peculiar or well-known media; these go to a repository that clairvoyant will use as a lookup resource. Building the repository automatically from user submissions will give end users unrestricted access to a wider variety of information.
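To make the lookup idea concrete, here is a minimal Python sketch of a repository that matches a captured image against user-submitted entries. It assumes the image has already been reduced to a small grayscale grid, and it uses a simple average hash compared by Hamming distance, a swapped-in technique chosen for brevity rather than how Photosynth or any Microsoft service works. `ImageRepository`, `submit`, `lookup` and the `max_distance` tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the device has already decoded and downscaled the photo to a
# small grayscale grid (e.g. 8x8 values in 0..255).
Grid = List[List[int]]


def average_hash(grid: Grid) -> int:
    """Fingerprint: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


@dataclass
class Entry:
    fingerprint: int
    metadata: dict  # user-submitted text, e.g. name, age, history


class ImageRepository:
    """Hypothetical lookup repository built from user submissions."""

    def __init__(self, max_distance: int = 10):
        self.entries: List[Entry] = []
        self.max_distance = max_distance  # assumed tolerance for "same image"

    def submit(self, grid: Grid, metadata: dict) -> None:
        self.entries.append(Entry(average_hash(grid), metadata))

    def lookup(self, grid: Grid) -> Optional[dict]:
        """Return metadata of the closest entry, or None if nothing is close enough."""
        probe = average_hash(grid)
        best = min(self.entries, key=lambda e: hamming(e.fingerprint, probe), default=None)
        if best is None or hamming(best.fingerprint, probe) > self.max_distance:
            return None
        return best.metadata
```

A slightly brighter copy of a submitted picture produces the same fingerprint, so `lookup` finds it; a real service would need far more robust matching to cope with different angles, distances and lighting.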

The ability to match the pixels of images is already under development by Microsoft Live Labs in 'Photosynth'. Similar programming logic can be used either in clairvoyant's firmware or in the server component that performs the pixel match and sends information about the submitted photograph back to clairvoyant's user interface. This gives us the real-time ability to locate detailed information about a picture we just took! Similarly, an audio commentary can be played on demand or immediately after a snapshot is taken.

Next is the challenge of recording a video and finding information about it on the fly. There are multiple ways to solve this problem; the one that comes to mind instantly is to scan the video and extract stills (snapshots) at a regular interval, configurable in seconds. The device opens a connection to the Internet and sends these extracted snapshots as a batch to the web server; the server computes the matches and sends a pile of information back to the clairvoyant client, which renders it on the LCD. Voila!
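The still-extraction step could be sketched as follows, assuming the device knows the video's frame count and frame rate. It only computes which frames to grab and groups them into upload batches; decoding the frames and the actual upload are left out, and `sample_frame_indices` and `batch` are hypothetical helpers.

```python
from typing import List, Sequence


def sample_frame_indices(total_frames: int, fps: float, interval_seconds: float) -> List[int]:
    """Indices of the stills to extract: one every `interval_seconds` of video."""
    if fps <= 0 or interval_seconds <= 0:
        raise ValueError("fps and interval_seconds must be positive")
    step = fps * interval_seconds  # frames between consecutive stills
    indices = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def batch(items: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group still indices into upload batches of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]
```

For a 10-second clip at 30 frames per second with a 2-second interval, this selects frames 0, 60, 120, 180 and 240. Making the interval configurable trades upload size against how closely the commentary can follow the video.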

Similarly, we can capture audio of something or someone and send the recording to the server in chunks. The server either reassembles the chunks or uses the individual pieces to find information related to the input using speech technologies (Microsoft Research (MSR) may already have speech and machine learning initiatives that can add value). Matching audio can deliver content back to clairvoyant, such as the speaker and the context of the audio, including its remaining parts, prelude and interlude, etc.
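The chunked upload could be sketched like this, assuming each chunk carries a recording ID, a sequence number and the total chunk count so the server can reassemble them regardless of arrival order. The names are hypothetical, and the speech matching that would run on the reassembled audio is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    recording_id: str
    seq: int      # position of this chunk in the recording
    total: int    # number of chunks the recording was split into
    payload: bytes


def split_audio(recording_id: str, data: bytes, chunk_size: int) -> List[Chunk]:
    """Device side: cut a recording into numbered chunks for upload."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    return [Chunk(recording_id, n, len(parts), p) for n, p in enumerate(parts)]


class Reassembler:
    """Server side: collect chunks that may arrive out of order or twice."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[int, bytes]] = {}

    def receive(self, chunk: Chunk) -> Optional[bytes]:
        """Store a chunk; return the whole recording once every chunk has arrived."""
        parts = self._pending.setdefault(chunk.recording_id, {})
        parts[chunk.seq] = chunk.payload  # a duplicate simply overwrites itself
        if len(parts) < chunk.total:
            return None
        del self._pending[chunk.recording_id]
        return b"".join(parts[i] for i in range(chunk.total))
```

Keying chunks by sequence number also lets the server work on individual pieces before the whole recording has arrived, which matches the "disintegrated data" option above.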

Perks and additional functionality

What do we gain besides customer delight? How does it add value to our bottom line? Is it worth investing time and money? There is one big answer to these questions: ads!

There are many different kinds of ads that could be delivered to users through this device and service:

1)     Contextual text ads

2)     Audio commentary preceded by an advertisement

3)     Image overlay with a relevant ad

4)     Video with embedded video ads, for instance highlighting parts of the video on tap/touch and showing a balloon with an ad

Advertisers will be able to use the Microsoft adCenter UI to choose a segment (based on the clairvoyant repository), browse the media content they find relevant and popular, and bid on images, videos and audio clips.
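One way the service could pick an ad for a matched item is sketched below. It assumes bids are keyed by a repository item ID and an ad format from the list above, and it uses a second-price rule (the winner pays the runner-up's bid) purely as an illustrative assumption; this is not adCenter's actual auction logic, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bid:
    advertiser: str
    media_id: str      # hypothetical ID of an item in the clairvoyant repository
    ad_format: str     # e.g. "text", "audio_prelude", "image_overlay", "video_balloon"
    amount_cents: int


def run_auction(bids: List[Bid], media_id: str, ad_format: str) -> Optional[Tuple[str, int]]:
    """Pick the highest bidder for a matched item; charge the runner-up's bid.

    Returns (advertiser, price_cents), or None when nobody bid on this item/format.
    """
    ranked = sorted(
        (b for b in bids if b.media_id == media_id and b.ad_format == ad_format),
        key=lambda b: b.amount_cents,
        reverse=True,
    )
    if not ranked:
        return None
    winner = ranked[0]
    # Second-price rule (an assumption): a lone bidder pays its own bid.
    price = ranked[1].amount_cents if len(ranked) > 1 else winner.amount_cents
    return winner.advertiser, price
```

Filtering by format first keeps, say, an image-overlay bid from displacing a text ad when the user has asked for text.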

Putting it all together

To provide such functionality to people of all ages and disciplines, we can either create a device from scratch or extend smart mobile phones, using their horsepower to expand the services they offer. Going back to the pictures at the top of this article: I'd stop on the roadside during my Utah excursion, take a picture of that rock, and within seconds learn its name, when and how it was formed, and its historic importance. Next, while driving to Salt Lake City, I take a picture of a building in the upcoming downtown and get its name, the important offices in it, the year it was built, the materials used, etc. My mom, while watching TV, takes a picture, and her clairvoyant device tells her that the man in it is Microsoft CEO Steve Ballmer.

Conclusion

In the end, we can conclude that creating such a smart device, or equipping existing common multimedia devices with these capabilities, is within our reach. We already have some of the know-how; further research in user experience and technology still needs investment of time.

_____________________________________________________________________________________________________

April 21, 2009 - 6:45p PST

Untamed Natural User Interfaces / Augmented Reality

Natural User Interfaces (NUIs) let users interact with computer systems using their ordinary gestures, hand movements, facial expressions and eye movements. In the past decade we have seen significant progress in NUIs, including mouse, pen, touch/tap and camera-based capture of physical gestures. The devices that expose NUIs are still not mature enough to cater to the needs of all kinds of users, not only because there is room for technical advancement but also because of the feasibility and reach of these devices.

The subject of this paper is "untamed" NUIs: areas that still need to be explored and that can greatly ease the usability and adaptability of computer-based systems.

Potential NUIs are discussed below by area of applicability:

In public facilities

Public facilities such as hospitals, banks and restaurants will be able to offer natural user interfaces to their users, providing high availability of services and greater productivity for their employees. For example, a doctor will be able to diagnose a patient or set up an appointment remotely via smart video conferencing integrated into the patient's ordinary television. In the doctor's office, which will seem empty, diagnostic capabilities will be integrated into NUI-enabled walls that capture details about the patient and recommend prescriptions to the doctor in real time by looking up the symptoms and matching medicines from a repository. A patient lying in a hi-tech hospital bed hooked up to the central computer system will get automatic reminders, in the form of a vibration or an alarm, to take their medicines, and the bedside table will hold electronically enhanced pill cases that light up on the table with information about the medicines inside, how often to take them, etc. That means the patient doesn't have to read labels or guess what the medicines are for; instead, a circular or "balloon" dialog describing their purpose and more will appear next to them on the table. Via this electronic bedside table, the patient will be able to interact with hospital facilities: check restroom status, read an electronic visitor call log, request a meeting with a doctor or nurse and schedule their release time, all without taking up any other person's time.

In the enterprise

The next-generation information worker should be able to go to his or her desk and use the desk itself as an input and output device. For instance, the push of a button will light up the glass desk and ask for the user's credentials to log in. Even today, login can be driven by biometric devices such as fingerprint readers; in the future, however, biometrics will include the user's face, eye structure and movement, and voice recognition, with the ability to distinguish a live human voice from a recording played through an electronic speaker by capturing the vibrating air molecules in the atmosphere. A touch of a finger on different parts of the desk will produce a virtual keyboard on the desk itself, which can be dragged to other parts of the desk with natural hand movements. Similarly, dragging a finger around will work as mouse movement, as currently seen on modern tablet PCs. Voice recognition will play a major role in inputting commands to the system, such as writing code, compiling a program, ordering supplies, and commanding the system to print, email and phone. Interdependent components, controls and items on the literal "desk"-top will be able to interact with each other efficiently. To gauge employee productivity, all of the user's actions will be instrumented using gesture capture technology. That means each hand movement, click, tap, drag and access will be recorded at runtime via instrumentation web service calls. This data can be mined to generate reports, compare employees' productivity and provide feedback. All recorded actions can be replayed by simulator software to observe or share the process.
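The instrumentation and replay described above could look roughly like this sketch: each desk interaction becomes an event, is serialized as the JSON body a client might send to an instrumentation web service, can be summarized per user, and can be replayed in time order. The event fields, the JSON shape and the `ActionLog` class are all hypothetical; no real service API is implied.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class InputEvent:
    user: str
    timestamp_ms: int
    action: str   # e.g. "tap", "drag", "click", "voice"
    target: str   # control or item on the desk surface


class ActionLog:
    """Hypothetical store for instrumented desk interactions."""

    def __init__(self) -> None:
        self.events: List[InputEvent] = []

    def record(self, event: InputEvent) -> str:
        """Keep the event and return the JSON body a client might send to the service."""
        self.events.append(event)
        return json.dumps(asdict(event))

    def summary(self, user: str) -> Counter:
        """Count actions per type for one user (raw material for reports)."""
        return Counter(e.action for e in self.events if e.user == user)

    def replay(self, user: str, handler: Callable[[InputEvent], None]) -> None:
        """Feed one user's events to a simulator callback in time order."""
        for event in sorted((e for e in self.events if e.user == user),
                            key=lambda e: e.timestamp_ms):
            handler(event)
```

Sorting by timestamp at replay time, rather than trusting arrival order, matters because web service calls from the desk may reach the server out of order.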

At home and on the go

The other places where most users spend their time are at home and in their car. These will be equipped with NUI-based devices that follow orders via human gestures, voice recognition, pen and touch. Future billboards will be able to scan the user's identity and present personalized information and advertisements; an example of this vision can be seen in the movie Minority Report. Portable household devices such as cameras will have a smart chip and a Wide WAN (Wireless Wide Area Network) connection that lets us take pictures of objects, send the picture data to the cloud (an Internet server), match the pixels of the picture against the repository on the server and retrieve information that could be useful to the user. A practical application: I take a picture of a monument while sightseeing in Utah, and my device turns its LCD display into an encyclopedia expounding on the monument. Similarly, I take a picture of a person on TV, and it gives me his or her name, biography, etc. The device can provide textual, spoken or video information. A highly advanced application would be to shoot a video while a commentary describing the location being shot plays at runtime. The device has to divide the video stream into a series of stills, extract relevant information and then join it all together into speech. These devices can integrate personalized and contextual ads for the benefit of both users and advertisers.

Shopping for clothes and other materials could move into the next generation as well. Users will browse sites offering such products on a computer similar to the one described in the "In the enterprise" section above, built into a table or desk, but with a very special surface material that can change its texture. That means I will be able to touch the fabric of a shirt shown on the vendor's site and actually 'feel' the fabric and its material, because the tabletop computer can change the texture of its surface. Microsoft Surface Computing (www.surface.com) is a significant leap ahead in NUI innovation.

Household appliances will be easily programmable via natural gestures, as easily as posting a note on the refrigerator. For instance, a vertical display computer that responds to human touch will replace (or be integrated into) the refrigerator door. A housewife will be able to write a note or reminder in plain English or another language, instructing the household computer system to do the laundry at 1 p.m., send a congratulations greeting card to her sister at 12 a.m., close the garage door, switch off electricity in unused parts of the house and much more. She can write electronically in her own handwriting, and the machine will translate it into command text and execute it. Similarly, a user interface with touch commands will be provided for programming without handwriting recognition.
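Once the handwriting has been recognized as text, turning a note into commands might be as simple as the following sketch. It assumes one instruction per line and clock times written like "1p", "12a" or "7:30 pm"; `parse_note_lines` and `ScheduledCommand` are hypothetical, and a line without a time is treated as "do it now".

```python
import re
from dataclasses import dataclass
from datetime import time
from typing import List, Optional

# One instruction per line, optionally ending in "at <hour>[:<minute>] a|p[m]".
NOTE_PATTERN = re.compile(
    r"^(?P<task>.+?)\s+at\s+(?P<hour>\d{1,2})(?::(?P<minute>\d{2}))?\s*(?P<meridiem>[ap])m?\.?$",
    re.IGNORECASE,
)


@dataclass
class ScheduledCommand:
    task: str
    at: Optional[time]  # None means "run now"


def to_24h(hour: int, minute: int, meridiem: str) -> time:
    """Convert a 12-hour clock reading (12a = midnight, 12p = noon) to a time."""
    if not 1 <= hour <= 12 or not 0 <= minute <= 59:
        raise ValueError("invalid clock time")
    hour %= 12
    if meridiem.lower() == "p":
        hour += 12
    return time(hour, minute)


def parse_note(line: str) -> ScheduledCommand:
    text = line.strip()
    match = NOTE_PATTERN.match(text)
    if match is None:
        return ScheduledCommand(task=text, at=None)
    minute = int(match.group("minute") or 0)
    when = to_24h(int(match.group("hour")), minute, match.group("meridiem"))
    return ScheduledCommand(task=match.group("task").strip(), at=when)


def parse_note_lines(note: str) -> List[ScheduledCommand]:
    """Parse every non-empty line of a recognized note."""
    return [parse_note(line) for line in note.splitlines() if line.strip()]
```

Mapping each task string to an actual appliance action ("do laundry" to the washer, "close garage door" to the door controller) would be a separate, device-specific step.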

Cars are already becoming advanced enough (e.g. Microsoft Sync technology) to listen to the human voice and perform actions. The car's CPU will be able to connect to the home's central PC, where we can program which songs to listen to (a dynamic playlist) tomorrow when driving to work or on vacation, set gasoline and food break reminders, and remotely start and turn off the car via network-based access, e.g. starting the car while we and the kids are eating breakfast so that it is warmed up when we leave. If the car is stolen, its CPU can be turned off or the car stalled remotely via network-based access from a mobile phone or desktop. Another NUI application for cars could be to turn the windshield into a large, translucent, touch-capable display that provides GPS navigation by highlighting the route on the actual roads and streets. It can identify the locations surrounding the car, speak up information about whichever one the driver points a finger at, and alert the driver to any unusual activity.

____________________________________________________________________________________

March 30, 2009 - 5:25p PST
Documented Observation #1 - Lenovo Laptop Middle Mouse Button

First, I must say that IBM/Lenovo laptops are very stable, durable and high-performance machines. A few months ago I got an X61 ThinkPad. I was immediately impressed by its sleek looks and compact size (I'm a big fan of lightweight, small machines). Small didn't mean less power: it has 4 GB of RAM and a 64-bit dual-core processor. I would like to share some of my experiences with it.

OK, now the good part of this post. There's no touchpad on it :) and what you have instead is a small red mouse button in the middle of the keyboard. My initial thoughts about it were sarcastic. For the first few days, I kept rubbing my thumb on the area below the left and right mouse buttons, but there's nothing there to scroll or move the pointer. Instead, I had to use my index finger to nudge the red mouse button and my thumb to press the left mouse button for left clicks. I came from a Sony VAIO, where I would simply "tap" my finger to click. The good thing, however, is that the left (and right) mouse buttons don't make much noise (unlike those on my Sony VAIO SZ series). My first experiences with this obscure mouse button were not too good.

Now for the second problem with this middle mouse. On "normal" laptops I could scroll pages by sliding my finger along the right or bottom edge of the touchpad for vertical and horizontal scrolling respectively. Since there's no touchpad here, there is instead a button between the left and right mouse buttons that can be held down while moving the red middle-mouse button to scroll in either direction. It works very similarly to pressing the scroll wheel of a traditional mouse and then moving the mouse. It isn't impressive, though, because without the proper driver it scrolls the page in multiple directions at once. So there is a driver that restricts scrolling to either the horizontal or the vertical direction instead of allowing free-hand movement.
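The axis-restricting behavior of such a driver could work roughly like the following sketch, which locks onto whichever direction dominates the first movement after the scroll button is pressed. This is an assumption about how such a driver might behave, not Lenovo's actual driver logic; `AxisLockedScroller` and its `bias` parameter are hypothetical.

```python
from typing import Optional, Tuple


class AxisLockedScroller:
    """Turn pointing-stick motion into scrolling restricted to one axis.

    The axis is chosen from the first movement after the scroll button is
    pressed and kept until it is released (an assumed behavior).
    """

    def __init__(self, bias: float = 1.0) -> None:
        # bias > 1 demands a more clearly sideways push before locking to x.
        self.bias = bias
        self.axis: Optional[str] = None

    def press(self) -> None:
        """Scroll button held down: a new gesture picks its axis afresh."""
        self.axis = None

    def release(self) -> None:
        """Scroll button released: forget the locked axis."""
        self.axis = None

    def move(self, dx: float, dy: float) -> Tuple[float, float]:
        """Return the (horizontal, vertical) scroll amounts for one motion report."""
        if self.axis is None and (dx or dy):
            self.axis = "x" if abs(dx) > self.bias * abs(dy) else "y"
        if self.axis == "x":
            return dx, 0.0
        if self.axis == "y":
            return 0.0, dy
        return 0.0, 0.0
```

The free-movement behavior described above, without the driver, would correspond to simply returning `(dx, dy)` unchanged.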

OK, now let's talk about the cognitive learning involved in using such devices. Sometimes this learning can actually lead to better ergonomics in human-computer interaction. After a couple of months of playing with the middle-mouse button, I realized that I have to move my hand much less, and my fingers (for the purpose of mouse movement) don't have to move much either. That made me realize that my hands aren't getting as tired as they did with a touchpad-based laptop. That said, it takes some extra time to place my fingers (thumb and index finger) in the right position before I can start moving the pointer and clicking, which is a bit different from just moving my index finger on any part of a touchpad. It's worth noting that IBM/Lenovo probably didn't put a touchpad on this computer because of its size (12-inch screen), while the bigger Lenovo models contain both a touchpad and the middle mouse.

Actually, since I talked about Sony, I should also mention an ultra-mobile PC I used a couple of years ago (I'll write about it in detail later). The similarity between the Lenovo X61 and the Sony UX series is that the UX also had a similar mouse button, in the top right corner of the base of the device. By the way, it's a tiny touch-screen computer with only a 4.5-inch screen, and it runs Vista.

Conclusion: I think the middle-mouse button is a bit odd to start with, but it certainly provides better ergonomics when used over a longer period.

Published Wednesday, March 11, 2009 1:19 PM by samar
