Classroom gets clever

Researchers at the Intelligent Information Laboratory at Northwestern University in the US are currently building a prototype of an AI-based ‘Classroom’ that tries to serve as its own A/V assistant.

It watches the speaker as he lectures and listens to what he says. When his actions suggest that some assistance is called for, the Classroom responds as it deems appropriate.

In the Intelligent Classroom, the researchers, led by Professor Kristian J. Hammond, are enabling new modes of user interaction through multiple sensing methods and plan recognition. The Classroom uses cameras and microphones to determine what the speaker is trying to do and then takes the actions it deems appropriate. One of the goals is to let the speaker interact with the Classroom as he or she would with an audiovisual assistant: through commands (speech, gesture, or both) or by just giving a presentation and trusting the Classroom to do what is needed.
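
To give a flavour of how such multimodal commands might be handled, here is a minimal sketch; it is not the Classroom's actual code, and the event structure and action names are assumptions made purely for illustration.

```python
# Minimal sketch of multimodal command dispatch (illustrative only; the
# event fields and action names are assumptions, not the project's API).
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class SpeakerEvent:
    speech: Optional[str] = None   # recognised utterance, e.g. "next slide"
    gesture: Optional[str] = None  # recognised gesture, e.g. "point_left"

def dispatch(event: SpeakerEvent, actions: Dict[str, Callable[[], None]]) -> None:
    """Pick a Classroom action from speech and/or gesture cues."""
    if event.speech in actions:
        actions[event.speech]()
    elif event.gesture in actions:
        actions[event.gesture]()
    # Otherwise the Classroom simply carries on with its current plan.

actions = {
    "next slide": lambda: print("advancing slide projector"),
    "point_left": lambda: print("panning camera left"),
}
dispatch(SpeakerEvent(speech="next slide"), actions)  # -> advancing slide projector
```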

One way the Classroom assists the speaker is by controlling A/V components such as VCRs and slide projectors. In addition, the Classroom lets speakers easily produce fair-quality lecture videos: based on the speaker’s actions, the video cameras pan, tilt and zoom to best capture what is important.
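
As a rough illustration of the camera control involved, the sketch below nudges a pan/tilt camera so that the speaker stays centred in the frame; the simple proportional-control scheme and its gain are assumptions, not the project's implementation.

```python
# Illustrative proportional control for keeping the speaker framed
# (the gain value and the degrees-per-pixel mapping are assumptions).
def pan_correction(speaker_x: float, frame_width: int, gain: float = 0.05) -> float:
    """Return a pan adjustment (in degrees) that moves the speaker toward centre."""
    error = speaker_x - frame_width / 2.0   # pixels off-centre (positive = right)
    return gain * error                     # larger error -> larger correction

# Speaker detected at x = 500 in a 640-pixel-wide frame:
print(pan_correction(500, 640))  # positive value -> pan right
```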

In order to interact with the speaker effectively, the Intelligent Classroom needs to know where the speaker is and whether he is making any gestures. The Classroom is equipped with a number of cameras from which it extracts images that it then examines for salient information. Fortunately, since the Classroom knows what a speaker is likely to do (and often what he is currently doing), it is able to use information about the current situation (the context) to make the computer vision task easier and more accurate.

In general, computer vision is made tractable by using special-purpose visual routines suited to a given context, and the Classroom is no exception. To maintain the flexibility it needs, together with the robustness its tasks demand, the Classroom uses a run-time configurable vision system called Gargoyle. Gargoyle provides an environment that can be programmed to take the current context into account and then be quickly reconfigured as the visual task or the situation inside the Classroom changes.
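
The sketch below conveys the general idea of a run-time reconfigurable vision pipeline in the spirit of Gargoyle; the class and stage names are illustrative assumptions rather than Gargoyle's actual interface.

```python
# Sketch of a run-time reconfigurable vision pipeline (names are assumptions).
from typing import Any, Callable, List

Stage = Callable[[Any], Any]   # each stage transforms a frame or intermediate result

class Pipeline:
    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def reconfigure(self, stages: List[Stage]) -> None:
        """Swap in a different sequence of visual routines while running."""
        self.stages = stages

    def process(self, frame: Any) -> Any:
        for stage in self.stages:
            frame = stage(frame)
        return frame

# Placeholder stages for one context ...
grab_luminance = lambda f: f
subtract_background = lambda f: f
find_person_blob = lambda f: f
pipeline = Pipeline([grab_luminance, subtract_background, find_person_blob])

# ... and a quick swap when the visual situation changes:
track_by_colour = lambda f: f
pipeline.reconfigure([grab_luminance, track_by_colour])
```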

For a vision system to adapt to the situation in this way, it must be controlled by a reasoning system that sits on top of it and reasons about its operation. Each visual routine the Classroom uses declares the specific information it can extract from a scene and the constraints under which it can operate. Because this information is explicit, the reasoning system can determine when each routine is appropriate; by selecting different routines to run, the researchers achieve vision that is both more general purpose and more robust.

For example, the Classroom can switch routines to acquire different information about where the speaker is or what he is doing. Alternatively, it might switch routines to acquire the same information in a different way if the constraints of the first routine no longer hold. The result is a system that can extract a broad range of information from a scene by focusing on the specific elements needed by the execution system it serves.
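
In code, such explicit declarations and the selection among them might look something like the sketch below; the routine names and constraint checks are assumptions made for illustration.

```python
# Sketch of routine selection driven by explicit 'provides' and constraint
# information (routine names and context flags are illustrative assumptions).
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set

@dataclass
class VisualRoutine:
    name: str
    provides: Set[str]                       # e.g. {"speaker_position"}
    applicable: Callable[[Dict], bool]       # constraint check on the current context

ROUTINES = [
    VisualRoutine("background_subtraction", {"speaker_position"},
                  lambda ctx: ctx.get("camera_static", False)),
    VisualRoutine("colour_tracking", {"speaker_position"},
                  lambda ctx: ctx.get("speaker_colour_model", False)),
]

def select_routine(needed: str, context: Dict) -> Optional[VisualRoutine]:
    """Return a routine that supplies the needed information and whose
    constraints hold in the current context, or None if none applies."""
    for routine in ROUTINES:
        if needed in routine.provides and routine.applicable(context):
            return routine
    return None

print(select_routine("speaker_position", {"camera_static": True}).name)
# -> background_subtraction
```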

Visual routines that the Classroom can currently configure include person tracking and hand-drawn icon recognition, and there are several methods the Classroom can use to accomplish these tasks in a given context. To track a person in the room, for example, the Classroom can use background-subtraction techniques to segment the speaker from the background. If the person were to wander out of the field of view, it could rapidly reconfigure the pipeline to track by colour instead.
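
The rough sketch below uses OpenCV as a stand-in for the Classroom's own routines, showing the two tracking modes and a switch between them; the switching condition, colour thresholds and camera setup are assumptions for illustration only.

```python
# Background-subtraction tracking with a crude colour-tracking fallback,
# using OpenCV as a stand-in (thresholds and switching logic are assumptions).
import cv2

cap = cv2.VideoCapture(0)                         # any available camera
bg_subtractor = cv2.createBackgroundSubtractorMOG2()
mode = "background"

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if mode == "background":
        # Segment the moving speaker against a static background.
        mask = bg_subtractor.apply(frame)
        if cv2.countNonZero(mask) == 0:           # speaker lost: switch routines
            mode = "colour"
    else:
        # Fall back to tracking by colour, e.g. a rough skin-tone range in HSV.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))
    cv2.imshow("speaker mask", mask)
    if cv2.waitKey(1) == 27:                      # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```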

By using visual techniques that are robust in particular contexts, the Classroom is able to accomplish a very general vision task, and it can do so only because it reasons at a higher level about how it is sensing the world.

To effectively cooperate with the speaker, the Intelligent Classroom has to act appropriately at the right moments. So, the Classroom must first understand what the speaker is doing and then carefully synchronize its actions with the speaker’s. For example, when a speaker goes to the chalkboard to write, the Classroom has to use two very different camera techniques: one for when he walks and the other for when he writes. If the Classroom uses the walking technique while the speaker is writing, people viewing the video feed won’t be able to read his writing.

To address this challenge, the Classroom uses plan representations that explicitly represent the speaker’s actions, the Classroom’s actions, and how they should fit together. These plans are intended to represent a common understanding of how a speaker and an A/V assistant would interact. When the speaker is doing something, the Classroom monitors his progress through his part of the plan, waiting for the moments when the Classroom needs to act. For example, the ‘walk over to the chalkboard and write’ plan has a process (sequence of actions) for the speaker’s actions of moving to the board, stopping at it, beginning to write, and finishing. It also includes processes specifying how the Classroom should film the speaker and adjust the lights. Finally, the plan states that the camera technique should start changing as the speaker enters the chalkboard’s vicinity.
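
One possible encoding of such a plan, purely for illustration (the step names and the structure of the record are assumptions, not the Classroom's actual representation), might look like this:

```python
# Illustrative plan record tying the speaker's process to the Classroom's.
plan = {
    "name": "walk-to-chalkboard-and-write",
    "speaker_process": [
        "walk_toward_board",
        "enter_board_vicinity",
        "stop_at_board",
        "begin_writing",
        "finish_writing",
    ],
    "classroom_process": [
        "film_walking_speaker",
        "film_chalkboard_closeup",
        "raise_board_lights",
    ],
    # Synchronisation: the camera technique should start changing as the
    # speaker enters the chalkboard's vicinity.
    "sync_points": {
        "enter_board_vicinity": "film_chalkboard_closeup",
    },
}
print(plan["sync_points"])
```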

The Classroom also uses these plan representations to reason about the speaker’s actions at a higher level. While the speaker gives a presentation, the Classroom monitors the processes that serve as its understanding of the activity in the environment. These include processes for both the speaker’s actions and the Classroom’s actions (such as playing a video or showing a slide). When the Classroom observes the speaker taking an action (such as walking, gesturing or speaking), it tries to explain this action in the context of its understanding. That is, it looks through these processes for one that predicts that the speaker will perform that action.

However, if the Classroom doesn’t find such a process, it must revise its understanding of what the speaker is doing: the speaker apparently isn’t doing what the Classroom thought he was. The Classroom then hypothesizes new processes that explain the speaker’s action. Initially, there may be several candidate explanations, but when the speaker’s future actions contradict some of the proposed processes, they can be rejected, eventually leaving just the speaker’s actual activity.
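
The explain-or-revise loop might be sketched as follows; the function and plan names are hypothetical, and a real plan recogniser would be considerably richer.

```python
# Sketch of explaining an observed action against active processes, and
# hypothesising new ones when no active process predicts it (names assumed).
from typing import Dict, List

def explain(observed_action: str, active: List[Dict], library: List[Dict]) -> List[Dict]:
    """Return the processes consistent with the observed action, drawing new
    candidate explanations from the plan library if none of the active ones fit."""
    consistent = [p for p in active if observed_action in p["expected_actions"]]
    if consistent:
        return consistent
    # No active process predicts this action: revise the understanding.
    return [p for p in library if observed_action in p["expected_actions"]]

library = [
    {"name": "write-on-board", "expected_actions": ["walk_to_board", "write"]},
    {"name": "show-slide", "expected_actions": ["say_next_slide", "point_at_screen"]},
]
active = [library[1]]                              # Classroom thought: show-slide
print([p["name"] for p in explain("walk_to_board", active, library)])
# -> ['write-on-board']; later actions would prune any remaining candidates.
```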

The researchers say there are many interesting directions the project could take in future. They could, for example, build an analogous system that watches the audience as well.