Feature Articles: Media System Technology for Creating an Appealing User Experience
Media Processing Technology for Achieving Hospitality while on the Go
This article introduces a service concept of hospitality on the go, in which users are guided while they move around a local area, as well as statistical machine translation technology and robust media search technology for supporting such services.
Keywords: statistical machine translation, robust media search, hospitality
With the year 2020 in mind, NTT’s goal is to implement a navigation service that can provide foreign visitors to Japan, who are moving about a local area, with detailed guidance according to the user’s attributes and situation. This concept is described in detail below.
1.1 Service for guiding people to their destination in unfamiliar places
In recent years, the translation of Japanese guidance information into other languages on information displays in public transportation facilities such as train stations has been progressing. Nevertheless, information that can change at any time due to delays or accidents cannot be translated in advance. Moreover, visitors from other countries do not have an intuitive understanding of the local geography, so simply translating the names of places and exits that appear on signs does not help them decide which way to go.
To achieve a more effective navigation aid, detailed information presented in Japanese is translated in real time by using multilingual statistical translation technology, and appropriate guidance information is selected based on an estimation of the user’s situation (Fig. 1). Robust media search (RMS) technology, which can recognize objects that the user sees around them, and various other types of recognition technology are used to estimate the user’s situation. For example, when a train station employee inputs emergency information in Japanese, that information is immediately translated and sent to the smartphones of foreign visitors who are in that station, and digital signage or message boards can be translated and displayed on smartphones held up to the boards by users. It would also be possible to display navigation instructions to destinations in various languages by interworking with smartphones.
1.2 Tourist navigation service based on “What I can see now”
At tourist sites and places being visited for the first time, the surroundings as seen by the user are captured with a smartphone or head-mounted device, and that information is used to provide location-based guidance. Video of scenery can include various photographic angles and objects in the environment. RMS-object is a specialized type of object recognition technology for recognizing photographic subjects. RMS-object can be used to discover multiple objects from different viewing angles and environments with higher accuracy (Fig. 2). Combining RMS-object with technology capable of estimating the user’s situation makes it possible to provide guidance according to the user’s location at the time by selecting and displaying what is appropriate for the user’s attributes and situation from the information available on the discovered objects.
2. Media processing technology to support hospitality on the go
NTT is moving forward with research and development (R&D) of statistical machine translation and RMS technology to be applied in implementing hospitality on the go services.
2.1 Statistical machine translation
As use of the Internet increases and the reach of globalization spreads further, the need for language translation done by computers, known as machine translation, is also increasing. Work on machine translation to eliminate the barrier of language, including work on a national level, is accelerating as we look toward 2020, and expectations are high.
R&D on machine translation has a long history, and many machine translation systems have already been developed. Nevertheless, existing systems have not really met worldwide needs and expectations, so an innovative advance in technology is needed.
Conventional machine translation systems that use a rule-based translation approach require years of work by many experts to manually produce translation rules and bilingual dictionaries for translation of a new language. Such systems have already reached the limit of accuracy of manual work, and thus, in recent years, a different approach called statistical machine translation has become mainstream. In this approach, a statistical model that is equivalent to translation rules and bilingual dictionaries is learned automatically from large-scale bilingual data on the order of several million sentence pairs.
The outline of statistical machine translation is illustrated in Fig. 3. It achieved at an early stage a practical level of accuracy for language pairs that have very similar word orders such as English and French. For languages that have greatly different word orders such as English and Japanese, however, it was not able to outperform the conventional rule-based translation.
NTT has devised a method in which statistical machine translation is applied after reordering English words into Japanese word order. The reordering of English words is based on the single Japanese linguistic property of head finality . For the first time in English-to-Japanese translation, we achieved a result in which statistical machine translation outperformed rule-based translation in accuracy .
The concept of word reordering based on the Japanese head-final property is illustrated in Fig. 4. The word that determines the grammatical role of a phrase in a sentence is called the head; or, as is learned in Japanese elementary school, the word that is modified by others is the head. In Japanese, the dependency always goes from left to right, which is to say that the modified word is always placed at the sentence-end side. The term that describes this relationship is called head finality. Therefore, if we reorder the words of the translation source language (English or Chinese) so that its dependency always goes from left to right, the resulting word order becomes the same as the Japanese word order. If the word order is the same, highly accurate translation is possible through literal word-by-word translation.
Translating from Japanese to foreign languages (English or Chinese), on the other hand, is difficult because we must select the dependency relations in the input Japanese sentence that should be reversed from right to left based on the word order of the target language.
NTT has devised a translation method for changing the word order of Japanese sentences to that of the target language based on the predicate-argument structure of Japanese . The predicate-argument structure is the grammatical relationship between a verb and nouns, namely, which noun is the subject of the verb and which noun is the object of the verb. As is learned in English classes in Japanese middle schools, English has an SVO (subject, verb, object) word order, while Japanese has an SOV (subject, object, verb) order.
Therefore, we first identify the predicate-argument structure of the Japanese sentence and then change the order of the bunsetsu phrases (roughly corresponding to noun phrases, prepositional phrases, verb phrases, etc. in English) to convert the Japanese SOV order to the SVO order of English. Because English and Japanese have an opposite word order within a bunsetsu phrase, the next step is to reorder the words in the Japanese bunsetsu phrases to match the English order (e.g., Åìµþ¤Ç→in Tokyo). This approach reduces the number of word ordering errors in statistical translation from Japanese to English by about 30% relative to the conventional method.
The Multilingual Statistical Translation Platform (PF) was developed with the machine translation technology described above. The platform currently handles translation from English, Chinese, and Korean to Japanese and from Japanese to those languages. In addition to the main translation function, the platform provides a user dictionary function, an unknown language detection function, and other functions that are needed for business applications. Functions for user convenience, such as support for creating statistical models, which is difficult for ordinary users, are also implemented.
The quality of statistical machine translation depends on the amount of data used to train the statistical model. We achieved high-quality in translating patents by using this platform with large datasets of corresponding sentence pairs that we created from patent documents for English and Japanese (about 17 million sentence pairs), Chinese and Japanese (about 8 million sentence pairs), and Korean and Japanese (about 2 million sentence pairs). Replacing the data used when training the statistical model makes it possible to automatically construct high-quality machine translation systems for particular domains other than patents.
The development of innovative machine translation technology and its implementation in a system as described above has laid the foundation for meeting worldwide needs and expectations. In the future, we will aim for even higher accuracy and a wider range of target domains to achieve machine translation that truly removes the barrier of language.
RMS is technology for using small parts of video or still-image signals from a camera or sound signals from a microphone as keys to search a large database that contains videos, music, and still images of landmarks (Fig. 5) [4, 5].
This kind of matching-based media search has been the subject of R&D by NTT laboratories for over 20 years, and the results have served as the core of various services, including net monitoring for investigating the use of video on the Internet, music use listing for automatically creating lists of music used in broadcast programming, and second screen for displaying network content related to broadcast programming on smartphones by capturing audio or video in the programming.
RMS is robust against ambient noise or obstacles, distortion in video, and interruptions in audio, and it can also search huge amounts of media data instantly. For example, it is possible to identify the name of a song that can be heard amidst street noise. For video, it is possible to quickly and accurately identify objects that are partially hidden and cannot be seen in their entirety. Because RMS uses video and sound rather than textual information, it is possible to identify what is seen or heard when the names of those things are not known by the user or when text input is difficult. For hospitality while on the go, this function can be used to recognize objects that can be seen in the surroundings and to display appropriate information according to the user’s attributes and situation.
We are currently working on increasing the speed and accuracy as well as the ease of use of RMS to enable fast and accurate searches at the moment and on the spot. In the future, we will continue to study actual use environments and to do basic research on media search technology.
3. Future development
The idea of hospitality on the go places importance on understanding the user’s situation and intentions, which is one of the main elements of our vision of a personal agent. People who visit Japan find it difficult to gather information in public places because of language differences and indecipherable signs and information displays. Our approach to overcoming this problem is to provide that information in different forms that suit users and to display it in useful ways.
Our goal for the future is to implement services to provide a better user experience by going beyond the translation and video search technology described in this article through interworking with technology related to geographic data and other important technical elements.