Multi-Language Big Data and the Role of Linguistic Technology: A Spotlight Q&A with Jonathan Litchman of SAIC
by Ron Powell
Originally published July 31, 2012
This BeyeNETWORK spotlight features Ron Powell's interview with Jonathan Litchman, SAIC Senior Vice President. Ron and Jonathan discuss how governments and enterprises throughout the world are increasingly challenged by big data and the need to understand and communicate with other cultures and other languages.
Jonathan, could you give our readers a brief overview of SAIC, what you're doing in the area of linguistics, and why this is so important today?
Jonathan Litchman: SAIC is a Fortune 500 scientific engineering and technologies company that does work in energy, health, cyber and national security, largely for the government but also for commercial customers. We are actually one of the largest providers of language services to a number of different federal areas – intelligence, defense and law enforcement.
As a result of the technology that's behind our Omnifluent product line, we're a commercial vendor of linguistic technology, and we have a platform that does linguistic translation and automated interpretation of both text and speech. In our widely expanding globalized world, being able to communicate effectively and understand across languages and cultures has become a compelling need for both government and commercial enterprises. It's a huge global market. Common Sense Advisory, which is a market research firm that looks at this sector, estimates that the annual global market is in excess of $33 billion. What we're seeing is that the need to communicate effectively and translate from one language to another is expanding incredibly fast, much faster than human translators and interpreters can keep up with.
How would you describe “big data” in the area of linguistics and translation software?
Jonathan Litchman: Big data, as you know, is a term used for lots of different things. When I think about big data, it depends on how big you want to get. If you think about the vast amounts of data that people need to be able to handle in only one language, you have tremendous big data issues; but if you understand that the most effective use of big data is to be more inclusive and make that big data more global, then you have a situation in which your data increases exponentially with the inclusion of multiple languages within that dataset.
I've heard many definitions of big data but when you say big data is the incorporation of more than one language, it just makes me feel that the data is a lot bigger.
Jonathan Litchman: It provides a perspective on how much information is available that can be really useful, and it isn't all in just one language.
Can you tell me more about your Omnifluent product?
Jonathan Litchman: Our Omnifluent products help people who want to do analytics and mining on big data do so without having to confront the barriers that different languages pose. Whether it's multilingual search, translation, summarization, or automatic alignment of a transcript with video or audio, big data has to expand beyond single-language capability in order to be able to understand what's useful within that big data.
There are several features of the product that I think are special. The first is that the translation technology that underlies the Omnifluent platform is really a true hybrid machine translation capability. It's a combination of machine translation that includes rules-based and statistical engines, each of these engines working together as one within a single decision engine. An even more interesting feature of this linguistic platform that Omnifluent has is that in addition to that hybrid nature of translation, it also unifies text as well as speech on a single platform. Omnifluent provides automatic speech recognition and machine translation in a hybrid approach that fuses all of these components together, sharing linguistic resources to avoid the problem of compounding errors that result from integrating different pieces of technology. Since these sit on a single unified platform, they share all of the linguistic resources to provide the best possible output.
If you're combining results from both a rules-based engine and a statistical translation engine, how does your platform determine which result to use?
Jonathan Litchman: Well, that's the special nature of the decision engine itself. Our true hybrid technology can parse all the way down to the individual word level, not just the phrase or sentence level, and it is able to make a decision on which is the best output relative to what the communication is trying to say. There are times in which one word is better than another word as a result of the way the communication is written or the nuance of the meaning that needs to be communicated. The decision engine will evaluate each of the outputs together and fuse them into a single optimal output. A sentence may actually be a combination of outputs from the statistical and the rules-based engines simultaneously.
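Editor's illustration: the interview does not describe how the Omnifluent decision engine works internally, so the following is only a minimal sketch of the general idea of word-level fusion. It assumes, purely for illustration, that both engines return the same number of words in the same order and that each word carries a confidence score; the names `Candidate` and `fuse`, the scores, and the sample sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    word: str
    score: float  # hypothetical confidence in [0, 1], invented for this sketch
    engine: str   # which engine proposed the word


def fuse(rules_out: list[Candidate], stat_out: list[Candidate]) -> list[Candidate]:
    """Pick, position by position, the word with the higher score.

    Assumes the two outputs are already aligned word for word, which a
    real system could not take for granted.
    """
    if len(rules_out) != len(stat_out):
        raise ValueError("this sketch only handles equal-length, aligned outputs")
    # Ties go to the rules-based candidate; that choice is arbitrary.
    return [r if r.score >= s.score else s for r, s in zip(rules_out, stat_out)]


rules = [
    Candidate("the", 0.90, "rules"),
    Candidate("patient", 0.80, "rules"),
    Candidate("is", 0.90, "rules"),
    Candidate("stable", 0.40, "rules"),
]
stat = [
    Candidate("the", 0.70, "statistical"),
    Candidate("patient", 0.85, "statistical"),
    Candidate("is", 0.60, "statistical"),
    Candidate("steady", 0.70, "statistical"),
]

fused = fuse(rules, stat)
print(" ".join(c.word for c in fused))  # the patient is steady
print([c.engine for c in fused])        # ['rules', 'statistical', 'rules', 'statistical']
```

The fused sentence mixes words from both engines, which is the behavior described in the answer above; a production system would also need alignment, reordering, and context-aware scoring.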
That's excellent. Our audience is very focused on B2B and the enterprise level. Why do you feel SAIC is uniquely qualified to help individual corporations with their global and industry-specific translation needs?
Jonathan Litchman: There are really two reasons. The first is that we at SAIC actually use this technology and this capability internally for our purposes. We're a provider of the service, but we're the first user of the service as well. We see the positive impact that it has in terms of lowering our translation costs and speeding up the translation of the data that people need in order to do business. The second is that we have more than 600 high quality linguists on staff as part of our language service offerings, and we're able to leverage those human experts with computational linguists and software engineers to produce a product that is most effective from a linguistics standpoint. More importantly, because the linguists themselves have a role in the development of the technology, we are highly confident that the end result passes muster.
In addition, we've done benchmarking both for ourselves and for some of our external customers in the commercial enterprise space. What we've seen is that this is a very effective tool for them in a number of ways. Because we are able to provide this service both as a secure SaaS capability as well as on-premise, we can provide them individual, tailored language models that ensure their data is completely secure and that they maintain absolute ownership of their proprietary data. That's an important function in terms of international business – the protection of IP, the protection of business secrets. Because of the nature of our software as well as the services that go along with it, we really complement those concerns that enterprises have.
Security is always important. What languages are in greatest demand for translation? For example, what language is most requested in the United States?
Jonathan Litchman: In some ways it's not a tremendous surprise, but in the United States, Spanish is by far the most requested language in terms of translation. There's a huge multilingual population in the United States, and there are a lot of documents, media, and other kinds of communications – healthcare documentation in particular. We want to ensure that in our very multicultural population there's the ability to communicate effectively, and Spanish is far and away the language of choice in the United States. However, as the United States is involved in globalization, Asian languages – particularly Korean, Japanese, and Chinese – are increasingly in demand because that's the direction the global economy is taking us.
Could you tell us about Omnifluent Media and Omnifluent Talk and how they benefit your customers?
Jonathan Litchman: Omnifluent Media is a product line that really supports two aspects of media. The first is automated and aligned development of closed captions and subtitles. What this does for media content generators is provide a faster and less expensive way to produce both closed captions and subtitles in order to reach a larger audience, whether it’s reaching an audience in another language or following closed captioning requirements to reach people who are hearing impaired.
On the other side of the content role, which is the content users, the technology itself develops what we call rich metadata, which is an exact transcript of the media content itself. That allows people to perform advanced search, semantic search, and data analytics related to what is in the content itself. It’s a platform both for content producers and content users who want to gain the most information out of that data.
Omnifluent Talk, on the other hand, is really our API for developers. It allows people to develop their own technology and applications and incorporate our linguistic platform as a part of their service. More and more we're seeing that problems related to business, information, understanding, or communication are going to have a multilingual component, and we're offering this platform as a development tool so application developers can incorporate this feature into their product.
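Editor's illustration: the "rich metadata" idea in the Omnifluent Media answer above, a transcript aligned with the media so that content can be searched, can be sketched in a few lines. This is not the Omnifluent data format or API; the `Segment` structure, the `find_mentions` helper, and the sample transcript are hypothetical and assume only that each transcript segment carries start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the media
    end: float
    text: str     # transcript text spoken during this interval


def find_mentions(segments: list[Segment], query: str) -> list[tuple[float, float]]:
    """Return the (start, end) times of segments whose text contains the query."""
    needle = query.casefold()  # case-insensitive match
    return [(s.start, s.end) for s in segments if needle in s.text.casefold()]


# Invented sample transcript for illustration only.
transcript = [
    Segment(0.0, 4.5, "Welcome to today's program."),
    Segment(4.5, 9.0, "We begin with machine translation."),
    Segment(9.0, 13.0, "Later we turn to closed captions."),
    Segment(13.0, 18.0, "Translation quality depends on the domain."),
]

print(find_mentions(transcript, "translation"))  # [(4.5, 9.0), (13.0, 18.0)]
```

Because every hit comes back with its time range, a viewer could jump straight to the moment a term is spoken; semantic search and analytics would build on the same aligned text.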
How easy is it to integrate Omnifluent Talk into an application?
Jonathan Litchman: It couldn't be easier. It's a matter of just taking our API and integrating it into the software development process that you have.
Do your products replace human translators or would a company still need a human translator to validate and verify the results?
Jonathan Litchman: That is a great question. Technology and linguistic technology will never replace humans in any case when you need to have a translation be as close to perfect as possible. What this technology is most effective for is really two things. The first is it's really a human productivity-increasing technology, very much like spreadsheet technology was when it was first introduced to the accounting community. It's a tool that enables people to do the work that they do better and faster, but never replaces a human at the end.
The second thing is that there are a lot of cases in which it is simply impractical or impossible to have a human present when you need to communicate effectively from one language to another. In that case, the technology can provide the ability to communicate across languages without the addition of a human when it is appropriate to do it that way. There are times in which the output just doesn't need to be at human perfection level, and this technology gets to a quality that is really very effective in making communication meaningful.
Could you give us an example?
Jonathan Litchman: Let me give you two examples within one scenario to show these two different things. When you have a healthcare community, you're dealing with the administrative burdens of patient intake and the first line set of questions that are critical for access to healthcare but are not so critical in terms of life or death decision making. In the first case, patient intake, this kind of technology – whether it's on a mobile device, a laptop, or a desktop computer – is really very, very effective, both in terms of accuracy and cost effectiveness, at making the language barrier less of a problem so that patients are able to come in, fill out paperwork, and have correct and accurate health records set out right from the beginning. But when that transitions to a very serious and important dialogue with a physician over very specific kinds of questions that may be life or death related, that's when you absolutely want to make sure that you have a person involved to make sure that there are no mistakes of understanding or communication whatsoever.
That makes a lot of sense. What has surprised you the most over the last few years in working with translation?
Jonathan Litchman: What has surprised me the most is the quality of the output of translation, whether it’s text or speech, through the use of tailored language models. I’m not talking about out-of-the-box translation capability, but about that extra step when you really create the language that you're going to be using whether it’s neuroscience, engineering of a particular type, or point-of-sale activities for a retail outlet. The quality increase that you can get from this technology for the output has also surprised me. With that kind of tailoring, it really can almost approximate a human.
When I look at Google, that's a very generic type of translation. What you're saying is you have specific models across various industries so it actually talks the language of the industry?
Jonathan Litchman: That's exactly right. There's absolutely a place for easily and freely available general language models. Google is a perfect example. Microsoft is another example of these kinds of things. But as a general rule, what we have found is that when you want to be all things to all people, you can't ever achieve the level of quality output that you can when you're really being very specific to the particular user need. Language is a complex thing, and it's always evolving and changing. That can easily be seen when you look at how teenagers spoke 20 years ago and how they speak today. It's the same language, English if you will, but it's very different. Word choice is different, word meaning is different, and if you were to take a word-for-word translation of one cohort of teenagers from 20 years ago and compare it to today's teenagers, you would really have a problem with translation. That is also true with professional domain languages as well. The language that a particular company may use is different from another company, even in the same vertical business line. Certainly, the language between one vertical business line and another is very, very different. They may even – and we've seen this – use the exact same words and have very different meanings. If you had a general language model, you would use the one that's most common at the output, but the most common may not be the right one for that particular user. That's why we see such positive returns on our tailored language models for specific business users.
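Editor's illustration: the point above, that a general model outputs the most common meaning while a tailored model outputs the one right for a particular user, can be shown with a toy lookup. This is a minimal sketch, not how Omnifluent or any commercial engine selects words; the tables `GENERAL` and `AUTOMOTIVE`, the function `pick_translation`, and all of the counts are invented for the example.

```python
# Hypothetical English-to-Spanish translation counts, invented for illustration.
GENERAL = {
    "spring": {"primavera": 70, "resorte": 20, "manantial": 10},
    "engine": {"motor": 100},
}

# A tailored table for one vertical; it only needs to cover domain-specific terms.
AUTOMOTIVE = {
    "spring": {"resorte": 95, "primavera": 5},
}


def pick_translation(word, general, tailored=None):
    """Return the most frequent translation, preferring the tailored table."""
    table = (tailored or {}).get(word) or general.get(word)
    if not table:
        return word  # unknown word: pass it through unchanged
    return max(table, key=table.get)


print(pick_translation("spring", GENERAL))              # primavera
print(pick_translation("spring", GENERAL, AUTOMOTIVE))  # resorte
print(pick_translation("engine", GENERAL, AUTOMOTIVE))  # motor
```

The general table picks the season because that sense is most common overall, while the automotive table picks the mechanical part; words the tailored table does not cover fall back to the general one.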
Could you give us an example of some of the tailored models? Do you work with automakers?
Jonathan Litchman: We do work with the auto industry, which has its own tailored model because the word choices and the types of things that automotive companies are working with are different than, for example, oil companies. Another area in which we've seen our tailored models work very effectively is in the hospitality industry. When you're dealing with questions related to dining and to hotel use, it's a very specific language, and the same is true with fashion retail. We've seen incredible improvements with fashion retail. That's a set of language models that changes almost seasonally in terms of colors, the names of products, and so on. Language is very dynamic within all of these different industries. These industries share only the most basic common attributes with one another. They have so many pieces of language that are unique to their industry and sometimes even their own company.
That makes a lot of sense. So what is your greatest challenge in the market today?
Jonathan Litchman: I think the greatest challenge centers around expectations of what the technology in the Omnifluent platform – and, quite frankly, any technology in the translation and interpretation business – can provide. This is a technology that has been developed in the past 50 to 60 years, and the promises over that time have marginally exceeded the reality. I think that there is a false expectation about language. Everyone speaks their own language so they assume that communication and language are easy, but it's not so.
The challenge for us really is to talk about these tailored models and how even if you speak the same language – English, Chinese or Spanish – the language within that language is different from one area to another. What you should expect from this technology and how it's used as a productivity tool as opposed to a human replacement is another great example of the challenges that we face every single day. As an industry, when we’re able to correctly align customer expectations with the real value that this technology provides, I think you're going to see an explosion of its use by translators, interpreters, and non-multilingual people themselves.
Jonathan, thank you so much for providing us with this insight into SAIC’s translation and linguistics technology and the challenges companies face today because of the necessity to gain insight from data in many different languages.
Copyright 2004 — 2020. Powell Media, LLC. All rights reserved.
BeyeNETWORK™ is a trademark of Powell Media, LLC