The issue of data security and machine translation

Kirti Vashee 12 Dec 2019 7 min read

As the world becomes more digital and the volume of mission-critical data flows continues to expand, it is becoming increasingly important for global enterprises to adapt to rapid globalization and the increasingly digital-first world we live in. As organizations change the way they operate, generate revenue and create value for their customers, new compliance risks are emerging. This presents a challenge to compliance functions, which must proactively monitor, identify, assess and mitigate risks such as those tied to fundamentally new technologies and processes. Digital transformation is driven and enabled by data, and thus data security and governance also rise in importance and organizational impact. At the World Economic Forum in Davos, CEOs identified cybersecurity and data privacy as two of the most pressing issues of the day, and even regard failures in these areas as a threat to enterprise, society, and government in general.

While C-level executives understand the need for cybersecurity as their organizations undergo digital transformation, they aren’t prioritizing it enough, according to a recent Deloitte report based on a survey of 500 executives. The report, “The Future of Cyber Survey 2019,” reveals a disconnect between organizations’ aspirations for a “digital everywhere” future and their actual cyber posture. Those surveyed view digital transformation as one of the most challenging aspects of cyber risk management, yet indicated that less than 10% of cyber budgets are allocated to these digital transformation efforts. The report goes on to say that this larger cyber awareness is at the center of digital transformation, that understanding this is as transformative as cyber itself, and that to be successful in this new era, organizations should embrace a “cyber everywhere” reality.

Cybersecurity breakdowns and data breach statistics

Are these growing concerns about cybersecurity justified? It certainly seems so when we consider these facts:

A global survey in 2018 by CyberEdge across 17 countries and 20 industries found that 78% of respondents had experienced a network breach.
An ISACA survey of cybersecurity professionals points out that it is increasingly difficult to recruit and retain technically adept cybersecurity professionals. The survey also found that 50% of cyber pros believe that most organizations underreport cybercrime even when they are required to report it, and 60% said they expected at least one attack within the next year.
Radware estimates that the average cyber-attack in 2018 cost an enterprise around $1.67M. The costs can be significantly higher: a breach at Maersk is estimated to have cost around $250 to $300 million, because of the brand damage, loss of productivity, loss of profitability, falling stock prices, and other negative business impacts in the wake of the breach.
Risk Based Security reports that there were over 6,500 data breaches in 2018 and that more than 5 billion records were exposed. The situation is no better in 2019: over 4 billion records were exposed in the first six months of the year.
An IBM Security study examined the financial impact of data breaches on organizations. According to this study, the cost of a data breach has risen 12% over the past 5 years and now averages $3.92 million. The average cost of a data breach in the U.S. is $8.19 million, more than double the worldwide average.
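The IBM figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers cited in this article; the "implied earlier average" is our own back-calculation and assumes the 12% figure is a simple total increase over the five-year period, which the article does not confirm.

```python
# Figures quoted from the IBM Security study cited above (millions of USD).
GLOBAL_AVG_COST = 3.92
US_AVG_COST = 8.19
FIVE_YEAR_INCREASE = 0.12  # assumption: a simple total rise, not an annual rate

# How many times larger the U.S. average is than the worldwide average.
us_to_global_ratio = US_AVG_COST / GLOBAL_AVG_COST

# Back-calculated worldwide average before the 12% rise (our derivation).
implied_earlier_avg = GLOBAL_AVG_COST / (1 + FIVE_YEAR_INCREASE)

print(f"U.S. average is {us_to_global_ratio:.2f}x the worldwide average")
print(f"Implied worldwide average five years earlier: ${implied_earlier_avg:.2f}M")
```

The ratio comes out at a little over 2, which is consistent with the "more than double" statement.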

As would be expected, with hacking as the top breach type, attacks originating outside the organization were also the most common threat source. However, misconfigured services, data handling mistakes and other inadvertent exposure by authorized persons exposed far more records than malicious actors were able to steal.

Data security and cybersecurity in the legal profession

Third-party professional services firms are often a target for malicious attacks because the chance of acquiring high-value information is high. Records show that law firms’ relationships with third-party vendors are a frequent point of exposure to cyber breaches and accidental leaks. Law.com obtained a list of more than 100 law firms that had reported data breaches, and estimates that even more are falling victim to this problem but simply don’t report it, to avoid scaring clients and to minimize potential reputational damage.

Austin Berglas, former head of the FBI’s cyber branch in New York and now global head of professional services at cybersecurity company BlueVoyant, said law firms are a top target among hackers because of the extensive high-value client information they possess. Hackers understand that law firms are a “one-stop shop” for sensitive and proprietary corporate information, data related to mergers and acquisitions, and emerging intellectual property information.

As custodians of highly sensitive information, law firms are inviting targets for hackers.

The American Bar Association reported in 2018 that 23% of firms had reported a breach at some point, up from 14% in 2016. Six percent of those breaches resulted in the exposure of sensitive client data. Legal documents pass through many hands as a matter of course: reams of sensitive information are handled by lawyers and paralegals, and are then reviewed and signed by clients, clerks, opposing counsel, and judges. When the documents finally reach the location where records are stored, they are often inadvertently exposed to others, even firm outsiders, who shouldn’t have access to them at all.

The Logicforce legal industry score for cybersecurity health among law firms increased from 54% in 2018 to 60% in 2019, but this is still lower than in many other sectors. Increasingly, clients are also asking for audits to ensure that security practices are current and robust. A recent ABA Formal Opinion states: “Indeed, the data security threat is so high that law enforcement officials regularly divide business entities into two categories: those that have been hacked and those that will be.”

Lawyers are failing on cybersecurity, according to the American Bar Association Legal Technology Resource Center’s ABA TechReport 2019. “The lack of effort on security has become a major cause for concern in the profession.”

“A lot of firms have been hacked, and like most entities that are hacked, they don’t know that for some period of time. Sometimes, it may not be discovered for a minute or months and even years,” said Vincent I. Polley, a lawyer and co-author of a recent book on cybersecurity for the ABA.

As the volume of multilingual content explodes, a new risk emerges: public, “free” machine translation provided by large internet services firms that systematically harvest and store the data that passes through these “free” services. With significantly higher volumes of cross-border partnerships, globalization in general, and growth in international business, employee use of public MT has become a new source of confidential data leakage.

Public machine translation use and data security

It is estimated that on any given day, several trillion words are run through the many public machine translation options available across the internet. This huge volume of translation is done largely by the average web consumer, but there is increasing evidence that a growing portion of this usage comes from the enterprise when urgent global customer, collaboration, and communication needs are involved. This happens because publicly available tools are essentially frictionless and require little “buy in” from users, who often don’t understand the data leakage implications. The rapid increase in globalization has resulted in a substantial and ever-growing volume of multilingual information that needs to be translated instantly as a matter of ongoing business practice. This is a significant risk for the global enterprise or law firm, as this short video points out. Content that users submit for translation is subject to terms of use agreements that entitle the MT provider to store, modify, reproduce, distribute, and create derivative works from it. At the very least, this content is fodder for machine learning systems that could themselves be hacked or expose data inadvertently.

Consider the following:

At the recent Connect 2019 conference, a speaker from a major US semiconductor company described the use of public MT at his company. When this activity was carefully monitored by IT management, they found that as much as 3 to 5 GB of enterprise content was being cut and pasted into public MT portals for translation on a daily basis. Further analysis revealed that the material submitted for translation included future product plans, communications about customer problems, sensitive HR issues, and other confidential business process content.
In September 2017, the Norwegian news agency NRK reported finding data that had been translated for free on a site called Translate.Com, including “notices of dismissal, plans of workforce reductions and outsourcing, passwords, code information and contracts”. This was yet another site that offered free translation but reserved the right to examine the data submitted “to improve the service.” Subsequent searches by Slator uncovered other highly sensitive personal and corporate data.
A recent report from the Australian Strategic Policy Institute (ASPI) makes some claims about how China uses state-owned companies that provide machine translation services to collect data on users outside China. The author, Samantha Hoffman, argues that the most valuable tools in China’s data-collection campaign are technologies that users engage with for their own benefit, machine translation services being a prime example. This is done through a company called GTCOM, which, according to Hoffman, describes itself as a “cross-language big data” business and offers hardware and software translation tools that collect large amounts of data. She estimated that GTCOM, which works with both corporate and government clients, handles the equivalent of up to five trillion words of plain text per day, across 65 languages and in over 200 countries. GTCOM is a subsidiary of a Chinese state-owned enterprise that the Central Propaganda Department directly supervises, and thus data collection is presumed to be an active and ongoing process.
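The kind of monitoring described in the semiconductor company example can be approximated from web proxy logs. The following is a minimal sketch only, not the method that company used: the CSV log layout (timestamp, host, bytes_sent), the function name daily_mt_upload_bytes, and the watch list of hostnames are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative watch list only (an assumption); a real deployment would
# maintain its own list of public MT portals.
PUBLIC_MT_HOSTS = {"translate.google.com", "www.deepl.com", "www.translate.com"}


def daily_mt_upload_bytes(log_file, hosts=PUBLIC_MT_HOSTS):
    """Sum bytes uploaded to watched hosts, grouped by calendar date.

    Expects CSV rows with the hypothetical columns
    timestamp (ISO 8601), host, bytes_sent.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(log_file):
        if row["host"].lower() in hosts:
            day = row["timestamp"][:10]  # "YYYY-MM-DD" prefix of the timestamp
            totals[day] += int(row["bytes_sent"])
    return dict(totals)


# Made-up sample data, used only to exercise the function.
SAMPLE_LOG = """timestamp,host,bytes_sent
2019-11-04T09:15:00,translate.google.com,1200
2019-11-04T10:02:00,intranet.example.com,50000
2019-11-04T16:40:00,www.deepl.com,800
2019-11-05T08:30:00,translate.google.com,4500
"""

report = daily_mt_upload_bytes(io.StringIO(SAMPLE_LOG))
for day, total in sorted(report.items()):
    print(day, total)
```

A daily total like this only shows how much is leaving the organization. Finding out what kind of content is being submitted, as in the example above, would require separate content analysis.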

After taking a close look at enterprise market needs and the current realities of machine translation use, we can summarize the situation as follows:

There is a growing need for always-available and secure enterprise MT solutions to support the digitally driven globalization that we see happening in so many industries today. In the absence of such a secure solution, we can expect substantial “rogue use” of public MT portals, with the resulting risk of confidential data leakage.
The risks of using public MT portals are now beginning to be understood. The risk is not just inadvertent data leakage; it is also closely tied to the data security and privacy risks of submitting confidential content to the data-grabbing machine learning infrastructure that underlies these “free” MT portals. There is a growing list of US companies already subjected to GDPR-related EU regulatory actions, including Amazon, Apple, Facebook, Google, Netflix, Spotify and Twitter. Experts have stated that Chinese companies are likely to be the next wave of regulatory enforcement, and the violators list is expected to grow.
The executive focus on digital transformation is likely to drive more attention to the concurrent cybersecurity implications of hyper-digitalization. Information governance is likely to become much more of a mission-critical function as the digital footprint of the modern enterprise grows and becomes much more strategic.

The legal market requirement: an end to end solution

Thus, the ability to translate at scale has become an imperative for the modern global enterprise. Translation needs can range from rapid translation of millions of documents in an eDiscovery or compliance scenario, to very careful and specialized translation of critical contracts and court-ready documentation, to an associate collaborating with colleagues in a foreign office. Daily communications in global matters are increasingly multilingual. Given the volume, variety and velocity of the information that needs translation, legal professionals must consider translation solutions that involve both technology and human services. The requirements can vary greatly and can call for different combinations of man-machine collaboration, including some or all of these translation production models:

MT only, for very high volumes such as eDiscovery and daily communications
MT + Human Terminology Optimization
MT + Post-Editing
Specialized Expert Human Translation

Machine Translation: designed for the enterprise

RWS is a leader in developing secure, private, scalable enterprise-ready MT technology that can be deployed on premise, or in a private cloud, and also provides related expert services to ensure optimal tailored deployment. SDL’s NLP technology team bench is deeper than any other in the translation industry and the company’s MT technology is used by the largest global enterprises in the world, as well as many governmental agencies focused on national security and intelligence gathering activities. From the outset, RWS has focused on developing enterprise-friendly capabilities that include the following:

Guaranteed data security & privacy
Flexible deployment options that include on premise, cloud or a combination of both as dictated by usage needs
Broad range of adaptation and customization capabilities so that MT systems can be optimized for each individual client
Integration with primary enterprise IT infrastructure and software e.g. Office, Translation Management Systems, Relativity and other eDiscovery platforms
REST API that allows connectivity to any proprietary systems that you may employ
Broad range of expert consulting services both on the MT technology aspects and the linguistic issues
Tight integration with professional human translation services to handle end-to-end translation requirements

RWS's translation capabilities range from handling large eDiscovery litigation-related projects, using MT enhanced with expert-developed, client-specific glossaries and search terms to improve the ability to identify relevant documents, to specialized and expert human translation services for critical content. Its secure translation supply chain solution provides an enterprise-class, vendor-agnostic, secure translation platform that allows you to combine regulatory compliance and translation best practice. RWS has the most sophisticated and comprehensive end-to-end translation solution capabilities in the industry today, powered by over 1,400 in-house translators working closely with tools and technology enabled by linguistic AI.

To find out how RWS can support your multilingual eDiscovery-related data processing and translation strategy, please visit our multilingual eDiscovery pages, which will provide more insight on what we can do for you. To find out more about the Relativity end-to-end translation solutions capabilities look here, or watch this short video to learn how we can help.

Kirti Vashee

Independent Language Technology Consultant

Kirti Vashee is an independent language technology consultant, specializing in machine translation and translation technology. He was also with Asia Online and was previously responsible for worldwide business development and marketing strategy at statistical MT pioneer Language Weaver, prior to its acquisition by SDL. Kirti has long-term sales and marketing experience in the enterprise software industry, working both at large global companies (EMC, Legato, Dow Jones, Lotus, Chase) and at several successful startups.
