Law enforcement and intelligence communities worldwide are engaged in a fight against threats such as terrorism and organized crime (human trafficking and smuggling, drug smuggling, cybercrime). These threats are characterized not only by the use of different languages and dialects, but also by the use of different means of communication. Advances in communication interception and intelligence technology (SIGINT, COMINT, OSINT) generate large volumes of information in different formats (audio, video, unstructured text) and at a variety of dissemination speeds, all of which has to be analyzed. Given the public safety stakes of this analysis, finding actionable intelligence quickly is critical. How can we automate the ingestion of information and improve the detection of insights in the data?
One of the mandatory requirements is the translation of the content to improve the exploitation of the data. Here the human element is the weakest link in the chain: there are simply not enough human translators of Arabic, Urdu, Russian and other languages of interest. So, what could be the solution?
We can take advantage of the big data and analytics paradigm and apply it to this use case. Integrating SDL Machine Translation into an ingestion flow feeding a Data Lake, with advanced analytics on top, is a step in the right direction. Crucially, the translation software must work in an offline environment, operating behind the firewalls of customer facilities, and this is one of the key capabilities of SDL Machine Translation.
Data ingestion is the process by which data from a range of sources, with different structures and characteristics, is introduced into another storage and processing system. This concept fits the Government sector perfectly, where a great deal of information is gathered from different sources, in different formats and languages, and needs to be exploited and understood as quickly as possible. And of course, to understand the information, it must first be rapidly normalized into one common language.
The following needs to be considered in the data ingestion process:
- Data source and format. Questions like these need to be answered: Is it an Internet broadcast? Social network information? Intercepted documents? What volume is going to be ingested? Are we going to extend the sources in the future?
- Latency / availability. This concerns the time period between the ingestion of data and its use.
- Updates. Is data regularly modified?
- Transformations. Must the data be transformed? This is a crucial question here, because translation will be one of the main transformations performed. Other processes involved could be speech-to-text, named entity recognition, OCR, sentiment analysis, and so on. The latency these steps introduce into the process needs to be considered.
- Data Destination. At this point, it is necessary to know whether the data will have more than one destination: HDFS? Cassandra? Another fundamental question relates to the later exploitation of the information: how can the translation be linked to the original?
- Data Study. Can the quality of the data be measured? How can we establish the security access policy over the data? Regarding the translation process, can we obtain metrics on accuracy as a function of the quality of the original document?
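As a minimal sketch, the checklist above can be captured as routing logic executed before ingestion. Everything here (the `IngestItem` class, the `plan_transformations` function and its rules) is hypothetical and illustrative, not part of NiFi or SDL:

```python
from dataclasses import dataclass, field

# Hypothetical model of an item entering the ingestion flow.
# Field names and routing rules are illustrative only.
@dataclass
class IngestItem:
    source: str        # e.g. "broadcast", "social", "document"
    fmt: str           # "audio", "image" or "text"
    language: str      # detected or declared source language
    destinations: list = field(default_factory=lambda: ["hdfs"])

def plan_transformations(item: IngestItem, target_lang: str = "en") -> list:
    """Answer the checklist: which transformations does this item need
    before it can be exploited in the target language?"""
    steps = []
    if item.fmt == "audio":
        steps.append("speech_to_text")   # adds latency; see "Latency" above
    elif item.fmt == "image":
        steps.append("ocr")
    if item.language != target_lang:
        steps.append("translate")        # keep a link back to the original
    steps.append("named_entity_recognition")
    return steps

# Example: an intercepted Arabic audio broadcast.
item = IngestItem(source="broadcast", fmt="audio", language="ar")
print(plan_transformations(item))
# ['speech_to_text', 'translate', 'named_entity_recognition']
```

A plan like this makes the latency and destination questions explicit per item, instead of leaving them implicit in the flow.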
At Future Space, Apache NiFi is the technology chosen to create ingestion flows, which can be integrated with the SDL ETS software. This technology fits perfectly into a scalable big data architecture where files can be transferred to HDFS (Hadoop Distributed File System).
Apache NiFi is a widely used tool of the Apache Foundation (https://nifi.apache.org/) which allows the automation and management of data flows between different systems such as databases, filesystems, Hadoop clusters and web services. It acts on the data in near real time (harvesting, format conversion, analysis, filtering, enrichment, transformation, translation) and makes sense of environments with high data volumes. These flows are defined through a web user interface, and the functionality can be extended and customised with the client's own code.
Four principal concepts should be taken into consideration, all of which are related to the translation process and integration with SDL’s leading Neural Machine Translation solution:
- Flowfiles. The data managed by NiFi is encapsulated in an object called a “flowfile” (FF): the file or content to be translated, together with a set of metadata known as “attributes”.
- Processors. Processors are code containers designed to carry out operations on the FFs. To implement a NiFi flow, a sequence of processors has to be created, reproducing the sequence of actions to be taken on the data. In our case, the processors are in charge of text extraction and of invoking the SDL Machine Translation web services to detect the language and translate the content. They also decide whether or not to send the data for translation.
- Connections. Connections are directed links, each with a source and a destination. They establish the data flow itself, determining the movement of FFs between processors.
- Controller Services. These are shared services that NiFi elements such as processors can use to carry out their tasks. Examples include readers and writers for different file formats, database connections and distributed cache services.
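The four concepts above can be sketched in a few lines of runnable Python. The `FlowFile` class and the processor functions are simplified stand-ins for NiFi's actual (Java) APIs, and `detect_language` and `translate` are hypothetical stubs, not real SDL service calls:

```python
# Simplified stand-ins for NiFi concepts; not the real NiFi (Java) API.

class FlowFile:
    """A flowfile (FF): content plus a dictionary of attribute metadata."""
    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = attributes or {}

# Hypothetical stubs standing in for the SDL web service calls.
def detect_language(text):
    # Toy heuristic: treat any Arabic-block character as Arabic.
    return "ar" if any("\u0600" <= ch <= "\u06ff" for ch in text) else "en"

def translate(text, source, target="en"):
    return "[{}->{}] {}".format(source, target, text)  # placeholder output

# Processors: operations applied to a flowfile.
def detect_processor(ff):
    ff.attributes["language"] = detect_language(ff.content)
    return ff

def translate_processor(ff):
    lang = ff.attributes.get("language", "en")
    if lang != "en":                           # decide whether to translate
        ff.attributes["original"] = ff.content # keep the link to the original
        ff.content = translate(ff.content, lang)
    return ff

# Connections: an ordered pipeline moving the FF between processors.
pipeline = [detect_processor, translate_processor]

ff = FlowFile("مرحبا")
for processor in pipeline:
    ff = processor(ff)
print(ff.attributes["language"], ff.content)
```

In real NiFi the processors would be configured in the web UI and the translation step would call the SDL Machine Translation web service; the point here is only how content, attributes and directed connections interact.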
In conclusion, in a demanding, secure environment such as that of Government, NiFi is a perfect driver for automating the translation of documents and content that need to be analyzed in a timely manner. Any Government Intelligence or Law Enforcement organization experiences serious difficulties when managing vast amounts of multilingual data: there is an overwhelming amount of material to be translated, and an insufficient number of translators.
This challenging situation is greatly eased by combining Apache NiFi (for data flow automation and provenance tracking) with SDL's state-of-the-art secure Neural Machine Translation solution. Together they make the normalization and automated translation of content possible, improving the speed of data ingestion; NiFi is a perfect ally for SDL translation software in use cases involving high volumes of documents. Interested in learning more? Visit the Future Space Homeland Security site.
This article has been drafted by George Bara for Future Space. Future Space is a Spanish software engineering company, founded in 1996, which specializes in developing intelligence solutions for LEAs and the Intel community, amongst others. One of its key differentiators is its Open Source Intelligence technology, covering the design, implementation and deployment of Big Data and Advanced Analytics systems that integrate a multitude of sources, ingest information and provide tailored analytics.
Published on December 16, 2019 in Machine Translation