Keywords — Information Graph, Natural Language Processing, Named Entity Recognition, Data Analytics and Data Visualizations
GIG is a large-scale system for storing, querying, and processing public information. With GIG, we aim to improve the quality of publicly available information in many social, political, and economic areas, making it easier and more efficient for businesses, the media, the broader public, and government to use. GIG can store data in different formats in a single structure, process that data to understand its content, and map it to data collected from other sources. In short, GIG gathers information from different sources and connects it to give an overall picture from different perspectives.
For example, GIG reads news articles from different national newspapers and processes their content to identify any people or organizations mentioned. It can then find how the same news has been reported in other newspapers. A news analyst or a journalist can use this information to assess the credibility and reliability of news sources by comparing how the same information is presented in each of them. This is just one of the many use cases of GIG. With creativity and imagination, developers can come up with innovative ideas to harness the full power of the GIG system.
How Does It Work?
Let’s dive deep into the GIG architecture to understand how the components of the system work together to provide scalable and dynamic data storage.
GIG Core and API
The GIG API server is the core of the system. It handles requests from the crawlers to store data in our MongoDB and MinIO storage. It identifies duplicate entities and avoids storing them twice. Instead, it compares the information in the two instances and merges the data where possible, keeping the timestamps. Timestamps allow us to track the modifications made to an entity over time until the end of its lifespan.
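The merge behavior described above can be illustrated with a minimal sketch. It is not the GIG server code (which is written in Go). The entity layout, the `merge_entities` and `latest` helpers, and the sample values are all assumptions made for illustration. Each attribute keeps a list of timestamped values, so merging a duplicate adds new values without discarding older ones.

```python
from datetime import datetime


def merge_entities(existing, incoming):
    """Merge two instances of the same entity (hypothetical layout).

    Each attribute maps to a list of {"value", "date"} records, so older
    values are kept and the history of an attribute can be replayed.
    """
    if existing["title"] != incoming["title"]:
        raise ValueError("entities do not describe the same thing")
    merged = {"title": existing["title"], "attributes": {}}
    for source in (existing, incoming):
        for name, records in source["attributes"].items():
            history = merged["attributes"].setdefault(name, [])
            for record in records:
                if record not in history:  # skip exact duplicates
                    history.append(dict(record))
    # Keep every history in chronological order.
    for history in merged["attributes"].values():
        history.sort(key=lambda record: record["date"])
    return merged


def latest(entity, name):
    """Return the most recent value of an attribute."""
    return entity["attributes"][name][-1]["value"]


stored = {
    "title": "Person P",
    "attributes": {
        "occupation": [{"value": "lawyer", "date": datetime(2015, 1, 1)}],
    },
}
crawled = {
    "title": "Person P",
    "attributes": {
        "occupation": [
            {"value": "lawyer", "date": datetime(2015, 1, 1)},
            {"value": "minister", "date": datetime(2020, 8, 1)},
        ],
        "party": [{"value": "Party X", "date": datetime(2020, 8, 1)}],
    },
}
merged = merge_entities(stored, crawled)
print(latest(merged, "occupation"))
```

Because values are appended rather than overwritten, the earlier occupation stays in the history and can still be queried by date.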
The core of the GIG and the server API are written in Golang. Check out our main repository at https://github.com/LSFLK/GIG. You may refer to the documentation for more details on how to deploy the server locally.
The crawlers are designed to extract data from different sources, including PDF files, scanned document images, websites, and spreadsheets. Once the raw data is extracted from the source, the crawler fits it into a GIG entity model, which supports dynamically adding attributes, and formats it into more processable content. The crawlers can then use the NER servers to read the content, identify key entities (people, organizations, locations) mentioned in it, and link the content to those entities in the system. This, in turn, schedules crawlers to find data related to the extracted entities on the internet.
For example, if article B mentions a person named P, our system immediately stores article B and tries to link it to person P if that person’s profile is already available in the system. Otherwise, it starts a new process to find data about person P from different internet sources, and the resulting profile is eventually connected with article B.
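To make the idea of dynamically added attributes concrete, here is a small sketch of how a crawler might fit extracted data into an entity. The `Entity` class, its field names, and the sample values are hypothetical stand-ins, not the actual GIG entity model.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Entity:
    """Hypothetical stand-in for a GIG entity; field names are assumptions."""
    title: str
    source: str = ""
    attributes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def set_attribute(self, name, value):
        # Attributes are not fixed in advance: any crawler can add new ones.
        self.attributes[name] = value
        return self  # returning self allows chained calls


article = (
    Entity(title="Article B", source="https://example.com/article-b")
    .set_attribute("content", "Text extracted from the page")
    .set_attribute("authors", ["A. Writer"])
    .set_attribute("tags", ["politics", "economy"])
)
print(json.dumps(asdict(article), indent=2))
```

A crawler for a different source could attach a completely different set of attributes to the same kind of object, which is what lets one structure hold data from PDFs, websites, and spreadsheets alike.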
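The link-or-schedule decision in this example can be sketched as follows. This is an illustration under assumptions, not GIG's implementation: the in-memory `store`, the `crawl_queue`, and the function names are hypothetical.

```python
from collections import deque


def link_or_schedule(article_title, mentioned, store, crawl_queue):
    """Link the article to known entities; queue a crawl for unknown ones."""
    links = store[article_title]["links"]
    for name in mentioned:
        if name in store:
            if name not in links:
                links.append(name)
        else:
            # Remember which article triggered the crawl so that the new
            # profile can be connected to it later.
            crawl_queue.append({"name": name, "link_to": article_title})


def finish_crawl(task, profile, store):
    """Store the newly crawled profile and connect the waiting article to it."""
    store[task["name"]] = {"links": [], **profile}
    store[task["link_to"]]["links"].append(task["name"])


store = {
    "Article B": {"links": []},
    "Organization O": {"links": []},
}
crawl_queue = deque()

# "Organization O" already has a profile, "Person P" does not.
link_or_schedule("Article B", ["Person P", "Organization O"], store, crawl_queue)

# Later, a crawler picks up the pending task and completes the link.
task = crawl_queue.popleft()
finish_crawl(task, {"summary": "profile gathered by a crawler"}, store)
print(store["Article B"]["links"])
```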
The main objective of GIG is to provide a platform on which developers can build data-oriented applications. The GIG ecosystem consists of a variety of modern technologies, and we have designed it to interface with alternative technologies if you prefer to implement a component on your own.
A major challenge in maintaining data from different sources is storing data in different formats. The data includes text content, lists, arrays, and so on. GIG uses the MongoDB database management system to provide scalability and dynamic data structuring.
With data coming from different sources, another problem we face is storing multimedia data. We also need to store this data asynchronously, without affecting the server’s response time, so that the crawlers keep running without interruptions or waiting. For that, we use the MinIO file server as the storage solution for multimedia data gathered from various sources. It handles storing multimedia files asynchronously, without us having to worry about it.
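The general idea of keeping file uploads off the request path can be sketched with a background worker. The example below uses only the Python standard library. It is neither the GIG server code nor the MinIO client API, and `handle_request`, `upload_worker`, and the in-memory `stored_files` dictionary are hypothetical placeholders.

```python
import queue
import threading

uploads = queue.Queue()
stored_files = {}  # stand-in for a bucket in an object store


def upload_worker():
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        item = uploads.get()
        if item is None:
            uploads.task_done()
            break
        name, data = item
        stored_files[name] = data  # a real worker would call the storage client here
        uploads.task_done()


def handle_request(entity_title, file_name, file_bytes):
    """Queue the file and answer immediately, without waiting for the upload."""
    uploads.put((file_name, file_bytes))
    return {"title": entity_title, "file": file_name, "status": "accepted"}


worker = threading.Thread(target=upload_worker, daemon=True)
worker.start()

response = handle_request("Article B", "article-b/photo.jpg", b"placeholder image bytes")
print(response["status"])

uploads.join()     # wait until the queued upload has been processed
uploads.put(None)  # ask the worker to stop
worker.join()
print(sorted(stored_files))
```

The crawler receives its response as soon as the file is queued, so slow storage does not hold up the next request.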
We have developed a Named Entity Recognition service to process data gathered by web scrapers and crawlers. It can identify the names of people, organizations, and locations in a given text. This service helps us read the content, identify the entities it mentions, and relate the content to them. Check out our Python wrapper for Stanford NER at https://github.com/umayangag/Standford-NER-python-wrapper
Another issue that arises when collecting data from different sources is the lack of a standard way of referring to persons, organizations, and locations. For example, Sri Lanka might be referred to as “sri lanka”, “lanka”, or “ceylon” in different sources. To resolve these ambiguities, we rely on normalization servers. The Normalization Service processes text to identify spelling mistakes and suggests known entity names by referring to the existing data in the system. In addition, we can use the service to process and clean the names of persons, organizations, and locations.
We have developed a minimalistic front end using React for testing and development purposes. Of course, you are welcome to develop your own third-party applications with attractive interfaces to provide a more user-friendly experience.
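A rough sketch of this kind of normalization is shown below. It is not the GIG Normalization Service. It uses `difflib.get_close_matches` from the Python standard library as a swapped-in matching technique, and the `KNOWN_ENTITIES` list, the `ALIASES` table, and the `normalize` function are assumptions standing in for the existing data in the system.

```python
import difflib

# Hypothetical stand-ins for entity names already stored in the system.
KNOWN_ENTITIES = ["Sri Lanka", "Colombo", "Kandy"]
ALIASES = {"lanka": "Sri Lanka", "ceylon": "Sri Lanka"}


def clean(name):
    """Collapse whitespace and lowercase the name for comparison."""
    return " ".join(name.split()).lower()


def normalize(name, cutoff=0.8):
    """Return the known entity name for `name`, or None if nothing is close."""
    key = clean(name)
    if key in ALIASES:
        return ALIASES[key]
    by_key = {clean(known): known for known in KNOWN_ENTITIES}
    # Fuzzy matching tolerates small spelling mistakes.
    matches = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[matches[0]] if matches else None


for raw in ["sri lanka", "ceylon", "  Sri  Lnka ", "Atlantis"]:
    print(repr(raw), "->", normalize(raw))
```

The alias table covers names that are not spelling variants at all (such as “ceylon”), while the fuzzy match covers typos. The `cutoff` value trades off tolerance against false matches and would need tuning on real data.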
- Web Server — https://github.com/LDFLK/GIG
- Development Kit — https://github.com/LDFLK/GIG-SDK
- Sample Crawlers — https://github.com/LDFLK/GIG-Scripts
- Minimalistic Frontend UI — https://github.com/LDFLK/GIG-Client
- NER Wrapper — https://github.com/umayangag/Standford-NER-python-wrapper