Seravia’s goal is to acquire public filings and make the information in them more useful and accessible to people and companies. To handle a massive volume of filings and a diverse set of filing types, we have developed a fully automated data supply chain. The purpose of the data supply chain is to take a source filing from a government agency and output a structured and linked object that can be used for research and analysis. To accomplish this, the data supply chain must be:
- Automated, able to retrieve and publish data without human involvement
- Extensible, able to handle new datasources and datatypes with minimal engineering effort
- Accurate, scalable and cost-efficient
This paper describes the six key steps in the data supply chain: Data Acquisition; Extraction, Transformation, Load (ETL); Analysis; Search; Display; and Automation.
Data Acquisition
The raw inputs for the data supply chain are filings stored in proprietary government datasources. For our purposes, a datasource is defined as any remote repository of filings, regardless of format, storage device or transport method. Examples range from dynamic HTML-based websites to password-protected FTP sites with compressed bulk text files.
Historically, government agencies have not been able to make their filings widely available due to outdated technologies and budget constraints. Therefore, one requirement of our data acquisition technology is that no work or customization be required of the source provider. We must work with whatever formats and systems agencies currently use and migrate to new formats as they upgrade their systems.
The Seravia Data Fetcher is a proprietary platform that is responsible for scheduling, monitoring and managing data retrieval from disparate government datasources. The Data Fetcher architecture generally adheres to these requirements:
Extensibility
Given the variety of datasources, the Data Fetcher architecture supports a plugin-style Job model. Each Job contains a small unit of work, such as an HTTP request or a parsing assignment. The current Job library supports over 200 operations. An inheritance model allows the most common tasks to be centralized and easily overridden to meet the needs of specific datasources. Jobs are defined using a powerful and extensible XML-based configuration, which allows non-experts to prepare new datasources while providing scripting support for unusually complex ones. Developing the data-extraction code for a new datasource usually takes only a few hours.
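To make the plugin-style Job model concrete, the following is a minimal Ruby sketch of what such a hierarchy could look like. The class and method names (Job, HttpGetJob, AcmeRegistryJob, #perform) are hypothetical illustrations, not the actual Data Fetcher API, and the XML configuration layer that wires Jobs together is omitted.

```ruby
# Hypothetical sketch of a plugin-style Job model. Each Job wraps one small
# unit of work; common behavior lives in base classes and is overridden for
# datasource-specific quirks.
require 'net/http'
require 'uri'

class Job
  def initialize(params = {})
    @params = params
  end

  # Subclasses override #perform with their unit of work.
  def perform
    raise NotImplementedError, "#{self.class} must implement #perform"
  end
end

class HttpGetJob < Job
  # Fetches a single URL; subclasses can override #url to adapt to a
  # particular agency site.
  def url
    @params.fetch(:url)
  end

  def perform
    Net::HTTP.get_response(URI(url)).body
  end
end

# A datasource-specific override: same operation, different entry point.
class AcmeRegistryJob < HttpGetJob
  def url
    "https://example.gov/registry?page=#{@params.fetch(:page, 1)}"
  end
end
```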
“Deep Web” Ability
The complexity of many government datasources requires the Data Fetcher to support “deep web” crawling. A “deep web” datasource is a website whose content is not easily reached by standard web search engines such as Google, whether because there are no direct links to it from other pages, because the data is generated only in response to a search, or because it sits behind a login. The Data Fetcher supports these operations and more for websites in over 10 languages.
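As an illustration only, the sketch below shows the kind of “deep web” retrieval described above (log in, submit a search form, follow result links) using the Ruby Mechanize gem. The URLs, form fields and file paths are invented for the example; they are not an actual agency site or the Data Fetcher's internals.

```ruby
# Hypothetical deep-web crawl: authenticate, run a search, and save the raw
# result pages for later extraction. All site details are made up.
require 'mechanize'
require 'fileutils'

FileUtils.mkdir_p('raw')
agent = Mechanize.new

# 1. Authenticate against a password-protected datasource.
login_page = agent.get('https://example.gov/login')
login_form = login_page.forms.first
login_form['username'] = ENV['AGENCY_USER']
login_form['password'] = ENV['AGENCY_PASS']
agent.submit(login_form)

# 2. The filings are only reachable through a search form, not direct links.
search_page = agent.get('https://example.gov/filings/search')
search_form = search_page.forms.first
search_form['company_name'] = 'ACME'
results = agent.submit(search_form)

# 3. Follow each result link and keep the raw HTML.
results.links_with(href: %r{/filings/\d+}).each do |link|
  filing_page = link.click
  File.write("raw/#{File.basename(link.href)}.html", filing_page.body)
end
```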
High Availability
The Data Fetcher is able to recover from various failures and delivers 24/7 crawling services. A semi-intelligent Job management system helps optimize data retrieval while handling the various types of errors that can occur at runtime, such as loss of network connection, slow network speeds, overwhelmed source servers, or recurring datasource downtimes. The Data Fetcher can retry Jobs, log errors, save state, and resume from points of failure.
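The sketch below illustrates the sort of retry, checkpoint and resume behavior described above. It is a simplified assumption of how such logic could be written in Ruby, not the Data Fetcher's actual error-handling code; the Job objects are the hypothetical ones from the earlier sketch.

```ruby
# Illustrative retry/resume logic: failed Jobs are retried with exponential
# backoff, and finished Job ids are checkpointed so a crashed run can resume
# from the point of failure instead of starting over.
require 'json'

CHECKPOINT = 'fetch_checkpoint.json'

def with_retries(max_attempts: 5, base_delay: 2)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts >= max_attempts
    sleep(base_delay**attempts)   # back off when the network or source is struggling
    retry
  end
end

def completed_ids
  File.exist?(CHECKPOINT) ? JSON.parse(File.read(CHECKPOINT)) : []
end

def checkpoint(job_id)
  File.write(CHECKPOINT, JSON.generate(completed_ids << job_id))
end

def run(jobs)
  done = completed_ids
  jobs.each do |job|
    next if done.include?(job.id)   # resume: skip work already finished
    with_retries { job.perform }
    checkpoint(job.id)
  end
end
```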
Scalability
The Data Fetcher must manage a large number of semi-autonomous Jobs running in parallel and must be able to scale across an arbitrary number of machines. A centralized controller distributes and manages Jobs. The Data Fetcher is a Ruby-based system optimized for deployment across Amazon EC2 clusters. All source data is stored on Amazon S3, which provides effectively unlimited storage along with strong data security and disaster recovery.
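As a sketch of the scaling model (and only a sketch: the real controller is more elaborate), the Ruby below drains a shared queue of Jobs with a configurable number of workers and writes each fetched document to S3 with the aws-sdk-s3 gem. The bucket name and key layout are assumptions for the example.

```ruby
# Illustrative worker pool: a central queue hands Jobs to workers, and the
# fetched source documents are written straight to Amazon S3.
require 'aws-sdk-s3'

S3     = Aws::S3::Client.new(region: 'us-east-1')
BUCKET = 'example-source-filings'   # hypothetical bucket name

def store(key, body)
  S3.put_object(bucket: BUCKET, key: key, body: body)
end

def run_workers(jobs, worker_count: 8)
  queue = Queue.new
  jobs.each { |job| queue << job }

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        job = begin
                queue.pop(true)     # non-blocking pop; raises when empty
              rescue ThreadError
                break               # queue drained, worker exits
              end
        store("raw/#{job.id}", job.perform)
      end
    end
  end
  workers.each(&:join)
end
```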
Extraction, Transformation, Load (ETL)
In an ideal world there would be a single universal data representation for company registrations, patents, trademarks and other legal concepts. In reality, datatypes are not standardized across jurisdictions or even across agencies within the same jurisdiction. Efforts within Government 2.0 and industry standards groups attempt to improve the situation, but adoption is slow and inconsistent. The ETL step in the data supply chain extracts data from non-standard source filings, transforms the data into a common datatype-specific schema, and loads the data into our data warehouse. This allows users and our analysis code to compare similar datatypes that come from different jurisdictions and sources. The ETL steps include:
Extraction
The goal of extraction is to standardize the format of source data. Government data comes in a variety of data formats - XML, HTML, PDF, database formats, plain text, etc. In most cases the data is extracted using custom XPath-based parsers defined by data analysts. In rare situations the data is less structured and requires custom Python parsers written by engineers.
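For illustration, here is what a small XPath-based parser might look like in Ruby using the Nokogiri gem. The XPath expressions, file name and field names are hypothetical; actual parsers are defined per datasource by the data analysts.

```ruby
# Hypothetical XPath-based extraction from a saved HTML filing.
require 'nokogiri'

doc = Nokogiri::HTML(File.read('raw/registration_12345.html'))

record = {
  'company_name'      => doc.xpath('//table[@id="summary"]//tr[1]/td[2]').text.strip,
  'registration_date' => doc.xpath('//table[@id="summary"]//tr[2]/td[2]').text.strip,
  'status'            => doc.xpath('//table[@id="summary"]//tr[3]/td[2]').text.strip,
  'officers'          => doc.xpath('//table[@id="officers"]//td[@class="name"]').map { |n| n.text.strip }
}
```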
Transformation
Using the open-source Pentaho platform and the extensions that we have developed, our data analysts are able to convert from the proprietary datasource schemas to universal schemas. Each filing is stamped with a unique id, dates and currencies are standardized, duplicates are removed, and proprietary values are mapped to shared ones. Data tests are performed before and after ETL to verify correctness of the data or provide reports on errors in the original source filings. Almost all business logic is kept within the transformation step of ETL, greatly simplifying the data steps that follow.
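In production these rules live inside Pentaho transformations; the Ruby sketch below only illustrates the kinds of operations involved (stable ids, date and currency normalization, de-duplication, value mapping) on hypothetical field names and codes.

```ruby
# Illustrative transformation step on extracted records (hashes of strings).
require 'digest'
require 'date'

STATUS_MAP = { 'ACT' => 'active', 'DIS' => 'dissolved', 'SUS' => 'suspended' }

def transform(records)
  records.map { |r|
    {
      'id'                => Digest::SHA1.hexdigest(r.values_at('jurisdiction', 'file_number').join(':')),
      'company_name'      => r['company_name'].to_s.strip,
      'registration_date' => Date.parse(r['registration_date']).iso8601,   # standardize dates
      'capital_usd'       => r['capital'].to_s.gsub(/[^\d.]/, '').to_f,    # strip currency formatting
      'status'            => STATUS_MAP.fetch(r['status'], 'unknown')      # map proprietary values to shared ones
    }
  }.uniq { |r| r['id'] }                                                   # remove duplicates
end
```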
Load
The final step in ETL is to load the data into a datastore for use by client applications. In our case the output of ETL takes two forms: JSON (JavaScript Object Notation) and TSV (tab-separated values) files. The JSON is loaded into a MongoDB cluster for use by the Seravia.com web application. The TSV files are passed on to the Analysis system.
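A minimal sketch of the load step is shown below, assuming the transformed records are available as a JSON file: the same data is inserted into MongoDB for the web application (via the Ruby mongo gem) and written out as TSV for the Analysis system. Connection details, file paths and collection names are invented for the example.

```ruby
# Illustrative load step: JSON documents into MongoDB, TSV for analysis.
require 'json'
require 'mongo'
require 'csv'

records = JSON.parse(File.read('transformed/filings.json'))   # hypothetical post-ETL output

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'seravia')
client[:filings].insert_many(records)

CSV.open('filings.tsv', 'w', col_sep: "\t") do |tsv|
  tsv << records.first.keys                  # header row
  records.each { |r| tsv << r.values }
end
```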
Each Pentaho ETL script is architected to run on an entire dataset or on just the subset of data affected by new filings. This allows the system to stay current without reprocessing existing data. The datasource-specific ETL scripts are run continually by clusters of Amazon EC2 server instances.
Analysis
Once the data is gathered, standardized and cleansed, it is ready for analysis. The goal of analysis is to extract useful information from the filing data by identifying entities and activities that span more than one filing. For our purposes, Analysis can be treated as synonymous with Data Warehousing and Business Intelligence. The outputs of the Analysis step are reports that provide insight into companies, people, industries and trends. These insights include but are not limited to:
- Company Clusters. Subsidiaries, aliases and other company-to-company relationships
- Person Connections. Hidden associations and potential conflicts of interest among people
- Key Personnel. People associated with companies and industries and their relative importance
- Exposure. Competitive brands, inventions and other assets that may pose risks
- Locations. All places associated with a company or person, cleansed and clustered according to actual geo-location
- Comparables. Potential competitors and partners based on content in filings
With over 200 million filings and over 1 billion entities and relationships, analyzing the data using standard relational database querying methods is impractical. Instead we use four technologies commonly found in big data environments: Hadoop, Amazon Elastic MapReduce (EMR), Hive and Mahout.
Hadoop is an open-source implementation of Google’s MapReduce architecture that allows large numbers of computations to run in parallel. Amazon’s EMR provides an environment in which we can bring large clusters online for analysis in a just-in-time model. Hive sits on top of Hadoop and enables data analysts to query the data using a simple SQL-based language that they already know. For advanced analysis, such as classification, clustering, and collaborative filtering, we use Mahout, a highly scalable machine-learning platform that also sits on top of Hadoop.
Our Hive-based Analysis infrastructure has three levels:
Data
At the Data level we map our standardized schemas to data structures optimized for querying. This level converts the post-ETL data into a format ideal for high-performance analysis while adjusting for variations in the underlying data.
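One plausible way to express that mapping (shown only as an assumption, not our actual table definitions) is a Hive external table declared over the post-ETL TSV files in S3; the HiveQL is embedded here as a Ruby string, with hypothetical table, column and bucket names.

```ruby
# Hypothetical Data-level DDL: expose the post-ETL TSV output to Hive.
DATA_LEVEL_DDL = <<~'HIVEQL'
  CREATE EXTERNAL TABLE filings (
    id                STRING,
    company_name      STRING,
    registration_date STRING,
    capital_usd       DOUBLE,
    status            STRING
  )
  PARTITIONED BY (jurisdiction STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://example-bucket/post-etl/filings/';
HIVEQL
```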
Relationships
Once the data has been prepared, common entities (companies, persons) and relationships among entities and filings are identified and stored for further analysis. This step involves a number of operations that identify identical entities despite common issues: addresses that are slightly different or incomplete; proper names that include middle initials, nicknames or misspellings; companies that operate under more than one name or legal entity. More complex data clustering, such as identifying patent similarities, also takes place at this level.
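The Ruby below sketches the flavor of these entity-resolution rules on hypothetical records; the real implementation runs at scale on Hadoop rather than as plain Ruby.

```ruby
# Illustrative matching logic for identifying identical entities.
def normalize_company(name)
  name.to_s.downcase
      .gsub(/\b(incorporated|inc|corporation|corp|llc|ltd|co)\b\.?/, '')  # strip legal suffixes
      .gsub(/[^a-z0-9 ]/, ' ')                                            # drop punctuation
      .squeeze(' ').strip
end

def normalize_person(name)
  parts = name.to_s.downcase.gsub(/[^a-z ]/, '').split
  parts.reject { |p| p.length == 1 }.join(' ')   # drop middle initials
end

# Group filings that appear to refer to the same company.
def cluster_companies(filings)
  filings.group_by { |f| normalize_company(f['company_name']) }
end
```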
Reports
The end goal of analysis is to generate reports that can answer a user’s questions: What is a company doing? What technology is seeing an increase in activity? What brands have issues? The report level includes custom queries that organize the identified relationships around entities or dimensions such as companies, persons, industries and geographies.
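As a purely hypothetical example of a report-level query (embedded as a Ruby string, with invented table and column names), the HiveQL below ranks the people most frequently connected to companies in a given industry, the kind of Key Personnel insight listed earlier.

```ruby
# Hypothetical report query: key personnel for an industry.
KEY_PERSONNEL_REPORT = <<~'HIVEQL'
  SELECT p.person_id,
         p.full_name,
         COUNT(DISTINCT r.company_id) AS company_count
  FROM   person_company_relationships r
  JOIN   persons   p ON p.person_id  = r.person_id
  JOIN   companies c ON c.company_id = r.company_id
  WHERE  c.industry = 'pharmaceuticals'
  GROUP  BY p.person_id, p.full_name
  ORDER  BY company_count DESC
  LIMIT  50;
HIVEQL
```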
Search
The primary client of our data is our website, www.seravia.com. The web application allows users to search for and view the details of filings and reports. Both keyword and attribute-based search are important ways for users to locate specific filings. Keywords can be found in titles and descriptions, while common attributes include owner, type, jurisdiction and date. A design consideration of the data supply chain is that it must support these datatype-specific fields without requiring custom indexing code for each new datasource. Therefore, each datatype schema includes the fields to be indexed, so that we can launch new datasources in a plug-and-play manner.
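For illustration, a datatype schema of this kind might declare its searchable fields roughly as below (the structure and field names are assumptions, not Seravia's actual schema format), so that a generic indexer can distinguish keyword fields from filterable attributes without any datasource-specific code.

```ruby
# Hypothetical datatype schema declaring which fields are indexed and how.
TRADEMARK_SCHEMA = {
  'datatype' => 'trademark',
  'fields'   => [
    { 'name' => 'title',        'index' => 'keyword' },
    { 'name' => 'description',  'index' => 'keyword' },
    { 'name' => 'owner',        'index' => 'attribute' },
    { 'name' => 'jurisdiction', 'index' => 'attribute' },
    { 'name' => 'filing_date',  'index' => 'attribute' }
  ]
}

# A generic indexer can derive keyword vs. attribute fields from any schema.
def indexable_fields(schema)
  schema['fields'].group_by { |f| f['index'] }
end
```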
Due to the large corpus of government filings, performance and scalability are important qualities of the search engine. For this reason we use the open-source Sphinx search engine and MongoDB datastore, both clustered across Amazon EC2 instances. To support daily updates we have created custom modifications to handle incremental indexing.
Display
At the tail end of the data supply chain, each individual source filing is displayed on our website. Since the data has been converted to a standardized schema, there is not a one-to-one field correspondence between the source and output filings. There are business considerations involved in deciding how to display the final filing, as well as how to search-engine-optimize (SEO) the resulting page by creating meaningful titles, metadata and sections.
As with computationally intensive data querying, this type of data presentation falls somewhere between the skill sets of an engineer and a data analyst. Engineers often do not understand the data as well as the analysts do. Analysts often do not know how to write code for web applications.
This creates a risk that engineers, working from design documents, will spend a lot of time writing simple data retrieval and display code with custom business logic for each of the many datatypes and datasources. Such a solution would slow the deployment of new datasources and make future maintenance prohibitively expensive. A requirement of the data supply chain is therefore that it allow data analysts, not just engineers, to map the presentation of data in the application.
In our case, the individual filing is served from MongoDB and displayed using a custom templating system built on top of a Ruby on Rails application. This templating system allows data analysts to design dynamic web pages using YAML, a simple type of configuration file. The templates are designed at the same time as the datatype schema, further allowing new datasources to be launched in a plug-and-play manner.
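The sketch below conveys the idea with an invented YAML template format and a generic Ruby renderer; it is not Seravia's actual templating system, only an assumption of how an analyst-authored YAML page description might drive display.

```ruby
# Hypothetical YAML-driven display template and a generic renderer.
require 'yaml'

TEMPLATE = YAML.safe_load(<<~'TPL')
  title: "{company_name} - Company Registration"
  sections:
    - heading: Overview
      fields: [company_name, registration_date, status]
    - heading: Officers
      fields: [officers]
TPL

def render_field(doc, field)
  "#{field.tr('_', ' ').capitalize}: #{doc[field]}"
end

def render(template, doc)
  title = template['title'].gsub(/\{(\w+)\}/) { doc[Regexp.last_match(1)] }
  body  = template['sections'].map do |section|
    ([section['heading']] + section['fields'].map { |f| render_field(doc, f) }).join("\n")
  end
  ([title] + body).join("\n\n")
end
```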
Automation
Once the code for a new datasource has been developed, it is deployed into our Amazon infrastructure and becomes part of our general data supply chain. To ensure the freshness of the data and to scale to many datasources, the production data supply chain operates daily with little to no human involvement. All of the acquisition, ETL and analysis steps are architected to run on a continual basis in a fully automated mode.
The Controller is the component that manages and monitors the individual steps in the supply chain. It supports a modular architecture that allows acquisition, ETL and analysis steps to be chained together. It supports parallel processing to improve performance. Most importantly, it supports error handling mechanisms that allow data to flow uninterrupted while sandboxing problematic data for later handling.
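A stripped-down sketch of this kind of controller is shown below; the step interface and sandboxing approach are assumptions for illustration, not the production component.

```ruby
# Illustrative pipeline controller: chain steps, keep the data flowing, and
# quarantine records that raise errors for later inspection.
class Controller
  def initialize(steps)
    @steps   = steps      # e.g. callable acquisition, ETL and analysis steps
    @sandbox = []
  end

  def run(filings)
    filings.each do |filing|
      begin
        @steps.reduce(filing) { |data, step| step.call(data) }
      rescue StandardError => e
        @sandbox << { filing: filing, error: e.message }   # sandbox and continue
      end
    end
    @sandbox
  end
end
```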
All components of the data architecture run on Amazon EC2 instances. All data, from source filings through interim steps to the final JSON conversion, is stored in Amazon S3. The web application infrastructure (MongoDB clusters, Sphinx clusters, Ruby on Rails Mongrel servers, Apache HTTP servers) also runs on Amazon EC2 instances.