Data ingestion is the first step in a data pipeline: fetching data from one or more sources into a system where it can be stored and analyzed. Depending on the source, data can be ingested either in real time (streaming) or in batches.
With batch ingestion, data is collected and loaded in discrete chunks, and different batches can be processed concurrently. With streaming, as the name suggests, data is loaded into the target as soon as it arrives, in near real time.
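To make the streaming case concrete, here is a minimal sketch of a streaming ingestion loop using the kafka-python client; the topic name, broker address, and target table are hypothetical placeholders, and nothing in this section prescribes this particular stack.

```python
# Minimal streaming-ingestion sketch: consume events from a Kafka topic
# and load each record into a local SQLite table as soon as it arrives.
# Topic name, broker address, and table name are illustrative only.
import json
import sqlite3

from kafka import KafkaConsumer  # pip install kafka-python

db = sqlite3.connect("ingest.db")
db.execute("CREATE TABLE IF NOT EXISTS events (id TEXT, payload TEXT)")

consumer = KafkaConsumer(
    "orders",                             # hypothetical topic
    bootstrap_servers="localhost:9092",   # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Each message is written to the target immediately, i.e. near real time.
for message in consumer:
    event = message.value
    db.execute(
        "INSERT INTO events (id, payload) VALUES (?, ?)",
        (str(event.get("id")), json.dumps(event)),
    )
    db.commit()
```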
Several factors make data ingestion a complex process: the growing number and variety of data sources, the mix of structured and unstructured data, the speed at which data arrives, and the need to identify and capture changed data. A good data pipeline builds ingestion that can handle these challenges while also accounting for network latency, network bandwidth, and similar constraints.
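As one illustration of the changed-data challenge, a common batch pattern is incremental ingestion driven by a high-watermark column. The sketch below uses SQLite and an `updated_at` column purely as stand-ins (it assumes the source database already has an `orders` table); it is not tied to any specific tool mentioned in this section.

```python
# Incremental (changed-data) batch ingestion sketch using a high-watermark
# column. Table and column names (orders, updated_at) are illustrative only.
import sqlite3

source = sqlite3.connect("source.db")   # assumed to contain an orders table
target = sqlite3.connect("target.db")

target.execute(
    "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, "
    "amount REAL, updated_at TEXT)"
)
target.execute(
    "CREATE TABLE IF NOT EXISTS watermarks (table_name TEXT PRIMARY KEY, "
    "last_value TEXT)"
)

# Read the last watermark; fall back to the epoch on the first run.
row = target.execute(
    "SELECT last_value FROM watermarks WHERE table_name = 'orders'"
).fetchone()
last_seen = row[0] if row else "1970-01-01T00:00:00"

# Pull only the rows that changed since the previous batch.
changed = source.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? "
    "ORDER BY updated_at",
    (last_seen,),
).fetchall()

for id_, amount, updated_at in changed:
    target.execute(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
        (id_, amount, updated_at),
    )
    last_seen = updated_at

# Persist the new watermark so the next batch starts where this one ended.
target.execute(
    "INSERT OR REPLACE INTO watermarks (table_name, last_value) VALUES ('orders', ?)",
    (last_seen,),
)
target.commit()
```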
We are experienced with various data ingestion tools, both proprietary and open source. Some of the ETL and ingestion tools we work with are Talend, Pentaho Data Integrator, Apache Flume, Apache Flink, Apache Spark, Apache Kafka, Apache NiFi, Apache Sqoop, Kylo, etc.