Performance Improvements in Apache Drill

Posted by Satya Gopi, in Business Intelligence

Prerequisites: Apache Drill

A query we ran in Apache Drill was taking about 3 minutes to fetch just one column from a table. To overcome this, we used two performance improvements:

  1. Partition Pruning
  2. Parquet metadata caching

    Partition Pruning:

    Partition pruning allows a query engine to determine and retrieve the smallest dataset needed to answer a given query. Reading less data means fewer I/O cycles and fewer CPU cycles spent processing it.

    Example:

    -- Create a table partitioned on displayDate, airport_code, and location
    CREATE TABLE dfs.tmp.inputcontrolsinfo
    PARTITION BY (`displayDate`, airport_code, location) AS
    SELECT DISTINCT
        `displayDate`,
        fields[3].control.`modelvalue` AS airport_code,
        fields[4].control.`modelvalue` AS location
    FROM `observation`;
    

    The table above is partitioned on displayDate, airport_code, and location. We can now query it as below:

    SELECT * FROM dfs.tmp.inputcontrolsinfo;
    

    Partitioning works much like an index: when a query filters on the partition columns, Drill can skip the files that cannot contain matching rows.
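
As a rough sketch of how pruning comes into play, the query below filters on one of the partition columns and then asks Drill for its plan. The value 'JFK' is a hypothetical airport code chosen only for illustration; it does not come from the original dataset.

```sql
-- Filter on a partition column so Drill can skip non-matching files.
-- 'JFK' is a hypothetical value used only for illustration.
SELECT `displayDate`, airport_code, location
FROM dfs.tmp.inputcontrolsinfo
WHERE airport_code = 'JFK';

-- Inspect the plan to see how many files the scan will read.
EXPLAIN PLAN FOR
SELECT `displayDate`, airport_code, location
FROM dfs.tmp.inputcontrolsinfo
WHERE airport_code = 'JFK';
```

If pruning applies, the Parquet scan in the plan should cover fewer files than the table holds. The exact plan text varies by Drill version, so treat this as a way to check rather than a guaranteed output.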

    Parquet Metadata Caching:

    Drill can cache Parquet metadata. Once the metadata is cached, it can be refreshed as needed, depending on how frequently the datasets change in the environment.

    Command to generate the metadata cache:

    REFRESH TABLE METADATA dfs.tmp.inputcontrolsinfo ;
    

    You only have to run the REFRESH TABLE METADATA command against a table once to generate the initial metadata cache file. Thereafter, Drill automatically refreshes stale cache files when you issue queries against the table. An automatic refresh is triggered when data is modified. The query planner uses the timestamp of the cache file to decide whether the cache is stale.
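
Put together, the workflow might look like the sketch below: build the cache once, then query as usual. The date literal is a hypothetical value, and the type of `displayDate` in the source data is an assumption here.

```sql
-- One-time step: build the metadata cache for the partitioned table.
REFRESH TABLE METADATA dfs.tmp.inputcontrolsinfo;

-- Later queries use the cached metadata during planning.
-- Drill refreshes the cache itself if the table data has changed since.
-- '2017-01-01' is a hypothetical value used only for illustration.
SELECT airport_code, location
FROM dfs.tmp.inputcontrolsinfo
WHERE `displayDate` = '2017-01-01';
```

Running the refresh by hand again is only needed if you want to pay the rebuild cost up front rather than on the first query after a data change.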

    If you have any questions, please reach us at support@helicaltech.com

    Thanks,
    SatyaGopi
    BI Developer
    Helical IT Solutions Pvt
