Prerequisites: Apache Drill
We were running a query in Apache Drill that easily took 3 minutes to fetch just one column from a table. To overcome this, we used two performance improvements:
- Partition Pruning
- Parquet metadata caching
Partition Pruning:
Partition pruning lets a query engine determine and read only the smallest dataset needed to answer a given query. Reading less data means fewer I/O cycles and fewer CPU cycles spent processing it.
Example:
create table dfs.tmp.inputcontrolsinfo
partition by (`displayDate`, airport_code, location)
as select distinct
  `displayDate`,
  fields[3].control.`modelvalue` as airport_code,
  fields[4].control.`modelvalue` as location
from `observation`;
The table above is partitioned on displayDate, airport_code, and location. We can now run the query as below:
Select * from dfs.tmp.inputcontrolsinfo
Partitioning works much like an index: when a query filters on a partition column, Drill can skip the files that do not match the filter.
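Pruning only helps when the query filters on one of the partition columns. As a minimal sketch, the query below restricts the scan to a single airport_code, and the EXPLAIN PLAN FOR statement shows the plan so you can check how many files the scan touches. The filter value 'JFK' is a hypothetical example and is assumed to be a string; substitute a value that exists in your data.

```sql
-- Hypothetical filter value; replace 'JFK' with a value present in your data.
SELECT location
FROM dfs.tmp.inputcontrolsinfo
WHERE airport_code = 'JFK';

-- Show the query plan to check which files the scan reads.
EXPLAIN PLAN FOR
SELECT location
FROM dfs.tmp.inputcontrolsinfo
WHERE airport_code = 'JFK';
```

A query without a filter on a partition column, such as a plain SELECT *, still has to read every file in the table.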
Parquet Metadata Caching:
Drill can cache Parquet metadata so that the planner does not have to read the metadata from every Parquet file at query time. Once the metadata is cached, it can be refreshed as needed, depending on how frequently the datasets change in the environment.
Command to generate the metadata cache:
REFRESH TABLE METADATA dfs.tmp.inputcontrolsinfo ;
You only have to run the REFRESH TABLE METADATA command against a table once to generate the initial metadata cache file. Thereafter, Drill automatically refreshes stale cache files when you issue queries against the table. An automatic refresh is triggered when data is modified. The query planner uses the timestamps of the cache file and the table to determine whether the cache file is stale.
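One way to check whether a query is picking up the cache is to inspect its plan. The sketch below is a suggestion rather than a required step. It assumes the cache has already been generated for the table. In the Drill releases we are aware of, the Parquet scan entry in the plan text includes a field like usedMetadataFile=true when the cache is used; treat the exact field name as an assumption to verify on your version.

```sql
-- Inspect the plan for a single-column query against the cached table.
-- Look at the Parquet scan entry in the plan text for an indication
-- that the metadata cache file was used (field name may vary by version).
EXPLAIN PLAN FOR
SELECT airport_code
FROM dfs.tmp.inputcontrolsinfo;
```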
If you have any questions, please contact us at support@helicaltech.com
Thanks,
SatyaGopi
BI Developer
Helical IT Solutions Pvt