I have worked as a full-stack programmer at two banks: "Siberian Oil Bank" (6 years) and "Zapsibkombank" (2015-2020), in their state and analytical reporting departments. Since 2020 I have been a chief developer in the big data processing (ETL) department at Innotech.
2023-01-08
A good application is one built according to generally accepted conventions: its code is easy to read and easy to understand. Such an application could be set up by hand every time, but there is an automated way to do it: Maven.
Apache Maven is a build automation tool that assembles a project from a description of its structure in a POM (Project Object Model) file. When starting a new application, you can use ready-made templates from the Maven repository: pick an Archetype suited to your goals and generate a Scala, Java, or Android application from it, which will already contain the structure for automated building and testing. You can also create your own application Archetype and generate your own applications from it. This is useful when development is done in a team and all applications must follow a strictly defined structure.
You can learn how to create your own Maven archetype, and build an application from it, in my project on GitHub.
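As a sketch, generating a project from the standard quickstart archetype and then turning a project into a reusable archetype of its own might look like this (the groupId/artifactId values below are placeholders, not taken from my project):

```shell
# Generate a new application from the standard quickstart archetype
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DgroupId=com.example \
  -DartifactId=my-app \
  -DinteractiveMode=false

# Turn an existing project into an archetype,
# so the whole team can generate new applications from it
cd my-app
mvn archetype:create-from-project
```

The generated archetype then only needs to be installed or deployed to a repository to become available to the rest of the team.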
2022-11-30
My goal was to create a pipeline for loading data from a ClickHouse database into Hive.
A few words about this DBMS. ClickHouse stores large volumes of data in a columnar structure and supports data replication. It works with many input and output data formats, including TXT, CSV, JSON, and Parquet. Data transfer can be carried out over an SSL connection using a certificate, and ClickHouse offers several connection methods: the native TCP client, the HTTP interface, and JDBC/ODBC drivers.
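As an illustration, the simplest connection method, the HTTP interface, can be tried with plain curl (the host name is an assumption; 8123 is ClickHouse's default HTTP port):

```shell
# Send a query to ClickHouse over its HTTP interface
# (replace clickhouse.example.com with your server)
curl 'http://clickhouse.example.com:8123/' \
  --data-binary 'SELECT version()'
```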
The final solution was to use clickhouse-client: it takes an SQL query as input and produces a Parquet file as output, generated on the ClickHouse server side (roughly 11 times smaller than the TXT format). The resulting file was 4.5 GB, far less than the 50 GB that would have been transferred over the LAN as text. And since the output is Parquet, the data can then be processed with Spark.
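A minimal sketch of that export step, assuming a hypothetical table named events and a server reachable over SSL (host and table names are placeholders):

```shell
# Ask the ClickHouse server to render the result as Parquet
# and stream it straight into a local file
clickhouse-client --host clickhouse.example.com --secure \
  --query "SELECT * FROM events FORMAT Parquet" \
  > events.parquet
```

The file can then be read in Spark with `spark.read.parquet("events.parquet")` and written into Hive.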
There are discrepancies between ClickHouse and Spark data types, but these can all be resolved on the ClickHouse side by using the cast function.
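For example, if Spark cannot map a ClickHouse column type directly, the column can be cast to a Spark-friendly type in the export query itself (the table and column names here are hypothetical):

```shell
# Cast problematic columns to types Spark maps cleanly before export
clickhouse-client --host clickhouse.example.com --secure \
  --query "SELECT CAST(id AS Int64) AS id, CAST(amount AS Float64) AS amount FROM events FORMAT Parquet" \
  > events.parquet
```

This keeps all the type conversion on the ClickHouse side, so the Spark job can consume the Parquet file without any extra schema handling.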