I have worked as a full-stack programmer at two banks: "Siberian Oil Bank" (6 years) and "Zapsibkombank" (2015-2020), in their state and analytical reporting departments. Since 2020 I have been a chief developer in the big data processing (ETL) department at Innotech.
2023-01-08
A good application is one built according to generally accepted conventions: its code is easy to read and easy to understand. Such an application could be set up by hand every time, but there is an automated way to do it: Maven.
Apache Maven is a build automation tool that assembles a project from a description of its structure in a POM (Project Object Model) file. When starting a new application, you can use ready-made templates from the Maven repository: pick an Archetype suited to your goals and generate a Scala, Java, or Android application from it, which will already contain the structure for automated building and testing. You can also create your own application Archetype and generate your own applications from it. This is useful when development is done in a team and all applications must follow a strictly defined structure.
You can learn how to create your own Maven archetype, and build an application from it, in my project on GitHub.
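As a sketch, generating a project from the standard quickstart archetype and then turning a project into a reusable archetype of its own might look like this (the groupId/artifactId values below are placeholders, not taken from my project):

```shell
# Generate a new application from the standard quickstart archetype
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.maven.archetypes \
  -DarchetypeArtifactId=maven-archetype-quickstart \
  -DgroupId=com.example \
  -DartifactId=my-app \
  -DinteractiveMode=false

# Turn an existing project into an archetype,
# so the whole team can generate new applications from it
cd my-app
mvn archetype:create-from-project
```

The generated archetype then only needs to be installed or deployed to a repository to become available to the rest of the team.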
2022-11-30
My goal was to create a pipeline for loading data from a ClickHouse database into Hive.
A few words about this DBMS. ClickHouse stores large volumes of data in a columnar structure and supports data replication. It works with many input and output data formats, including TXT, CSV, JSON, and Parquet. Data transfer can be carried out over an SSL connection using a certificate, and ClickHouse offers several connection methods: the native TCP client, the HTTP interface, and JDBC/ODBC drivers.
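As an illustration, the simplest connection method, the HTTP interface, can be tried with plain curl (the host name is an assumption; 8123 is ClickHouse's default HTTP port):

```shell
# Send a query to ClickHouse over its HTTP interface
# (replace clickhouse.example.com with your server)
curl 'http://clickhouse.example.com:8123/' \
  --data-binary 'SELECT version()'
```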
The final solution was to use clickhouse-client: it takes an SQL query as input and produces a Parquet file as output, generated on the ClickHouse server side (roughly 11 times smaller than the TXT format). The resulting file was 4.5 GB, far less than the 50 GB that would have been transferred over the LAN as text. And since the output is Parquet, the data can then be processed with Spark.
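A minimal sketch of that export step, assuming a hypothetical table named events and a server reachable over SSL (host and table names are placeholders):

```shell
# Ask the ClickHouse server to render the result as Parquet
# and stream it straight into a local file
clickhouse-client --host clickhouse.example.com --secure \
  --query "SELECT * FROM events FORMAT Parquet" \
  > events.parquet
```

The file can then be read in Spark with `spark.read.parquet("events.parquet")` and written into Hive.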
There are discrepancies between ClickHouse and Spark data types, but these can all be resolved on the ClickHouse side by using the cast function.
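For example, if Spark cannot map a ClickHouse column type directly, the column can be cast to a Spark-friendly type in the export query itself (the table and column names here are hypothetical):

```shell
# Cast problematic columns to types Spark maps cleanly before export
clickhouse-client --host clickhouse.example.com --secure \
  --query "SELECT CAST(id AS Int64) AS id, CAST(amount AS Float64) AS amount FROM events FORMAT Parquet" \
  > events.parquet
```

This keeps all the type conversion on the ClickHouse side, so the Spark job can consume the Parquet file without any extra schema handling.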