Create multiple Flows for an unlimited number of use cases
Data Flow consists of a point-and-click user interface enabling you to create various flows to share granular data in the right format with the right stakeholders.
Subscribing to Data Flow
Data Flow is a paid option that can be activated on your organization.
To access Data Flow, reach out to your account manager via your administrator, or contact our consulting team.
Interface
Creating a new data flow
Flows can be configured directly from our Explorer interface in the form of scheduled exports.
Access the scheduled exports app via Explorer’s drop-down list.
A dedicated interface will enable you to configure the following aspects of your data flows:
- Cross-site: Select a site, several sites or all the sites belonging to your organization
- Properties: Choose which properties you wish to include in your exports (standard and custom).
- Format: Choose the type of separator and the export format: CSV, JSON or Parquet.
- Schedule: Choose the export frequency: every 15, 30 or 60 minutes.
- Export location: Choose the export destination: sFTP or Amazon S3.
Accessing the Data Flow interface
To connect to the Data Flow interface, go to the Export app:
You will be able to create a scheduled export directly from the export interface.
Creation of a Data Flow export
Click on "Create a Data Flow export" to create a new scheduled data export.
You will then go through the following two-step process:
STEP 1: Flow contents
- The scope of your export: Do you wish to retrieve information relating to all your sites, or only one or a subset of your websites?
- Activation of the privacy filter: You can activate the privacy filter in order to comply with regulations. With the privacy filter activated, data from users who have not granted their consent is excluded. If you decide not to activate the privacy filter, you will have to purge some of the data you receive after ingesting all the Piano Analytics data, in order to comply with your country's regulations (a purge sketch follows this list). The privacy filter is based on the property value "visitor_privacy_consent = false".
- Customizing your data flows: Do you want to exclude specific properties, or do you prefer to export all your existing and future properties?
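If you choose not to activate the privacy filter, the purge mentioned above can be performed on your side after ingestion. Here is a minimal sketch using Python and pandas, assuming the file has been exported as CSV and that visitor_privacy_consent is present as a column; the file names are illustrative.

```python
import pandas as pd

# Illustrative file name: a Data Flow CSV ingested without the privacy filter.
events = pd.read_csv("data_flow_export.csv")

# Exclude events from users who have not granted consent, i.e. rows where
# visitor_privacy_consent = false (the value may arrive as a boolean or as text).
consented = events[events["visitor_privacy_consent"].astype(str).str.lower() != "false"]

consented.to_csv("data_flow_export_consented.csv", index=False)
```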
STEP 2: Flow configuration
- Flow name: Name your data flow.
- Flow format: Choose your export format: CSV, JSON or Parquet. The configuration options change automatically for each format you select. Please note that JSON is actually NDJSON (Newline Delimited JSON); please refer to http://ndjson.org/ for more information (a reading sketch follows this list).
- Flow frequency: Choose the export frequency: every 15, 30 or 60 minutes.
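Since JSON exports are NDJSON, each line of the file is a standalone JSON object and can be read line by line. A minimal Python sketch, with an illustrative file name (files are delivered gzipped, see the Format section below):

```python
import gzip
import json

# Illustrative file name for an NDJSON (Newline Delimited JSON) Data Flow export.
with gzip.open("myExport_data.json.gz", "rt", encoding="utf-8") as stream:
    for line in stream:
        event = json.loads(line)  # one JSON object per line
        # Properties with a null value are omitted from the event object,
        # so prefer .get() over direct indexing.
        print(event.get("event_id"))
```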
Configuring Amazon S3 connections
Configuring Amazon S3 connections is a straightforward process.
To start, select "Amazon S3" as the sending mode when setting up your scheduled export.
Then select "create a new connection":
You will be prompted with the following configuration window. In order for Piano Analytics to know where to push the files and to have the necessary rights to send them, we need the following information:
- S3 Configuration Name - the information requested here is for your reference. It will enable you to find and edit this connection among the other Amazon S3 connections you may have configured with Piano Analytics
- Bucket Name - we need the name of your Amazon S3 bucket. This is a unique identifier specific to your bucket, which enables us to send the files to the right place
- Destination Folder - if you wish to send the exports to a location other than the main bucket folder, you can specify the path here; sub-folders should be separated by "/"
- Access configuration rule - Access to the customer's S3 storage bucket is done through AWS role authentication. Piano Analytics will provide the customer with the ARN of a role to authorize on its bucket in order to allow write access to deposit files. The customer is responsible for setting up the "Bucket Policy" giving access to the Piano Analytics role (a sketch of such a policy follows these steps).
- Test the connection - when you press the "test the connection" button, we will send a small file over to your Amazon S3 bucket, which enables us to confirm that the information you have provided is correct and that the connection has been set up correctly
& hit save!
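As an illustration of the "Bucket Policy" mentioned above, here is a minimal sketch in Python that builds such a policy. The role ARN, bucket name and set of allowed actions are placeholders: use the ARN provided by Piano Analytics and follow their instructions for the exact permissions to grant.

```python
import json

# Placeholders: replace with the role ARN provided by Piano Analytics
# and with your own bucket name.
PIANO_ROLE_ARN = "arn:aws:iam::123456789012:role/piano-analytics-export"
BUCKET_NAME = "my-analytics-exports"

# Minimal bucket policy granting the Piano Analytics role write access.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPianoAnalyticsWrite",
            "Effect": "Allow",
            "Principal": {"AWS": PIANO_ROLE_ARN},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/*",
        }
    ],
}

# Paste the resulting JSON into the bucket's "Bucket Policy" editor in the AWS
# console, or apply it with boto3:
#   boto3.client("s3").put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(bucket_policy))
print(json.dumps(bucket_policy, indent=2))
```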
When you return to the configuration, hit the refresh button to refresh the Amazon S3 connection list. You should see the connection you've just configured appear in the drop-down list.
Configuring FTP or sFTP connections
Select "FTP" as your send method when setting up to your Export.
Then select "create a new connection":
You will be prompted with the following configuration window. In order to know where to push the files and to have the necessary rights to send them, Piano Analytics needs the following information:
- Name of the FTP configuration - the information requested here is for your reference. It will enable you to find and edit this connection among the other FTP connections you may have configured with Piano Analytics
- IP/Server Name - we need the IP address or server name of your FTP server
- Protocol - in this section you will be able to specify which protocol you wish to use: FTP or sFTP
- Port - In this section you are able to specify which port you would like to use.
- Login & Password - For us to deposit the file we need to have valid login & password details with sufficient access rights
- Destination Folder - If you wish to send the exports to a location other than the main folder, you can specify the path here; sub-folders should be separated by "/"
- Testing the connection - when you press the "test the connection" button, we will send a small file over to the FTP, which enables us to confirm that the information you have provided is correct and that the connection has been set up correctly.
& hit save!
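If you want to double-check the server name, port, credentials and destination folder on your side before entering them, here is a minimal sketch using the Python paramiko library; all connection details are placeholders.

```python
import paramiko

# Placeholder connection details: replace with your own server, port,
# credentials and destination folder.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="piano_export", password="change-me")

sftp = paramiko.SFTPClient.from_transport(transport)
sftp.chdir("exports/piano")  # raises an error if the destination folder does not exist
print("sFTP connection and destination folder OK")

sftp.close()
transport.close()
```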
Do you need to whitelist IP addresses for security reasons?
You can contact our support center to get the most recent IP addresses to whitelist.
Data Availability
Some data will not be available in Data Flow. The reason is simple: these properties are calculated "on the fly" and are not stored. For example, we don't know in advance whether an ongoing visit will be considered a "bounce" visit. Because of this, the property "visit_bounce" can't be included in Data Flow.
Scheduling
The Data Flow export scheduling system is not based on complete time periods. Every time a file is sent (every 15, 30 or 60 minutes), all new data inserted in the database since the last export is sent. Because of this, certain events relating to an earlier file may be included in the latest file. This process ensures that we do not miss out on any events.
Data Flow exports are valid for 12 months, at which point they will expire. You will receive an email notification 1 month prior to the expiration date reminding you to extend the export.
Data Partitioning
Data Flow includes a new feature: partitioning. Since files do not cover complete time periods, partitioning allows you to organize your files according to the time period they contain. This period is based on the UTC date of data collection (hit_time_utc property).
You can choose your level of partitioning:
- Date
- Date / Time
- Date / Time / Half hour
- Date / Time / Quarter hour
Let's take a concrete example. Imagine that you have created an export with date/time level partitioning. The export is called myExport and it is sent to a folder called "myFolder".
The content of the table at the time of extraction is as follows:
hit_time_utc | event_id | ...
08/04/2021 10:58:14 | 192746238290 | ...
08/04/2021 10:59:27 | 192746857489 | ...
08/04/2021 11:00:28 | 387362092413 | ...
08/04/2021 11:01:45 | 238290192746 | ...
The dataset spans across two hour intervals, so it will be split into two different files. In the case of an S3 export destination:
myFolder/myExport/data/date=2021-04-08/hour=10/data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz
and
myFolder/myExport/data/date=2021-04-08/hour=11/data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz
The first file will contain the extracted events with hit_time_utc property values between 10:00:00 and 10:59:59; the second file will contain events with hit_time_utc property values between 11:00:00 and 11:59:59. If, in the next iteration, we identify new events belonging to the 11:00:00 - 11:59:59 time interval, a new file will be created in the /hour=11/ folder.
In the case of a (s)FTP upload, the logic is similar; only the file names change. As folders cannot be created dynamically, it is the file names themselves that carry the time interval information:
myFolder/myExport_2021-04-08_10_data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz
myFolder/myExport_2021-04-08_11_data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz
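As an illustration, the following minimal Python sketch recovers the partition (date and hour) from either layout; the patterns are assumptions matching only the date/time partitioning level used in this example.

```python
import re

# Folder-based layout used for S3, e.g. .../date=2021-04-08/hour=10/...
S3_PATTERN = re.compile(r"date=(\d{4}-\d{2}-\d{2})/hour=(\d{2})/")
# File-name-based layout used for (s)FTP, e.g. myExport_2021-04-08_10_data_...
FTP_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{2})_data")

def partition_of(path: str):
    """Return (date, hour) for a Data Flow file path, or None if no partition is found."""
    match = S3_PATTERN.search(path) or FTP_PATTERN.search(path)
    return match.groups() if match else None

print(partition_of("myFolder/myExport/data/date=2021-04-08/hour=10/data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz"))
# -> ('2021-04-08', '10')
print(partition_of("myFolder/myExport_2021-04-08_11_data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz"))
# -> ('2021-04-08', '11')
```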
Format
Once your flow is created, you will receive the files generated by Data Flow directly on your Amazon S3/sFTP server. Each generated file is compressed in GZ format. The file names follow a naming convention managed by Snowflake, which appears at the end of the file name as follows:
data_#GUID#_#NumberingSnowFlake#.format
You will find examples in the previous paragraph.
In order to speed up the processing, Snowflake executes the request in different parts on several machines. Because of this, you may receive multiple files for the same time interval.
Important:
Regarding CSV exports, we recommend that you do not base your processing on column indexes or property names but on property keys, as column indexes and property names are subject to change.
In JSON exports, properties with a null value won't appear for the relevant event.
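For CSV exports, you can read rows by property key rather than by column position, as in the following minimal sketch; it assumes a comma separator, that the header row contains the property keys, and reuses one of the file names from the partitioning example above.

```python
import csv
import gzip

# Read a gzipped CSV Data Flow file and access values by property key
# (the column header), not by column index, since indexes may change.
with gzip.open(
    "data_36_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz",
    "rt",
    encoding="utf-8",
) as stream:
    for row in csv.DictReader(stream):  # adjust delimiter= if you chose another separator
        print(row["event_id"], row["hit_time_utc"])
```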
Delivery report
After each file generation on your Amazon S3 bucket / sFTP, you will also receive a ".report" file containing the names of all the generated files you have just received.
If you receive a file whose name is not included in any delivery report, then this file is not complete and you should not take it into account.
Receiving this delivery report validates the generation and delivery of exports for a given time period.
E.g.: at 10:20 UTC, a generation is launched, resulting in three generated data files:
- myFolder/myExport/data/date=2021-10-31/hour=09/min=45-60/data_019ac5aa-3252-4738-0000-03d5ba4052ba_001_0_0.csv.gz
- myFolder/myExport/data/date=2021-10-31/hour=10/min=00-15/data_049fc5da-3372-8338-1200-08f7bc4801cz_001_0_0.csv.gz
- myFolder/myExport/data/date=2021-10-31/hour=10/min=00-15/data_049fc5da-3372-8338-1200-08f7bc4801cz_002_0_0.csv.gz
Once these three files have been entirely sent to your S3 bucket / sFTP, a delivery report named #timestamp#_delivery.report will be sent to you. It will contain all three delivered file names and confirms that you can import their content with confidence.
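A minimal sketch of this check in Python, assuming the delivery report lists one delivered file name per line (the exact report layout may differ):

```python
def files_safe_to_ingest(report_path: str, received_files: list[str]) -> list[str]:
    """Keep only the received files whose names appear in the delivery report."""
    with open(report_path, "rt", encoding="utf-8") as report:
        delivered = {line.strip() for line in report if line.strip()}
    # Any file not listed in a delivery report may be incomplete: skip it for now.
    return [name for name in received_files if name in delivered]
```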
History generation
Regeneration exports relating to past time intervals will be based on the same level of partitioning as your production exports. The regeneration can be done in a separate folder.
In order to generate datasets for past time periods from date X to date Y while minimizing the risk of duplicates or data gaps, here is the procedure to follow:
- If you have not already done so, set your flow live with the desired frequency.
- At Y+1, delete the data you have already ingested from your database with a query such as: DELETE FROM #mytable# WHERE hit_time_utc < Y+1
- The Piano Analytics team will create a past interval regeneration ticket for the Export team, specifying:
- Your customer account on which the production export was created
- The name of your export
- The exact period to be regenerated (X to Y)
It is possible that some flows in production may contain duplicates. This is not inherent to Data Flow but to data collection. These duplicates also appear in Data Query on the same day, in real time. Duplicates are removed at D+1 in the real-time data table, but this has no impact on Data Flow files already consumed.
In case of a real-time delay, events not included in a file will be included in the following files, always based on their date of insertion in the table. In that case, you may receive smaller files than usual over a period, then larger files in the following iteration(s).
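If you prefer to remove such duplicates on your side, here is a minimal sketch assuming that the event_id property uniquely identifies an event across files:

```python
import pandas as pd

def deduplicate(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate Data Flow files already loaded as DataFrames and drop duplicate events."""
    events = pd.concat(frames, ignore_index=True)
    # Keep the first occurrence of each event_id; later files may re-deliver
    # events already received (duplicates or real-time delays).
    return events.drop_duplicates(subset="event_id", keep="first")
```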