Create multiple Flows for an unlimited number of use cases
Data Flow provides a point-and-click user interface that enables you to create various flows, sharing granular data in the right format with the right stakeholders.
Subscribing to Data Flow
Data Flow is a paid option that can be activated on your organization.
To access Data Flow, reach out to your account manager via your administrator, or contact our consulting team.
Creating a new data flow
Flows can be configured directly from our Explorer interface in the form of scheduled exports.
Access the scheduled exports app via Explorer’s drop-down list.
A dedicated interface will enable you to configure the following aspects of your data flows:
- Cross-site: Select a single site, several sites or all the sites belonging to your organization.
- Properties: Choose which properties you wish to include in your exports (standard and custom).
- Format: Choose the type of separator and the export format: CSV, JSON or Parquet.
- Schedule: Choose the export frequency: 15, 30 or 60 minutes.
- Export location: Choose the destination of the export: sFTP or Amazon S3.
Accessing the Data Flow interface
To connect to the Data Flow interface, go to the Export app:
You will be able to create a scheduled export directly from the export interface.
Creation of a Data Flow export
Click on "Create a Data Flow export" to create a new scheduled data export.
You will be taken through the following two-step process:
STEP 1: Flow contents
The scope of your export
Do you wish to retrieve information relating to all your sites, or only to one or a subset of your websites?
Activation of the privacy filter
You can activate the privacy filter in order to comply with regulations. With the privacy filter activated, data from users who have not granted their consent is excluded. If you decide not to activate the privacy filter, you will receive all the Piano Analytics data and will then have to purge part of it in order to comply with your country's regulations. The privacy filter is based on the property value "visitor_privacy_consent = false".
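If you receive unfiltered data, the purge described above can be sketched as follows. This is a minimal illustration, assuming exported events are dictionaries keyed by Piano Analytics property names; only the `visitor_privacy_consent` property comes from the documentation, the sample event fields are hypothetical.

```python
def purge_non_consented(events):
    """Drop events matching the privacy filter condition
    (visitor_privacy_consent = false); keep everything else."""
    return [e for e in events if e.get("visitor_privacy_consent") is not False]
```

If the privacy filter is activated on the flow itself, this post-processing step is unnecessary.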
Customizing your data flows
Do you want to exclude specific properties or do you prefer to export all your existing and future properties?
STEP 2: Flow configuration
Name your data flow
Choose your export format: CSV, JSON or Parquet. The configuration options change automatically for each format you select.
Please note that JSON exports are actually NDJSON (Newline Delimited JSON). Please refer to http://ndjson.org/ for more information.
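Because the format is NDJSON, each line of the file is a standalone JSON object; the file as a whole is not a JSON array. A minimal reader therefore parses line by line:

```python
import json

def read_ndjson(text):
    """Parse an NDJSON payload: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Passing the whole file to a standard JSON parser would fail, since NDJSON is not a single valid JSON document.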
Configuring Amazon S3 connections
You can find all the technical documentation to configure Amazon S3 connections here.
Configuring FTP or sFTP connections
You can find all the technical documentation to configure FTP/sFTP connections here.
If you need to send your exports to an FTP/sFTP server, we advise you to use a 15 or 30 minute export frequency so as to limit transferred file sizes.
Some data will not be available in Data Flow. The reason is simple: these properties are calculated "on-the-fly" and are not stored. For example, we cannot know in advance whether an ongoing visit will be considered a "bounce" visit. Because of this, the property "visit_bounce" cannot be included in Data Flow.
The list of the properties not available in Data Flow can be found here.
The Data Flow export scheduling system is not based on complete time periods. Every time a file is sent (every 15, 30 or 60 minutes), all new data inserted into the database since the last export is sent. As a result, certain events relating to an earlier file may be included in the latest file. This process ensures that no events are missed.
Data Flow includes a new feature: partitioning. Since files do not cover complete time periods, partitioning allows you to organize your files according to the time period they contain. This period is based on the UTC date of data collection (hit_time_utc property).
You can choose your level of partitioning:
- Date / Time
- Date / Time / Half hour
- Date / Time / Quarter hour
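Under stated assumptions, the partition prefix for a given event could be derived from its `hit_time_utc` value like this. The `hour=` segment matches the `/hour=11/` folder shown in the S3 example below; the `date=`, `half=` and `quarter=` segment names are illustrative assumptions, not the documented convention.

```python
from datetime import datetime

def partition_path(hit_time_utc, level):
    """Build a partition prefix from a UTC timestamp.
    level: 'hour', 'half_hour' or 'quarter_hour' (illustrative names)."""
    t = datetime.fromisoformat(hit_time_utc)
    parts = [f"date={t:%Y-%m-%d}", f"hour={t.hour:02d}"]
    if level == "half_hour":
        parts.append(f"half={(t.minute // 30) * 30:02d}")   # 00 or 30
    elif level == "quarter_hour":
        parts.append(f"quarter={(t.minute // 15) * 15:02d}")  # 00, 15, 30 or 45
    return "/".join(parts)
```

Finer partitioning levels simply add one more segment to the prefix; the date and hour segments are always present.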
Let's take a concrete example. Imagine that you have created an export called myExport with date/time-level partitioning, sent to a folder "myFolder".
The content of the table at the time of extraction is as follows:
The dataset spans two hour-long intervals, so it will be split into two different files. In the case of an S3 export destination:
The first file will contain the extracted events with hit_time_utc values between 10:00:00 and 10:59:59; the second file will contain events with hit_time_utc values between 11:00:00 and 11:59:59. If in the next iteration we identify new events belonging to the 11:00:00 - 11:59:59 time interval, a new file will be created in the /hour=11/ folder.
In the case of a (s)FTP upload, the logic is similar; only the file names change. As folders cannot be dynamically created, the file names themselves carry the time interval information:
Once your flow is created, you will receive the files generated by Data Flow directly on your Amazon S3/sFTP server. Each generated file is compressed in GZ format. File names follow a naming convention managed by Snowflake, which appears at the end of the file name as follows:
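Since every delivered file is GZ-compressed, it must be decompressed before parsing. A minimal sketch, assuming the file content is UTF-8 text (CSV or NDJSON):

```python
import gzip

def decompress_export(blob: bytes) -> str:
    """Decompress a GZ-compressed Data Flow file into UTF-8 text."""
    return gzip.decompress(blob).decode("utf-8")
```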
You will find examples in the previous paragraph.
To speed up processing, Snowflake splits the request into several parts executed on several machines. Because of this, you may receive multiple files for the same time interval.
Regarding CSV exports, we recommend that you base your processing on property keys rather than on column indexes or property names, as column indexes and property names are subject to change.
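One way to follow this recommendation is to read rows keyed by the header row instead of by position. This sketch assumes the CSV header contains the property keys and uses a semicolon separator; adjust `delimiter` to whatever separator you configured for the flow.

```python
import csv
import io

def rows_by_property_key(csv_text, delimiter=";"):
    """Return rows as dicts keyed by the header's property keys,
    so reordered or renamed columns do not break downstream processing."""
    return list(csv.DictReader(io.StringIO(csv_text), delimiter=delimiter))
```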
In JSON exports, properties with a null value won't appear for the relevant event.
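Because null-valued properties are simply absent from the JSON event, read them defensively instead of assuming every key exists. A minimal sketch:

```python
def property_value(event, key):
    """Return a property's value, or None when it was null and
    therefore omitted from the exported JSON event."""
    return event.get(key)
```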
After each file generation on your Amazon S3 bucket / sFTP server, you will also receive a ".report" file containing the names of all the generated files you have just received.
If you receive a file whose name is not included in any delivery report, that file is not complete and you should not take it into account.
Receiving this delivery report validates the generation and delivery of exports on a given time period.
E.g.: at 10:20 UTC a generation is launched, resulting in three generated data files:
Once these three files have been entirely sent to your S3 bucket / sFTP server, a delivery report named #timestamp#_delivery.report will be sent to you. It will contain all three delivered file names, confirming that you can import their content with confidence.
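The validation step described above can be sketched as follows: only ingest files whose names appear in a delivery report, and ignore anything else as incomplete. The report is assumed here to list one file name per line, which is an assumption about its layout, not a documented format.

```python
def files_safe_to_ingest(report_lines, received_files):
    """Cross-check received files against the delivery report;
    files absent from every report must be treated as incomplete."""
    listed = {line.strip() for line in report_lines if line.strip()}
    return sorted(name for name in received_files if name in listed)
```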
Regenerated exports relating to past time intervals will be based on the same level of partitioning as your production exports. The regeneration can be delivered to a separate folder.
In order to be able to generate datasets for past time periods from date X to date Y and to minimize the risk of duplicates or data gaps, here is the procedure to follow:
- If you have not already done so, set your flow live with the desired frequency.
- At Y+1, delete the data you have already ingested from your database, with a query such as: DELETE FROM #mytable# WHERE hit_time_utc < Y+1
- The Piano Analytics team will create a past interval regeneration ticket for the Export team specifying :
- Your customer account on which the production export was created
- The name of your export
- The exact period to be regenerated (X to Y)
It is possible that some flows in production may contain duplicates. This is not inherent to Data Flow but to data collection. These duplicates also appear in Data Query on the same day, in real time. Duplicates are removed on D+1 in the real-time data table, but this has no impact on Data Flow files that have already been consumed.
In case of a real-time delay, events not included in a file will be included in the following files, always based on their date of insertion into the table. In that case, you may receive smaller files than usual over one period, then larger files in the following iteration(s).