
Explanation on File Schema

BladePipe supports file-type data sources. When you add one, you can define how its files are parsed. This page describes that format definition to help you integrate file data.

File Ingestion Process

BladePipe supports three file-type data sources: SshFile, OssFile, and S3File. SshFile covers files reached remotely over SSH or locally on a POSIX file system, OssFile covers files on Alibaba Cloud OSS, and S3File covers files on AWS S3.

When adding any of these three data sources, you can configure the following additional parameters:

| Parameter | Description |
| --- | --- |
| dbsJson | Defines the schema used to parse files (explained in the dbsJson Parsing section). |
| defaultLineSchemaJson | Fallback used when no file schema is found in dbsJson (explained in the defaultLineSchemaJson Parsing section). By default, it takes the first column of each row; if a text file is not split by colSep, the entire row is treated as one column. |
| fileSuffixArray | Filters files by suffix. Multiple suffixes can be listed, separated by commas. If it is empty, all files (excluding directories) are fetched. EMPTY is a special suffix that selects files with no suffix; these are processed as text files by default. |
| withMetaFields | Whether to include meta information about the file that each piece of text comes from. If true, two columns are added: __cc_src_file varchar(512) DEFAULT NULL and __cc_src_url varchar(512) DEFAULT NULL. |
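
The fileSuffixArray rules can be sketched in a few lines of Python. This is only an illustration of the behavior described in the table, not BladePipe's implementation: the helper names `parse_suffix_array` and `is_selected` are hypothetical, and writing suffixes with a leading dot (such as `.csv`) is an assumption.

```python
import os
from typing import Set


def parse_suffix_array(file_suffix_array: str) -> Set[str]:
    """Split a comma-separated suffix list; an empty string yields an empty filter."""
    return {s.strip() for s in file_suffix_array.split(",") if s.strip()}


def is_selected(file_name: str, suffixes: Set[str]) -> bool:
    """Return True if file_name passes the suffix filter."""
    if not suffixes:
        # Empty filter: every regular file is accepted.
        return True
    ext = os.path.splitext(file_name)[1]  # ".csv", or "" when there is no suffix
    if ext == "":
        # Files without a suffix are selected only by the special EMPTY entry.
        return "EMPTY" in suffixes
    return ext in suffixes  # assumption: suffixes are listed with a leading dot


suffixes = parse_suffix_array(".csv, .log, EMPTY")
selected = [f for f in ["a.csv", "b.json", "README"] if is_selected(f, suffixes)]
```

With this filter, `a.csv` and `README` are kept and `b.json` is dropped.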

When accessing files through any of these three data sources, BladePipe splits each file according to its file type, then looks up the matching file schema in the value of the dbsJson parameter to process the data (column selection and type conversion).

BladePipe determines the file type by the file suffix. It supports the following file suffixes:

  • Text (EMPTY, .log, .sql, .txt, .java, .js, .cpp, .c)
  • Markdown (.md)
  • Json (.json)
  • Excel (.xls, .xlsx)
  • CSV (.csv)
Info: Markdown and Json files are currently processed as Text files (read line by line), while Excel and CSV files are parsed in their standard formats. For CSV, the header row can be skipped by setting the skipRows parameter.
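
The suffix list and the note above can be summarized as a lookup table. The sketch below is illustrative only: the function names `file_type` and `read_mode` and the labels `"line-by-line"` and `"structured"` are made up for this example, and exact, case-sensitive suffix matching is an assumption.

```python
import os

# Suffix-to-type table copied from the list above; "" stands for the EMPTY suffix.
SUFFIX_TO_TYPE = {
    "": "Text", ".log": "Text", ".sql": "Text", ".txt": "Text",
    ".java": "Text", ".js": "Text", ".cpp": "Text", ".c": "Text",
    ".md": "Markdown",
    ".json": "Json",
    ".xls": "Excel", ".xlsx": "Excel",
    ".csv": "CSV",
}

# Markdown and Json are currently read line by line, like Text.
LINE_BASED = {"Text", "Markdown", "Json"}


def file_type(file_name):
    """Return the file type for a name, or None for an unsupported suffix."""
    ext = os.path.splitext(file_name)[1]
    return SUFFIX_TO_TYPE.get(ext)


def read_mode(file_name):
    """Return how the file would be read: line by line or in its standard format."""
    kind = file_type(file_name)
    if kind is None:
        return None
    return "line-by-line" if kind in LINE_BASED else "structured"
```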

dbsJson Parsing

dbsJson is the file format definition and can be applied to all text files. It serves three purposes: migrating schemas to structured targets (such as relational databases and data warehouses), selecting only the required columns, and converting types.

  • Example:

    [
      {
        "db": "cc_virtual_fs",
        "schemas": [
          {
            "schema": "/Users/zylicfc/hudi_data",
            "tables": [
              {
                "table": "/Users/zylicfc/hudi_data/2025-03-25-14-52-29_EXPORT_CSV_18356972_222_worker_stats_0.csv",
                "columns": [
                  {
                    "column": "id",
                    "jdbcType": -5,
                    "typeName": "bigint",
                    "precise": 20,
                    "idx": 0
                  },
                  {
                    "column": "gmt_create",
                    "jdbcType": 93,
                    "typeName": "timestamp",
                    "idx": 1
                  },
                  {
                    "column": "worker_id",
                    "jdbcType": -5,
                    "typeName": "bigint",
                    "precise": 20,
                    "idx": 3
                  },
                  {
                    "column": "decimal_col_new",
                    "jdbcType": 3,
                    "typeName": "decimal",
                    "precise": 10,
                    "scale": 2,
                    "idx": 16
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  • Explanation:
    • schema: Required for SshFile; it must be selected when creating a DataJob. For OssFile and S3File, it can be empty and needs to be filled in only when file column information must be defined.
    • tables: Can be empty; fill it in only when file column information must be defined.
    • jdbcType: The integer constant from Java's java.sql.Types that corresponds to the column type (for example, -5 for BIGINT, 93 for TIMESTAMP, and 3 for DECIMAL). For details, refer to the Explanation on Java Types.
    • typeName: The type name. For the allowed values, refer to Supported TypeNames of File Schema.
    • precise: Can be empty for boolean (BOOLEAN), integer (INTEGER, BIGINT), long text (TEXT), and time (DATE, TIME, TIMESTAMP) types. Must be specified for string (VARCHAR), decimal (DECIMAL), and binary (BYTES) types.
    • scale: Must be specified for the decimal (DECIMAL) type. Can be empty for all other types.
    • colSep: The separator used to split each line of a text file into columns. If it is empty, the task-level colSep parameter is used. The separator does not apply to Excel and other non-text files. Special characters ($, \, etc.) must be escaped with \.
    • idx: The column position in the parsed row, starting from 0. You can list only some positions to filter the parsed columns (for example, idx 2 is omitted in the example above, so that column is not read).
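
To make column selection by idx and type conversion by typeName concrete, here is a minimal Python sketch that applies a `columns` list to one line of text. It is not BladePipe's parser. The `parse_line` helper and the sample line are hypothetical, a plain `str.split` stands in for real CSV parsing, the timestamp format follows the sample in the appendix, and turning empty or missing fields into `None` is an assumption.

```python
from datetime import datetime
from decimal import Decimal

# Converters keyed by lower-case typeName (only the types used below are covered).
CONVERTERS = {
    "integer": int,
    "bigint": int,
    "decimal": Decimal,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
    "varchar": str,
    "text": str,
}


def parse_line(line, columns, col_sep=","):
    """Split one line by col_sep, keep only the listed idx positions, convert types."""
    fields = line.split(col_sep)
    row = {}
    for col in columns:
        idx = col["idx"]
        raw = fields[idx] if idx < len(fields) else ""
        convert = CONVERTERS[col["typeName"].lower()]
        # Assumption: empty or missing fields become None.
        row[col["column"]] = convert(raw) if raw != "" else None
    return row


columns = [
    {"column": "id", "jdbcType": -5, "typeName": "bigint", "precise": 20, "idx": 0},
    {"column": "gmt_create", "jdbcType": 93, "typeName": "timestamp", "idx": 1},
    {"column": "worker_id", "jdbcType": -5, "typeName": "bigint", "precise": 20, "idx": 3},
]

# The field at idx 2 is not listed in columns, so it is dropped.
row = parse_line("1,2025-03-25 14:52:29,ignored,42", columns)
```

Here `row` holds `id`, `gmt_create`, and `worker_id` only; the third field never reaches the output.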

defaultLineSchemaJson Parsing

When dbsJson contains no parsing definition for a file, this configuration is used instead. By default, it takes the first column of each row (if the text file is not split by colSep, the whole row is treated as one column). You can change this parameter to define your own default parsing rules, using the same column format as the columns in dbsJson.

  • Default configuration:
    [
      {
        "column": "line",
        "jdbcType": 12,
        "typeName": "TEXT"
      }
    ]
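
The fallback rule can be sketched as a lookup that returns the columns defined in dbsJson when a matching entry exists and the default line schema otherwise. This is an illustration under assumptions: the `find_columns` helper is hypothetical, the `/data` paths are invented sample values, and matching a file by exact comparison against the `schema` and `table` values is assumed rather than documented.

```python
import json

# Default configuration as shown above.
DEFAULT_LINE_SCHEMA = json.loads(
    '[{"column": "line", "jdbcType": 12, "typeName": "TEXT"}]'
)


def find_columns(dbs, schema_path, table_path, default_columns=DEFAULT_LINE_SCHEMA):
    """Return the columns defined for a file in dbsJson, or the default line schema."""
    for db in dbs:
        for schema in db.get("schemas") or []:
            if schema.get("schema") != schema_path:
                continue
            # "tables" may be empty when no column information is defined.
            for table in schema.get("tables") or []:
                if table.get("table") == table_path and table.get("columns"):
                    return table["columns"]
    return default_columns


dbs = json.loads("""
[{"db": "cc_virtual_fs",
  "schemas": [{"schema": "/data",
               "tables": [{"table": "/data/a.csv",
                           "columns": [{"column": "id", "jdbcType": -5,
                                        "typeName": "bigint", "precise": 20,
                                        "idx": 0}]}]}]}]
""")

defined = find_columns(dbs, "/data", "/data/a.csv")   # schema found in dbsJson
fallback = find_columns(dbs, "/data", "/data/b.log")  # falls back to the default
```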

Appendix

Supported TypeNames of File Schema

| Type Name | Description |
| --- | --- |
| boolean | true/false |
| integer | 32-bit signed integer |
| bigint | 64-bit signed integer |
| decimal | Exact decimal, such as 11.1111 |
| date | Date type, such as '2025-11-11' |
| time | Time type, such as '11:11:11' |
| timestamp | Timestamp, such as '2025-11-11 11:11:11' |
| varchar | Variable-length string of no more than 2048 bytes. For strings longer than 2048 bytes, use text. |
| text | Long text, corresponding to database types such as clob and text. |
| bytes | Binary streams, such as images and videos, corresponding to database types such as blob and binary. |
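
Before submitting a dbsJson value, it can help to check each column against the type names above and the precise and scale rules from the dbsJson explanation. The following Python sketch is one way to do that. The `check_column` helper is hypothetical and is not part of BladePipe. Type names are compared case-insensitively because this page uses both `bigint` and `TEXT`; whether BladePipe itself ignores case is an assumption.

```python
KNOWN_TYPES = {
    "boolean", "integer", "bigint", "decimal", "date",
    "time", "timestamp", "varchar", "text", "bytes",
}
NEEDS_PRECISE = {"varchar", "decimal", "bytes"}  # precise must be specified
NEEDS_SCALE = {"decimal"}                        # scale must be specified


def check_column(col):
    """Return a list of problems found in one column definition (empty if none)."""
    problems = []
    type_name = str(col.get("typeName", "")).lower()
    if type_name not in KNOWN_TYPES:
        problems.append("unknown typeName: %r" % col.get("typeName"))
        return problems
    if type_name in NEEDS_PRECISE and col.get("precise") is None:
        problems.append("%s requires precise" % type_name)
    if type_name in NEEDS_SCALE and col.get("scale") is None:
        problems.append("%s requires scale" % type_name)
    return problems


issues = check_column(
    {"column": "decimal_col_new", "jdbcType": 3, "typeName": "decimal", "precise": 10}
)
```

In this usage example the decimal column has no scale, so `issues` contains a single message about the missing scale.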