Dataiku Formula 가이드
Formula language — Dataiku DSS 11 documentation
Formula language
- Basic usage
- Reading column values
- Variables typing and autotyping
- Boolean values
- Operators
- Array and object operations
- Object notations
- DSS variables
- Array functions
- Boolean functions
- Date functions
- Math functions
- Object functions
- String functions
- Geometry functions
- Value access functions
- Control structures
- Tests
DSS includes a language to write formulas, much like a spreadsheet.
Formulas can be used:
- In data preparation, to create new columns, filter rows or flag rows
- More generally, to filter rows in many places of DSS:
- In the Filtering recipe, to filter rows
- In Machine Learning, to define the extracts to use for train and test set
- In the Python and Javascript APIs, to obtain partial extracts from the datasets
- In the Public API, to obtain partial extracts from the datasets
- In the grouping, window, join and stack recipes, to perform pre and post filtering
Basic usage
Formulas define an expression that applies row per row.
Assuming that you have a dataset with columns N1 (numeric), N2 (numeric) and S (string) , here are a few example formulas:
- 2 + 2
- N1 + N2
- min(N1, N2) # Returns the smallest of N1 and N2
- replace(S, 'old', 'new') # Returns the value of S with ‘old’ replaced by ‘new’
- if (N1 > N2, 'big', 'small') # Returns big if N1 > N2, small otherwise
Note
We also have a How-To that lists common use cases with examples.
Reading column values
In almost all formulas, you’ll need to read the values of the columns for the current row.
When the column has a “simple” name (i.e: starting by a letter, contains only letters, numbers and underscores), you just need to use the name of the column in the formula: N1 + 4
For other cases, you can use:
- val("column with spaces") : returns the value of “column with spaces”
- strval("column with spaces") : returns the value of “column with spaces” as a string
- numval("column with spaces") : returns the value of “column with spaces” as a number
Warning
mycolumn evaluates to the value of the column named mycolumn
Thus, strval(mycolumn) would not work because it would try to read as a string the value of the column whose name is the value of the mycolumn column.
Instead, use strval("mycolumn").
You may also need to use strval to avoid DSS automatically typing a numerical variable. See below for more details.
Variables typing and autotyping
Variables in the formula language can have one of the following types: string, integer, decimal, array, object, boolean, date
Regardless of the schema of the dataset, or the meanings of the columns (in the case of working on dynamically-generated columns), the following rules are applied:
- If column values are “standard” decimal, they are automatically parsed to decimal
- If column values are “standard” integer, they are automatically parsed to integer
- Else, they are kept as string
Columns containing dates, arrays, objects or booleans are not automatically converted to the matching Formula type. However:
- all functions that require an array will automatically attempt to convert a string input to array (using the regular DSS JSON syntax)
- all functions that require a date will automatically attempt to convert a string input to date (using the ISO-8601 format)
In other words, if the column “begin_date” is a date (and thus contains properly-formatted ISO-8601):
- type(begin_date) returns “string”
- inc(begin_date, 2, "days") works as expected because the inc function performed automatic conversion to date
Avoiding auto-typing
There are some cases where auto-typing is not a good thing. For example, a value like 012345 would be automatically converted to the number 12345.
- To avoid autotyping to numerical, use the strval("mycolumn") function (see above)
- To manually obtain an array or object, use the parseJson function
- To manually obtain a number, use the numval("mycolumn") function
- To manually obtain a date, use the asDate function
- To manually obtain a boolean, use the asBool function
Boolean values
The formula language uses “True” and “False”, with quotes, as boolean values for true and false.
Operators
The formula language supports the classical arithmetic operators:
- Regular math operators: +, -, *, /
- Comparison operators (evaluate to booleans): >, >=, <, <=, ==, !=
- Arithmetic operators: // (integer division) and % (modulo)
- Boolean operators: &&, ||
- The + operator also performs string concatenation.
- The comparison operators (and only them) can operate on dates. For dates arithmetic, see the diff and inc functions.
Array and object operations
Formula support accessing array elements and object keys using the traditional Python/Javascript syntax:
- array[0]
- object["key"]
- object.key (only valid if ‘key’ is a valid identifier, i.e. matches [A-Za-z0-9_]*)
Note: this requires that you actually have an array or object. You might need to use the parseJson function (see above paragraph about typing)
Object notations
For all functions, you can use them the “regular” way: replace(str, 'a', 'b') or in the “object” way: str.replace('a', 'b')
In object notation, the first argument to the function is replaced by its ‘context’.
For example, the two syntaxes are equivalent:
length(trim(replace(foo, 'a', 'b')))
foo.replace('a', 'b').trim().length()
DSS variables
You can retrieve the value of DSS variables via the syntax ${variable_name} or variables["variable_name"]. Both syntaxes have slightly different behaviors.
variables[“variable_name”]
DSS evaluates variables["variable_name"] as JSON during the formula evaluation. If there is no variable defined with the supplied name, an empty cell is returned.
For example, given the following global variables
{
"max_height": 500,
"warning_msg": "too tall",
"msgs": {"ok":"OK", "over":"too tall"}
}
The formula if(height < variables["max_height"], "OK", variables["warning_msg"]) is evaluated as if(height < 500.0, "OK", "too tall"). Note that the max_height is evaluated as 500.0 as JSON consider all numbers to be decimal.
The formula if(height < variables["max_height"], "OK", variables["msgs"]["over"]) is evaluated as if(height < 500.0, "OK", "too tall").
${variable_name}
During processing DSS replaces the ${variable_name} placeholder with its verbatim content and then evaluate the formula. If there is no variable defined with the supplied name, an error is returned.
For example, given the following global variables:
{
"max_height": 500,
"warning_msg": "too tall",
"msgs": {"ok":"OK", "over":"too tall"}
}
The formula if(height < ${max_height}, "OK", "${warning_msg}") is converted into if(height < 500, "OK", "too tall") before being evaluated. Note that quotes have been added around ${warning_msg} to output an actual string.
The formula if(height < ${max_height}, "OK", ${warning_msg}) is converted into if(height < 500, "OK", too tall) before being evaluated and will therefore issue a syntax error.
The formula if(height < ${max_height}, "OK", ${msgs}["over"]) is converted into if(height < 500, "OK", {"ok":"OK", "over":"too tall"}["over"]) before being evaluated and will therefore issue a syntax error.
The formula if(height < ${max_height}, "OK", ${msgs.over}) will throw a syntax error as there is no variable named “msgs.over”.
The formula if(height < ${max_height}, "OK", parseJson('${msgs}')["over"] is converted into if(height < 500, "OK", parseJson('{"ok":"OK", "over":"too tall"}')["over"]) before being evaluated and will be evaluated as if(height < 500.0, "OK", "too tall").
Array functions
arrayContains(array a, item) boolean
Returns whether the array a contains item
arrayContains([1, 2, 3], 5) returns false
arrayDedup(array a) array
Returns array a with duplicates removed
arrayDedup([0, 1, 0, 7]) returns [0,1,7]
arrayIndexOf(array a, item) int
Returns the index (starting from 0) of item in the array a, or -1 if the item is not found in the array.
arrayIndexOf([1 , 0], 2) returns -1
arrayLen(array a) int
Returns the length of the array a
arrayLen([1,2,3]) returns 3
arrayReverse(array a) array
Reverses array a
arrayReverse([1,2,3]) returns [3,2,1]
arraySort(array a) array
Sorts array a
arraySort(["b", "1", "a"]) returns ["1","a","b"]
get(array a, from index, [to index]) Output depends on arguments
Returns o[from, to]. Omitting the optional number to will return what is at the first index specified.
get([1,2,3,4,5], 1, 4) returns [2,3,4]
get([1,2,3,4,5], 1) returns 2
join(array a, string sep) string
Returns the string obtained by joining the elements of array a with the separator sep
For example, the expression join(date_elements, '-') on a column date_elements produces:
date_elementsoutput
[2007, 7, 15] 2007-7-15 [2016, 1, 8] 2016-1-8
objectKeys(object o) array
Returns the keys of an object as an array
objectValues(object o) array
Returns the values of an object as an array
slice(object o, from index, [to index]) Output depends on arguments
If o is an array, returns an array containing the items between from and to (excluded). If o is a string, returns the part of the string between from and to (excluded). Omitting the optional number to will return the remainder of the string or array. Note: slice and substring are identical.
slice([1,2,3,4,5], 1, 4) returns [2,3,4]
slice('hello', 1) returns 'ello'
substring(object o, from index, [to index]) Output depends on arguments
If o is a string, returns the part of the string between from and to (excluded). If o is an array, returns an array containing the items between from and to (excluded). Omitting the optional number to will return the remainder of the string or array. Note: substring and slice are identical.
substring('0123456', 2, 5) returns '234'
substring([1,2,3,4,5,6], 3) returns [4,5,6]
Boolean functions
asBool(o) boolean
Returns o converted to a boolean.
asBool(0) returns false
asBool('yes') returns true
isFalse(boolean b) boolean
Returns whether b is false.
isFalse('false') returns false (the argument is a string, not a boolean, so this method returns false)
isFalse(asBool(0)) returns true
isTrue(boolean b) boolean
Returns whether b is true.
isTrue('true') returns false (the argument is a string, not a boolean, so this method returns false)
isTrue(asBool(1)) returns true
and(boolean a, boolean b) boolean
Evaluates logical AND (conjunction) on several statements. All conditions need to be fulfilled for the function to return true. Is equivalent to a && b.
and(1==1, 3<4) returns true
not(boolean b) boolean
Evaluates logical NOT (negation) of a statement, returning the opposite of b.
not(1!=1) returns false
or(boolean a, boolean b) boolean
Evaluates logical OR (disjunction) on several statements. At least one condition needs to be fulfilled for the function to return true. Is equivalent to a || b.
or(1==1, 3>4) returns true
Date functions
asDate(object o, [format1, …]) date
Returns o converted to a date, based on the format(s) provided. If your column contains more than one date format, you can give an ordered list of possible formats. If you don’t give a format, ISO-8601 is used by default (see more examples of the syntax for ISO-8601). Here is some example syntax:
- ‘y’ (Year)
- ‘M’ (month in year)
- ‘w’ (week in year)
- ‘W’ (week in month)
- ‘d’ (day in month)
- ‘D’ (day in year)
- ‘E’ (day name in week)
- ‘u’ (day number in week. Monday =1)
- ‘H’ (hour in day, 0-23)
- ‘m’ (minute)
- ‘s’ (second)
- S (millisecond)
asDate('2020-04-15', 'yyyy-MM-dd') returns 2020-04-15T00:00:00.000Z
datePart(date d, string part, [timezone]) Output depends on arguments
Extracts a date component from date d. Returned date components are always in local timezone–unless the optional timezone is included–and in English (see a complete list of supported timezones.). The available components are:
- ‘years’ (or ‘year’)
- ‘isoWeekYear’ gives the ISO week-numbering year
- ‘months’ (or ‘month’)
- ‘weekOfYear’ gives the week of the year (as per server locale’s convention)
- ‘isoWeekOfYear’ gives the ISO week number of the year
- ‘weeks’ (or ‘week’ or ‘w’)
- ‘isoWeekOfMonth’ gives the ISO week number of the month
- ‘days’ (or ‘day’ or ‘d’)
- ‘weekday’ (the name of the day of the week, capitalized)
- ‘dayofweek’ (day of the week as a number. Monday=1, Tuesday=2, … Sunday=7)
- ‘hours’ (or ‘hour’ or ‘h’)
- ‘minutes’ (or ‘minute’ or ‘min’)
- ‘seconds’ (or ‘second’ or ‘s’)
- ‘unixTime’ (number of seconds since epoch)
- ‘millisecond’ (or ‘millisecond’, or ‘ms’)
- ‘time’ (number of milliseconds since epoch)
datePart('2020-04-15T00:00:00.000Z', 'weekday') returns "Wednesday"
datePart('2020-01-01T00:00:00.000Z', 'year', '-08:00') returns 2019
diff(date d1, date d2, [string unit]) number
Returns the difference between two dates expressed in given time units. The default unit of time is days. The available units are:
- ‘years’ (or ‘year’)
- ‘months’ (or ‘month’)
- ‘weeks’ (or ‘week’ or ‘w’)
- ‘days’ (or ‘day’ or ‘d’)
- ‘hours’ (or ‘hour’ or ‘h’)
- ‘minutes’ (or ‘minute’ or ‘min’)
- ‘seconds’ (or ‘second’ or ‘s’)
diff('2019-03-15T00:00:00.000Z', '2020-04-15T00:00:00.000Z', 'month') returns -13
inc(date d, number value, string unit) date
Returns d incremented or decremented by the number value in the specified unit of time; a decimal number value will be truncated to an integer. The available units are:
- ‘years’ (or ‘year’)
- ‘months’ (or ‘month’)
- ‘weeks’ (or ‘week’ or ‘w’)
- ‘days’ (or ‘day’ or ‘d’)
- ‘hours’ (or ‘hour’ or ‘h’)
- ‘minutes’ (or ‘minute’ or ‘min’)
- ‘seconds’ (or ‘second’ or ‘s’)
inc('2020-04-15T00:00:00.000Z', -3, 'week') returns 2020-03-25T00:00:00.000Z
now() date
Returns the current time
trunc(date d, string unit) date
Returns d truncated to the unit specified. The available units are:
- ‘years’ (or ‘year’)
- ‘months’ (or ‘month’)
- ‘weeks’ (or ‘week’ or ‘w’)
- ‘days’ (or ‘day’ or ‘d’)
- ‘hours’ (or ‘hour’ or ‘h’)
- ‘minutes’ (or ‘minute’ or ‘min’)
- ‘seconds’ (or ‘second’ or ‘s’)
trunc('2020-04-03T07:47:45.245Z', 'month') returns 2020-04-01T00:00:00.000Z
Math functions
abs(number d) number
Returns the absolute value of a number
abs(-7) returns 7.0
acos(number d) number
Returns the arc cosine of an angle, in the range 0 through PI
acos(-1) returns 3.141592654
asin(number d) number
Returns the arc sine of an angle in the range of -PI/2 through PI/2
asin(1) returns 1.570796327
atan(number d) number
Returns the arc tangent of an angle in the range of -PI/2 through PI/2
atan2(number x, number y) number theta
Converts rectangular coordinates (x, y) to polar (r, theta)
ceil(number n) number
Returns the ceiling of a number
ceil(4.67) returns 5
ceil(-4.67) returns -4
combin(number n, number k) number
Returns the number of combinations for n elements divided into groups of k, n!/k!(n-k)!
combin(6, 2) returns 15
cos(number d) number
Returns the trigonometric cosine of an angle
cos(0) returns 1.0
cosh(number d) number
Returns the hyperbolic cosine of a value
cosh(0) returns 1.0
dec2hex(long) string
Returns an hexadecimal representation of the input number
dec2hex(10) returns a dec2hex(256) returns 100
degrees(number d) number
Converts an angle from radians to degrees.
degrees(PI()) returns 180.0
even(number n) number
Rounds the number up to the nearest even integer
even(3) returns 4.0
even(2.3) returns 4.0
even(-2.3) returns -2.0
even(-3) returns -2.0
exp(number n) number
Returns the exponential of a number
exp(2) returns 7.38905609893065
fact(number i) number
Returns the factorial of a number i
fact(4) returns 24
factn(number i, number d) number
Returns the factorial of a number i, omitting every dth item from the multiplication
factn(7, 3) returns 28
floor(number d) number
Returns the floor of a number
floor(4.7) returns 4
gcd(number d, number e) number
Returns the greatest common denominator of the two numbers
gcd(21, 28) returns 7.0
hash(string) long
Returns a 64 bits numerical hash of the input (not crypto-secure)
hash("goo") returns 2774880816139997119
hex2dec(string) long
Returns a decimal representation of an hexadecimal string
hex2dec("a") returns 10
lcm(number d, number e) number
Returns the least common multiple of the two numbers
lcm(20, 42) returns 420.0
ln(number n) number
Returns the natural log of a number
ln(exp(1)) returns 1.0 ln(2.72) returns 1.000631880307906
log(number n) number
Returns the base 10 log of a number
log(100) returns 2.0
max(a, b, …) Output depends on arguments
Returns the greater of two or more numbers, two or more strings, or the more recent of two or more dates
max(-1, 3) returns 3
max('luke', 'leia') returns 'luke'
max('2020-01-01T00:00:00.000Z', '2021-01-01T00:00:00.000Z') returns 2021-01-01T00:00:00.000Z
min(a, b, …) number
Returns the smaller of two or more numbers, two or more strings, or older of two or more dates
min(-1, 3) returns -1.0
min('luke', 'leia') returns 'leia'
min('2020-01-01T00:00:00.000Z', '2021-01-01T00:00:00.000Z') returns 2020-01-01T00:00:00.000Z
mod(number a, number b) number
Returns a modulus b
mod(5, 3) returns 2
mod(7.8, 3) returns 1
multinomial(number d1, number d2 …) number
Returns the multinomial of a series of numbers, (sum(d1, d2, d3,...))! / d1! * d2! * d3! ...
multinomial(1, 1, 2, 1) returns 60
odd(number d) number
Rounds the number up to the nearest odd integer
odd(5.3) returns 7.0
odd(-5.3) returns -5.0
PI() number
Returns the value of PI
pow(number a, number b) number
Returns a to the power of b
pow(2, -1) returns 0.5
quotient(number numerator, number denominator) number
Returns the integer portion of a division
quotient(7, 2) returns 3.0
radians(number d) number
Converts an angle in degrees to radians
radians(180) returns 3.141592653589793
rand([int min], [int max]) double or int
Without arguments, returns a random float between 0 and 1. With min and max arguments, returns a random integer between min (inclusive) and max (exclusive)
round(number n) number
Returns the rounding of number to the nearest integer
round(3.5) returns 4.0
round(-3.5) returns -3.0
sin(number d) number
Returns the trigonometric sine of an angle
sin(radians(90)) returns 1.0
sinh(number d) number
Returns the hyperbolic sine of an angle
sinh(0) returns 0.0
sqrt(number n) number
Returns the square root of number n.
sqrt(81) returns 9.0
sum(array a) number
Sums the numbers of an array. Skips non-number elements from the array.
sum([1, 2, "string", 3]) returns 6.0
tan(number d) number
Returns the trigonometric tangent of an angle
tan(0) returns 0.0
tanh(number d) number
Returns the hyperbolic tangent of a value
toNumber(o) number
Returns o converted to a number
toNumber("5") returns 5
Object functions
get(object o, string field, [string defaultValue]) Output depends on arguments
Returns o[from]. If o[from] is empty and default has been provided, returns default.
get(parseJson('{"name":"joe", "age":42}'), "age") returns 42
get(parseJson('{"name":"joe"'), "age", 37) returns 37
hasField(object o, string name) boolean
Returns whether o has the field name
hasField(parseJson('{""name"": ""joe"", ""age"": 42 }'), ""age"") returns true
htmlAttr(Element e, String s) string
Selects a value from an attribute on an Html Element
htmlText(Element e) string
Selects the text from within an element (including all child elements)
htmlAttr(select(parseHtml('<div><a href="www.dataiku.com"></div>'), '[href]')[0], "href") returns "www.dataiku.com"
innerHtml(Element e) string
The innerHtml of an HTML element
jsonize(value) JSON literal value
Quotes a value as a JSON literal value
objectDel(object o, key, [key…]) object
Removes one or several keys from an object and returns it. The keys must not be null
objectDel(parseJson('{"firstName": "birdie", "company": "Dataiku" }'), 'firstName') returns {"company":"Dataiku"}
objectNew(k1, v1, k2, v2, …) object
Creates a new object, optionally pre-filled with key/values. Must get an even number of arguments, as successive key-value pairs. Giving 0 arguments is possible and will return an empty object (you can use objectPut to add to it)
objectNew("firstName", "birdie", "company", "Dataiku") returns {"firstName": "birdie", "company": "Dataiku" }
objectPut(object o, key, value) object
Adds a key/value pair to an object and returns it. The key must not be null
ownText(Element e) string
Gets the text owned by this HTML element only; does not get the combined text of all children.
parseHtml(string s) HTML object
Parses a string as HTML
parseJson(string s) object or array
Parses a JSON string as an object or array
select(Element e, String s) HTML Elements
Selects an element from an HTML elementn using selector syntax
type(object o) string
Returns the type of o
type(3.126) returns number
String functions
chomp(string s string tail) string
Removes tail from the end of s if it’s there, otherwise leaves it alone.
chomp("foobar", "bar") returns "foo"
coalesce(value1, value2, …) Output depends on arguments
Returns the first non-empty value.
coalesce("", "foo") returns "foo"
concat((object a1, [object a2, …])) Output depends on arguments
Returns a string of concatenated values. If one of your columns contains leading 0 (zeros), wrap it in a strval('column_name') to preserve them.
concat("Birds", " ", "fly") returns "Birds fly"
concat(1, 2, 3) returns "123"
contains(string s, string frag) boolean
Returns whether s contains frag
contains("hello world ", "llo") returns true
endsWith(string s, string tail) boolean
Returns whether s ends with tail
endsWith("hello world", "rld") returns true
escape(string s, string mode) string
Escapes a s using the escaping mode specified. Supported modes: ‘html’, ‘xml’, ‘csv’, ‘url’, javascript’. Note that the escape uses the standard java URLEncoder for urls and StringEscapeUtils for others
format(string format, object… args) string
Formats a string using printf-like formatting using the Java Formatter syntax.
'%4d-%02d'.format(2004,2) returns '2004-02'
fromBase64(string s, [string charset]) string
Returns the string whose Base64 representation is given. By default, the string is read using the UTF-8 charset.
fromBase64('SA==') returns "H"
get(string s, from index, [to index]) Output depends on arguments
Returns s.substring(from, to). Omitting the optional number to will return what is at the first index specified.
get('Oh no, kittens!', 0, 5) returns 'Oh no'
get('Oh no, kittens!', 1) returns 'h'
indexOf(string s, string sub) number
Returns the index of the first occurrence of sub in s. Index begins at 0. Returns -1 if there is no such occurrence.
indexOf("hello world", "world") returns 6
lastIndexOf(string s, string sub) number
Returns the index of the last occurrence of sub in s. Index begins at 0. Returns -1 if there is no such occurrence.
lastIndexOf("hello world", "o") returns 7
length(array or string o) number
Returns the length of o
length("hello world") returns 11
length([4,5,6]) returns 3
match(string a, string or regexp) array of strings
Returns an array of the matching groups found in s. Groups are designated by () within the specified string or regular expression.
match('hello world', 'he(.*)wo(rl)d') returns ["llo ","rl"]
md5(string s) string
Returns the MD5 hash of a string
md5('hi') returns "49f68a5c8493ec2c0bf489821c21fc3b"
partition(string s, string or regex frag, [boolean omitFragment]) array
Returns an array of strings [a,frag,b] where a is the part before the first occurrence of frag in s and b is the part after the occurrence. If omitFragment is true, frag is not returned in the array.
partition(""hello"", ""he"") returns ["""",""he"",""llo""]
partition(""hello"", ""he"", asBool(1)) returns ["""",""llo""]
partition(""hello"", /.l/)" returns [""h"",""el"",""lo""]
replace(string s, string or regex f, string replacement) string
Replaces all occurrences of substring / regex f found in string s with the replacement string.
replace('hello world', 'hel', 'a') returns "alo world"
replace(""Oh my!"", /\w/, 'x') returns "xx xx!"
replaceChars(string s, string f, string r) string
Returns the string obtained by replacing all character in s that match f with the character in r at that same position. The function can be used to delete characters by replacing them with nothing.
replaceChars('abcba', 'bc', 'BC') returns aBCBa replaceChars('abcba', 'bc', 'Z') returns aZZa
rpartition(string s, string or regex frag, [boolean omitFragment]) array
Returns an array of strings [a,frag,b] where a is the part before the last occurrence of frag in s and b is the part after the occurrence. If omitFragment is true, frag is not returned in the array.
rpartition("hello world", "o") returns ["hello w","o","rld"]
sha1(string s) string
Returns the SHA-1 hash of a string
sha1('goo') returns 3f95edc0399d06d4b84e7811dd79272c69c8ed3a
sha256(string s) string
Returns the SHA-256 hash of a string
sha512(string s) string
Returns the SHA-512 hash of a string
split(string s, string or regex sep, [boolean preserveAllTokens]) array
Returns the array of strings obtained by splitting s with separator sep. If preserveAllTokens is true, then empty segments are preserved.
split(""hello"", ""he"") returns [""llo""]
split(""hello"", ""he"", asBool(1)) returns ["""",""llo""]
split(""hello"", /.l/) returns [""h"",""lo""]
split("Hello world!", /\s+/) returns ["Hello","world!"]
splitByCharType(string s) array
Returns an array of strings obtained by splitting s grouping consecutive chars by their unicode type
splitByCharType("Hello_world 101!?!") returns ["H","ello","_","world"," ","101","!?!"]
splitByLengths(string s, number length1, […]) array
Returns an array of strings obtained by splitting s into substrings, each with its own specified length.
splitByLengths("Hello world", 1, 2, 3, 4) returns ["H","el","lo ","worl"]
startsWith(string s, string sub) boolean
Returns whether s starts with sub
startsWith("Hello world", "He") returns true
strip(string s) string
Returns copy of the string, with leading and trailing whitespace omitted. This function is the same as trim.
strip(" Hello World ") returns "Hello World"
toBase64(string s, [string charset]) string
Returns the Base64 representation of a string. By default, the string is written using the UTF-8 charset.
toBase64("H") returns "SA=="
toLowercase(string s) string
Converts a string to lowercase
toLowercase("HELLO WORLD") returns "hello world"
toString(o, string format (optional)) string
Returns o converted to a string
toString(5) returns "5"
toTitlecase(string s) string
Converts a string to titlecase
toTitlecase("hello world") returns "Hello World"
toUppercase(string s) string
Converts a string to uppercase
toUppercase("hello world") returns "HELLO WORLD"
trim(string s) string
Returns copy of the string, with leading and trailing whitespace omitted.
trim(" Hello World ") returns "Hello World"
unescape(string s, string mode) string
Unescapes all escaped parts of the string depending on the given escaping mode. Available modes: ‘html’, ‘xml’, ‘csv’, ‘url’, ‘javascript’
unicode(string s) string
Returns the input string as an array of the unicode codepoints (numbers)
unicode("Hi!") returns [72,105,33]
unicodeType(string s) string
Returns an array of strings describing each character of the input string in their full unicode notation
unicodeType("y 0H?") returns ["lowercase letter","space separator","decimal digit number","uppercase letter","other punctuation"]
uuid() string
Returns a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudo random number generator.
uuid() returns 7af14645-bcd9-4af2-bca8-390fc67e9a0b
Geometry functions
geoBuffer(geometry geom, double distance, [int quadrantSegment]) string
Returns a geometry that represents all points whose distance from this geometry is less than or equal to distance. The distance parameter can be either a value or a column name from the dataset. The distance unit depends on the CRS of the given geometries (e.g., degrees for SRID=4326). A negative distance can be used with polygons, which will shrink the polygon rather than expand it. Specify an optional quadrantSegment value to set the number of segments to approximate a quarter circle (the default is 8)
geoContains(geometry geomA, geometry geomB) bool
Compute the boolean {geomA contains geomB} for two input geometries A and B.
The implementation of this function depends on the engine it is run on but its definition is standard: “Geometry A contains Geometry B if and only if no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A.” (https://postgis.net/docs/ST_Contains.html)
A polygon A does not contain a point B if B is exactly on its boundary.
geoContains("POLYGON((0 0,3 0,0 3,0 0))", "POINT(1 1)") returns true
geoContains("POLYGON((0 0,3 0,0 3,0 0))", geometry) returns the evaluation of the boolean predicate for each input geometry row of dataset column geometry
geoEnvelope(geometry geom) string
Returns the envelope of a geometry, i.e., the minimum bounding box containing this geometry.
geoMakeValid(geometry geom) string
Returns a valid representation of an invalid geometry. Valid geometries remain unchanged.
Note
This function can return slightly different results whether the recipe engine is PostgreSQL or a local stream. If the method is unable to make a geometry valid, it returns an empty cell.
geoSimplify(geometry geom, double toleranceDistance) string
geoWithin(geometry geomA, geometry geomB) boolReturns a simpler geometry based on the Douglas-Peucker algorithm, with respect to a non-negative toleranceDistance parameter. The simplification of the geometry will result in a geometry with fewer vertices. A vertex is removed only if the distance between this vertex and the edge resulting from removing this vertex is within the specified toleranceDistance. The toleranceDistance parameter can be either a value or a column name from the dataset. The toleranceDistance unit depends on the CRS of the given geometries (e.g., degrees for SRID=4326).
Returns a result opposite to geoContains (see above). geomA is within geomB if geomB contains geomA.
Value access functions
numval(object o) number
Returns the numercial value of a column. If the value is not numerical, an empty value is returned.
Note
Use numval when your column name contains spaces or periods to ensure DSS parses it properly as in numval("my.column") or numval("my column"). Remember to always write the column name in “quotes” when using numval.
strval(object o, [defaultValue]) string
Returns the string value of a column. If the value is not a string, an empty value is returned. Specify an optional defaultValue to replace empty cells with defaultValue.
Note
Use strval when your column name contains spaces or periods to ensure DSS parses it properly as in strval("my.column") or strval("my column"). Remember to always write the column name in “quotes” when using strval.
val(object o, [defaultValue]) Output depends on argument
Returns the value of a column. Specify an optional defaultValue to replace empty cells with defaultValue. The type of the resulting column is auto-detected (use strval or numval to force it).
Note
Use val when your column name contains spaces or periods to ensure DSS parses it properly as in val("my.column") or val("my column"). Remember to always write the column name in “quotes” when using val.
Control structures
Control structures allow you to perform advanced operations.
Beware: control structures cannot use object notation !
filter(array a, variable v, expression e) array
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array if the result is truish.
filter(['aa', 'bb', 'cc', 'ab', 'bc'], item, item.startsWith('a')) returns ["aa","ab"]
Note
filter(myarray, v, v < 2) is equivalent to this Python syntax: [v for v in myarray if v < 2] and returns the array with only the elements below 2
forEeach(array a, variable v, expression e) array
Evaluates expression a to an array. Then for each array element, binds its value to variable name v, evaluates expression e, and pushes the result onto the result array.
forEach(['aa', 'bb', 'cc', 'ab', 'bc'], item, item.startsWith('a')) returns [true,false,false,true,false]
Note
forEach(myarray, v, v + 2) is equivalent to this Python syntax: [v+2 for v in myarray]
forEachIndex(array a, variable i, variable v, expression e) array
Evaluates expression a to an array. Then for each array element, binds its value to variable name v and its index to variable name i, evaluates expression e, and pushes the result onto the result array.
forEachIndex(['aa', 'bb', 'cc', 'ab', 'bc'], index, item, index < 2 || item.startsWith('a')) returns [true,true,false,true,false]
Note
forEachIndex(myarray, i, v, v + i) is equivalent to this Javascript syntax: myarray.map(function(v, i) { return v+i ;})
forRange(from, to, step, variable v, expression e) array
Iterates, starting at from, incrementing by step each time while less than to. At each iteration, binds the variable v to the iteration value, evaluates expression e, and pushes the result onto the result array.
forRange(0, 21, 3, v, v) returns [0,3,6,9,12,15,18] forRange(0, 21, 3, v, v*5) returns [0,15,30,45,60,75,90]
Note
forRange(0, 100, 3, v, v * 2) is equivalent to this Python syntax: [v * 2 for v in xrange(0, 100, 3)] and returns [0, 6, 12, 18 …. 198]
if(boolean, expression_true, expression_false) Output depends on expression
Evaluates to expression_true if the condition is true, to expression_false otherwise
if(3>2, "I heart #s", "my brain hurts") returns "I heart #s"
objectFilter(expression a, variable k, variable v, expression test) object
Evaluates expression a to an object. Then for each element (k, v) of this object, binds its key to variable name k, its value to variable name v, evaluates expression test which should return a boolean. If the boolean is true, pushes (k, v) onto the result object.
objectFilter('{"fName":"Joe", "lName":"Smith", "age":42 }', k, v, k.contains("Name")) returns {"fName":"Joe","lName":"Smith"}
switch(expression_to_match, match_1, return_1, match_2, return_2, …, [return_default]) Output depends on expression
Compares expression_to_match to match_1, then to match_2, etc. When match is found, returns the corresponding return_i. If no match is found, returns return_default, or nothing if there is no return_default.
switch("Paris", "Paris", 1, "New York", 2, 0) returns 1
switch("Berlin", "Paris", "1", "New York", "2", "Other") returns "Other"
switch(col1, col2+col3, "sum", "Other") returns "sum" if value of column col1 equals value of col2+col3, "Other" otherwise
switch("true", col1 >= 10, "good", col1 >= 5 "okay", "bad") returns "good", "okay", or "bad" depending on col1 value
with(expression o, variable v, expression e) Output depends on expression
Evaluates expression o and binds its value to variable name v. Then evaluates expression e and returns that result. The with control allows you to “split” a very big expression into more manageable chunks. It also makes reusing the result of a complex computation easier to read and faster to process.
with("european union".split(" "), a, a.length()) returns 2
with("european union".split(" "), a, forEach(a, v, v.length())) returns [8,5]
Tests
Beware: tests cannot use object notation
isBlank(expression o) boolean
Returns whether o is null or an empty string. IsBlank() returns true when value is null or empty (“”). It does not consider a string made of blank spaces as blank; it considers an expression with an error as blank.
isBlank("") returns true
isBlank(" ") returns false
isBlank(abs("a")) returns true
isError(expression o) boolean
Returns whether o is an error
isError(abs("a")) returns true
isNonBlank(expression o) boolean
Returns whether o is non-null or a not-empty string. IsNonBlank() returns false when value is null or empty (“”). It considers a string made of blank spaces as NonBlank; it considers an expression with an error as blank.
isNonBlank("") returns false
isNonBlank(" ") returns true
isNonBlank(abs("a")) returns false
isNotNull(expression o) boolean
Returns whether o is not null. IsNotNull() returns false when value is null or empty (“”). It does not consider a string made of blank spaces as not null, and considers an expression with an error as not null.
isNotNull("") returns false
isNotNull(" ") returns false
isNotNull(abs(a")) returns true
isNull(expression o) boolean
Returns whether o is null or an empty string. IsNull() returns true when value is null or empty (“”). It considers a string made of blank spaces as null, and does not consider an expression with an error as null.
isNull("") returns true
isNull(" ") returns true
isNull(abs("a")) returns false
isNumeric(expression o) boolean
Returns whether o can represent a number
'Dataiku' 카테고리의 다른 글
데이터이쿠(Dataiku), AI MLOps (0) | 2023.10.27 |
---|---|
기업 데이터 분석 도구 데이터이쿠(Dataiku) (0) | 2023.10.26 |
데이터이쿠 가이드, 단축키 웹페이지 (0) | 2023.01.03 |