Create a custom Transformer in PySpark ML

Can I extend the default one?

Not really. The default Tokenizer is a subclass of a Java-backed wrapper class and, like the other transformers and estimators that ship with PySpark ML, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend the Python Transformer base class directly.

import nltk
from pyspark import keyword_only
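
To make the difference concrete, here is a toy model in plain Python, with no Spark involved. Every class name below (ToyTransformer, ToyBackend, ToyWrappedTokenizer, PurePythonTokenizer) is a hypothetical stand-in rather than a PySpark class, the "dataset" is just a list of dicts, and a simple regular expression stands in for an nltk tokenizer. It is only a sketch of the delegation pattern described above, not how PySpark is actually implemented.

```python
import re
from abc import ABC, abstractmethod


class ToyTransformer(ABC):
    """Hypothetical stand-in for a Python-side transformer base class."""

    def transform(self, rows):
        # The public entry point only dispatches to the subclass hook.
        return self._transform(rows)

    @abstractmethod
    def _transform(self, rows):
        ...


class ToyBackend:
    """Hypothetical stand-in for the Scala object that does the real work."""

    def transform(self, rows, input_col, output_col):
        # Lowercase and split on whitespace, adding the result as a new column.
        return [{**r, output_col: r[input_col].lower().split()} for r in rows]


class ToyWrappedTokenizer(ToyTransformer):
    """Mimics a wrapper: it holds no tokenizing logic and only forwards calls."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self._backend = ToyBackend()

    def _transform(self, rows):
        # Nothing to customize here: the behavior lives in the backend.
        return self._backend.transform(rows, self.input_col, self.output_col)


class PurePythonTokenizer(ToyTransformer):
    """Extends the base class directly, so the logic lives in Python."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def _transform(self, rows):
        # Split into runs of word characters and runs of punctuation.
        pattern = r"\w+|[^\w\s]+"
        return [
            {**r, self.output_col: re.findall(pattern, r[self.input_col])}
            for r in rows
        ]


rows = [{"text": "Hello, world!"}]
wrapped = ToyWrappedTokenizer("text", "tokens").transform(rows)
custom = PurePythonTokenizer("text", "tokens").transform(rows)
print(wrapped[0]["tokens"])  # ['hello,', 'world!']
print(custom[0]["tokens"])   # ['Hello', ',', 'world', '!']
```

The point of the sketch is that subclassing the wrapper gives you nothing to hook into on the Python side, whereas subclassing the base class lets you put arbitrary Python code in the transform step.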