Create a custom Transformer in PySpark ML

Can I extend the default one? Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators in pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly. The building blocks are nltk for the tokenization itself and the keyword_only decorator from pyspark (in Spark < 2.0 it lived in pyspark.ml.util), as shown in the sketch below.
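A minimal sketch of such a custom Transformer, assuming Spark 2.x or later. The class name NLTKWordPunctTokenizer and its stopwords parameter are illustrative, not part of any library API; the tokenization itself is delegated to nltk inside a Python UDF:

```python
import nltk

from pyspark import keyword_only  # Spark < 2.0: from pyspark.ml.util import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


class NLTKWordPunctTokenizer(Transformer, HasInputCol, HasOutputCol):
    """Illustrative Transformer: tokenize a string column with NLTK and
    drop a configurable list of stopwords."""

    # Custom Param, declared on the class so it participates in param copying
    stopwords = Param(Params._dummy(), "stopwords", "tokens to drop",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(NLTKWordPunctTokenizer, self).__init__()
        self._setDefault(stopwords=[])
        # keyword_only stores the passed kwargs on the instance
        # (older Spark versions exposed them as self.__init__._input_kwargs)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, stopwords=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setStopwords(self, value):
        return self._set(stopwords=value)

    def getStopwords(self):
        return self.getOrDefault(self.stopwords)

    def _transform(self, dataset):
        # _transform is the only method a Transformer must implement
        stopwords = set(self.getStopwords())

        def tokenize(s):
            tokens = nltk.tokenize.wordpunct_tokenize(s)
            return [t for t in tokens if t.lower() not in stopwords]

        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(tokenize, ArrayType(StringType()))(in_col))
```

Because all the work happens inside a Python UDF, an instance such as NLTKWordPunctTokenizer(inputCol="sentence", outputCol="words", stopwords=["a", "the"]) can be dropped into a Pipeline next to JVM-backed stages. If you also need to save and load the pipeline (Spark ≥ 2.3), mix in pyspark.ml.util.DefaultParamsReadable and DefaultParamsWritable.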