Use AWS Glue Python with NumPy and Pandas Python Packages

You can check latest python packages installed using this script as glue job

import logging
import pip
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

if __name__ == '__main__':
    logger.info(pip._internal.main(['list']))

As of 30-Jun-2020 Glue as has these python packages pre-installed. So numpy and pandas is covered.

awscli 1.16.242
boto3 1.9.203
botocore 1.12.232
certifi 2020.4.5.1
chardet 3.0.4
colorama 0.3.9
docutils 0.15.2
idna 2.8
jmespath 0.9.4
numpy 1.16.2
pandas 0.24.2
pip 20.0.2
pyasn1 0.4.8
PyGreSQL 5.0.6
python-dateutil 2.8.1
pytz 2019.3
PyYAML 5.2
requests 2.22.0
rsa 3.4.2
s3transfer 0.2.1
scikit-learn 0.20.3
scipy 1.2.1
setuptools 45.1.0
six 1.14.0
urllib3 1.25.8
virtualenv 16.7.9
wheel 0.34.2

You can install additional packages in glue-python if they are present in the requirements.txt used to build the attaching .whl. The whl file gets collected and installed before your script is kicked-off. I would also suggest you to look into Sagemaker Processing which is more easier for python based jobs. Unlike serveless instance for glue-python shell, you are not limited to 16gb limit there.

Leave a Comment