How to change memory per node for apache spark worker

With 1.0.0+ and spark-shell or spark-submit, use the --executor-memory option. E.g. spark-shell --executor-memory 8G … For 0.9.0 and earlier: change the memory when you start a job or start the shell. We had to modify the spark-shell script so that it would carry command-line arguments through as arguments for the underlying Java application. … Read more
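As a sketch, a typical spark-submit invocation with this option might look like the following; the class name, jar name, and memory sizes are illustrative placeholders, not values from the answer:

```shell
# Illustrative only: class, jar, and sizes are placeholders.
spark-submit \
  --class com.example.MyApp \
  --executor-memory 8G \
  --driver-memory 2G \
  myapp.jar
```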

Pandas dataframe groupby text value that occurs in two columns

This seems like a graph problem. You could try to use networkx:

import networkx as nx
G = nx.from_pandas_edgelist(df, 'v1', 'v2')
clusters = list(nx.connected_components(G))

output:

[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'}, {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'}, {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'}, {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]

As a graph: Small … Read more
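The same grouping can be sketched without networkx using a small union-find, which is what connected components amounts to here. The pairs below are illustrative, picked from the answer's output, not the full dataset:

```python
from collections import defaultdict

# Illustrative edge list: each pair links two words that co-occur.
pairs = [("delay", "increase"), ("increase", "decrease"),
         ("analyze", "assay"), ("be", "belong")]

parent = {}

def find(x):
    """Return the representative of x's group (path-halving union-find)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the groups containing a and b."""
    parent[find(a)] = find(b)

for a, b in pairs:
    union(a, b)

# Collect words by their group representative.
clusters = defaultdict(set)
for word in parent:
    clusters[find(word)].add(word)

print(sorted(map(sorted, clusters.values())))
# → [['analyze', 'assay'], ['be', 'belong'], ['decrease', 'delay', 'increase']]
```

Words joined through a chain of pairs (delay–increase, increase–decrease) end up in one cluster even though not every pair appears directly, which is exactly the transitive behavior the question asks for.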

Spark spark-submit –jars arguments wants comma list, how to declare a directory of jars?

This way it worked easily, instead of specifying each jar with its version separately:

#!/bin/sh
# build all other dependent jars in OTHER_JARS
JARS=`find ../lib -name '*.jar'`
OTHER_JARS=""
for eachjarinlib in $JARS ; do
    if [ "$eachjarinlib" != "APPLICATIONJARTOBEADDEDSEPERATELY.JAR" ]; then
        OTHER_JARS=$eachjarinlib,$OTHER_JARS
    fi
done
echo "final list of jars are : $OTHER_JARS"
echo $CLASSPATH

spark-submit … Read more
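When all the jars sit in one directory, the comma list can also be built in a single line with a glob; a sketch that assumes no spaces in file names, with lib/ and the jar names as placeholders:

```shell
# Sketch: join every jar under lib/ into a comma-separated list.
# Breaks on file names containing spaces; paths are placeholders.
cd "$(mktemp -d)"
mkdir lib
touch lib/a.jar lib/b.jar
JARS=$(echo lib/*.jar | tr ' ' ',')
echo "$JARS"    # lib/a.jar,lib/b.jar
```

The resulting $JARS value can be passed directly as the argument to --jars.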

Set hadoop system user for client embedded in Java webapp

Finally I stumbled on the constant static final String HADOOP_USER_NAME = "HADOOP_USER_NAME"; in the UserGroupInformation class. Setting this either as an environment variable, as a Java system property on startup (using -D), or programmatically with System.setProperty("HADOOP_USER_NAME", "hduser"); makes Hadoop use whatever username you want for connecting to the remote Hadoop cluster.
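A minimal sketch of the programmatic route; HadoopUserDemo is a hypothetical class name, and the property has to be set before the first Hadoop class consults UserGroupInformation:

```java
// Sketch: set HADOOP_USER_NAME before any Hadoop code runs.
// "hduser" is the example username from the answer above.
public class HadoopUserDemo {
    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "hduser");
        // Any FileSystem.get(...) call made after this point would now
        // act as "hduser" (simple-auth clusters only, not Kerberos).
        System.out.println(System.getProperty("HADOOP_USER_NAME"));
    }
}
```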

MPI: blocking vs non-blocking

Blocking communication is done using MPI_Send() and MPI_Recv(). These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that once MPI_Send() returns, the buffer passed to it can safely be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive … Read more
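The distinction can be sketched outside MPI with Python's queue module; this is an analogy, not MPI itself: a blocking get() behaves like MPI_Recv (it does not return until data is available), while get_nowait() behaves like posting MPI_Irecv and immediately testing for completion:

```python
import queue
import threading

q = queue.Queue()

# Non-blocking probe: returns (here, raises) immediately when no data
# is available, analogous to MPI_Irecv + MPI_Test reporting "not done".
try:
    q.get_nowait()
except queue.Empty:
    print("nothing yet")

# A "sender" in another thread delivers the message 0.1 s later.
threading.Timer(0.1, q.put, args=("payload",)).start()

# Blocking receive: does not return until the item has arrived,
# analogous to MPI_Recv.
print(q.get())
```

The blocking call makes the wait-for-completion implicit; the non-blocking pair lets the caller overlap other work and check back later, which is the same trade-off MPI offers.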

Scaling solutions for MySQL (Replication, Clustering)

I’ve been doing A LOT of reading on the available options. I also got my hands on High Performance MySQL 2nd edition, which I highly recommend. This is what I’ve managed to piece together: Clustering Clustering in the general sense is distributing load across many servers that appear to an outside application as one server. … Read more