How to change memory per node for Apache Spark worker

With Spark 1.0.0+, when using spark-shell or spark-submit, use the --executor-memory option, e.g. spark-shell --executor-memory 8G … With 0.9.0 and earlier, set the memory when you start a job or start the shell. We had to modify the spark-shell script so that it would pass command-line arguments through to the underlying Java application.

Pandas dataframe groupby text value that occurs in two columns

This seems like a graph problem. You could try using networkx:

    import networkx as nx

    # Treat each (v1, v2) row as an edge between two words
    G = nx.from_pandas_edgelist(df, 'v1', 'v2')
    # connected_components returns a generator, so wrap it in list()
    clusters = list(nx.connected_components(G))

Output:

    [{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'}, {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'}, {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'}, {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
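The same grouping can be sketched in plain Python with a union-find structure, for cases where networkx is not available. This is a minimal sketch and not how networkx implements it; the function name connected_groups and the sample pairs are made up for illustration, with each pair standing in for one row of the v1/v2 columns.

```python
def connected_groups(pairs):
    """Group values that are linked, directly or indirectly, by any pair."""
    parent = {}

    def find(x):
        # Walk up to the root of x's group, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    # Collect every value under the root of its group.
    groups = {}
    for value in parent:
        groups.setdefault(find(value), set()).add(value)
    return list(groups.values())


# Hypothetical sample rows (v1, v2); not the asker's actual data.
pairs = [("be", "belong"), ("delay", "increase"), ("increase", "decrease")]
groups = connected_groups(pairs)
```

With a DataFrame, the pairs could be taken from the two columns, for example with zip(df['v1'], df['v2']).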

Spark spark-submit --jars arguments wants comma list, how to declare a directory of jars?

This way worked easily, instead of specifying each jar with its version separately:

    #!/bin/sh
    # build all other dependent jars in OTHER_JARS
    JARS=`find ../lib -name '*.jar'`
    OTHER_JARS=""
    for eachjarinlib in $JARS ; do
        if [ "$eachjarinlib" != "APPLICATIONJARTOBEADDEDSEPERATELY.JAR" ]; then
            OTHER_JARS=$eachjarinlib,$OTHER_JARS
        fi
    done
    echo "---final list of jars are : $OTHER_JARS"
    echo $CLASSPATH
    spark-submit
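The same comma-separated list can be built outside the shell. The following is a minimal Python sketch under stated assumptions: the function name jars_argument and its parameters are hypothetical, and it excludes the application jar by file name, which is a slight change from the script above, where the comparison is made against the full path printed by find.

```python
from pathlib import Path


def jars_argument(lib_dir, exclude_name=None):
    """Return a comma-separated list of all .jar files under lib_dir."""
    jars = sorted(
        str(path)
        for path in Path(lib_dir).rglob("*.jar")  # searches subdirectories too
        if path.name != exclude_name              # skip the application jar
    )
    return ",".join(jars)
```

The returned string has no trailing comma and can be used as the value of the --jars option.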

Set hadoop system user for client embedded in Java webapp

Finally I stumbled on the constant static final String HADOOP_USER_NAME = "HADOOP_USER_NAME"; in the UserGroupInformation class. Setting this either as an environment variable, as a Java system property on startup (using -D), or programmatically with System.setProperty("HADOOP_USER_NAME", "hduser"); makes Hadoop use whatever username you want for connecting to the remote Hadoop cluster.

MPI: blocking vs non-blocking

Blocking communication is done using MPI_Send() and MPI_Recv(). These functions do not return (i.e., they block) until the communication is finished. Simplifying somewhat, this means that the buffer passed to MPI_Send() can be reused, either because MPI saved it somewhere, or because it has been received by the destination. Similarly, MPI_Recv() returns when the receive

Scaling solutions for MySQL (Replication, Clustering)

I've been doing A LOT of reading on the available options. I also got my hands on High Performance MySQL 2nd edition, which I highly recommend. This is what I've managed to piece together. Clustering: in the general sense, clustering is distributing load across many servers that appear to an outside application as one server.