large-data
Append lines to a file
Have you tried using the write function?

line <- "blah text blah blah etc etc"
write(line, file = "myfile.txt", append = TRUE)
Memory-constrained external sorting of strings, with duplicates combined & counted, on a critical server (billions of filenames)
I don't know whether external sorting with count-merging of duplicates has been studied; I did find a 1983 paper (see below). Sorting algorithms are usually designed and studied under the assumption of sorting objects by keys, so duplicate keys can belong to different objects. There may be some existing literature on this, but it's a very interesting problem. Probably … Read more
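Whatever the literature says, the basic shape of such an algorithm is standard: sort fixed-size runs in memory, spill them to disk, then k-way merge while collapsing equal keys into counts. A minimal Python sketch under those assumptions (the input file name, run size, and "count<TAB>name" output format are illustrative choices, not from the answer):

import heapq
import itertools
import os
import tempfile

def sorted_runs(lines, run_size=1_000_000):
    """Split the input into sorted runs, each written to a temp file."""
    paths = []
    while True:
        run = list(itertools.islice(lines, run_size))
        if not run:
            break
        run.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(run)
        paths.append(path)
    return paths

def merge_with_counts(paths, out_path):
    """K-way merge of the runs, writing 'count<TAB>line' per distinct line."""
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            merged = heapq.merge(*files)
            for line, group in itertools.groupby(merged):
                out.write("%d\t%s" % (sum(1 for _ in group), line))
    finally:
        for f in files:
            f.close()

# Usage (cleanup of the temporary run files is omitted for brevity):
with open("filenames.txt") as f:    # hypothetical input file
    runs = sorted_runs(f)
merge_with_counts(runs, "counted.txt")

A real implementation would also collapse duplicates while writing each run, which shrinks the runs before the merge even begins.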
How to plot with a png as background? [duplicate]
Try this:

library(png)
# Replace the directory and file information with your info
ima <- readPNG("C:\\Documents and Settings\\Bill\\Data\\R\\Data\\Images\\sun.png")
# Set up the plot area
plot(1:2, type = "n", main = "Plotting Over an Image", xlab = "x", ylab = "y")
# Get the plot information so the image will fill the plot box, and draw it
lim <- par()
rasterImage(ima, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
grid()
lines(c(1, … Read more
Parallel.ForEach can cause an "Out Of Memory" exception if working with an enumerable with a large object
The default options for Parallel.ForEach only work well when the task is CPU-bound and scales linearly. When the task is CPU-bound, everything works perfectly: if you have a quad-core and no other processes running, Parallel.ForEach uses all four processors; if you have a quad-core and some other process on your computer is using one … Read more
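The answer concerns .NET, but the underlying fix it points toward, bounding the degree of parallelism and the number of buffered items rather than letting the library buffer the whole enumerable, is language-agnostic. A purely illustrative sketch of that idea in Python (not the answer's C# code; names and limits are assumptions):

# Bound both the worker count and the number of in-flight items so a huge
# lazy sequence is never fully materialized in memory.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def process(item):
    return len(item)  # placeholder for real work

def bounded_map(items, max_workers=4, max_in_flight=16):
    results = []   # note: result order is not preserved
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for item in items:
            pending.add(pool.submit(process, item))
            if len(pending) >= max_in_flight:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
        results.extend(f.result() for f in pending)
    return results

print(bounded_map("x" * n for n in range(100)))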
Writing large Pandas Dataframes to CSV file in chunks
Solution:

import os

header = True
for chunk in chunks:
    chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                 header=header, columns=['TIME', 'STUFF'], mode='a')
    header = False

Notes: The mode='a' tells pandas to append. We only write a column header on the first chunk.
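For context, a minimal end-to-end sketch of the same pattern, with the chunks produced by pd.read_csv(chunksize=...); the file names and column names here are assumptions, not from the original question:

# Stream a big CSV through pandas in chunks, appending each chunk (selected
# columns only) to a single output file, with the header written once.
import pandas as pd

header = True
for chunk in pd.read_csv("big_input.csv", chunksize=100_000):
    chunk.to_csv("new_file_big_input.csv", columns=["TIME", "STUFF"],
                 header=header, index=False, mode="a")
    header = False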
SELECT COUNT() vs mysql_num_rows();
Use COUNT; internally the server will process the request differently. When doing COUNT, the server will only allocate memory to store the result of the count. When using mysql_num_rows, the server will process the entire result set, allocate memory for all those results, and put itself in fetching mode, which involves a lot of … Read more
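To make the contrast concrete, a small illustrative sketch (written in Python DB-API style rather than the question's PHP; the conn object and the users table are assumptions):

def count_users_fast(conn):
    # The server computes the count and sends back a single row.
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM users WHERE active = 1")
    return cur.fetchone()[0]

def count_users_slow(conn):
    # The server materializes and transfers every matching row, and the
    # client buffers them all, just to count them.
    cur = conn.cursor()
    cur.execute("SELECT * FROM users WHERE active = 1")
    return len(cur.fetchall())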
Read large data from csv file in php [duplicate]
An excellent method for dealing with large files is described at https://stackoverflow.com/a/5249971/797620. This method was used at http://www.cuddlycactus.com/knownpasswords/ (the page has since been taken down) to search through 170+ million passwords in just a few milliseconds.
How to store extremely large numbers?
If you already have a Boost dependency (as many projects do these days), you can use the Boost.Multiprecision library. In fact, it already includes an example factorial program that can support output up to 128 bits, and extending it further is pretty trivial.
How to read only lines that fulfil a condition from a csv into R?
You could use the read.csv.sql function in the sqldf package and filter using a SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where `Sepal.Length` > 5",
                      eol = "\n")