[Campaign-news] Memory limits

Mon Feb 1 19:04:32 PST 2010

I think it's reasonable to just start with approach 1). In general, we might
think about some kind of locality-sensitive approach to the clustering. It's
fairly well-studied territory in high-performance computing, and there should
be work related specifically to data clustering.

Bill

Quoting Kai Kohlhoff <kohlhoff at stanford.edu>:

> There was another thought I had.  Since memory is limited, there is   
> only so much data that we can have available in one go.  Since most   
> clustering algorithms (that we implement anyway) require several   
> passes over the data, it would be inefficient to have to transfer   
> data between host and device memory once the device memory has   
> filled up.  Do we:
>
> 1) restrict the amount of data that a given algorithm can handle   
> based on the size of GPU global memory (and how? abort with error   
> message if data too large?)
> or
> 2) go through the pain of finding efficient variants of the
> clustering algorithms that require a minimal number of memory
> transfers
>
> I would think that 1) is the better solution for now, but should we   
> later do 2), or leave it to others to contribute their own   
> algorithms for larger data sets?  Let me know what you think.
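
A minimal sketch of option 1), assuming the working set is the data, the cluster centers, and one distance matrix, all stored as float. The names requiredBytes and checkFitsOnDevice are hypothetical and not part of any existing Campaign code. The free-memory figure is passed in as a parameter so that the check itself runs without a GPU; in a CUDA build it would typically come from cudaMemGetInfo.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: bytes needed to keep the data set (n x d), the
// cluster centers (k x d), and a distance matrix (n x k) resident on the
// device at the same time, assuming every value is a float.
std::size_t requiredBytes(std::size_t numPoints, std::size_t numDims,
                          std::size_t numClusters) {
    assert(numDims > 0 && numClusters > 0);
    return sizeof(float) * (numPoints * numDims +
                            numClusters * numDims +
                            numPoints * numClusters);
}

// Hypothetical check for option 1): refuse to run if the working set does
// not fit. In a CUDA build, freeBytes would be obtained beforehand with
// cudaMemGetInfo(&freeBytes, &totalBytes).
void checkFitsOnDevice(std::size_t numPoints, std::size_t numDims,
                       std::size_t numClusters, std::size_t freeBytes) {
    const std::size_t needed = requiredBytes(numPoints, numDims, numClusters);
    if (needed > freeBytes) {
        throw std::runtime_error(
            "data set too large for GPU global memory: need " +
            std::to_string(needed) + " bytes, " +
            std::to_string(freeBytes) + " available");
    }
}
```

Throwing an exception rather than calling exit leaves it to the caller whether to abort with an error message or to fall back to a smaller data set.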
>
> Thanks,
> Kai
>
>
>
> On Jan 27, 2010, at 1:47 PM, Kai Kohlhoff wrote:
>
>> Hi Marc,
>>
>> Yes, it was a good evening, you guys are a pleasant crowd!
>>
>> I agree with your next steps, and I will see that I stick to the   
>> format that you have already created.  I have been trying to   
>> simplify the code that we have and am really eager to put it into   
>> the repository.  There is still another project that I have to work
>> on until tomorrow, but then I'll get to it.
>>
>> I was thinking that we should pull the distance kernels out of the
>> current clustering code.  For proper modularity, these should be
>> called separately in each iteration, and a distance matrix should be
>> provided to the clustering kernel in each iteration.  Also, the I/O
>> could be put into separate subroutines.  It might be useful if
>> ultimately a user could simply write C/C++ code and the GPU
>> functionality would be hidden.
>>
>> Something like:
>>
>>
>> #include "campaign.h"
>>
>> // checks which, if any, GPU is present
>> campaign.checkPlatform();
>> // read data
>> data = campaign.readData(file, format);
>> // use a selected method to preprocess data
>> data = campaign.preprocess(data, method);
>> // extracts number of data points and dimensionality, copies data to GPU
>> clusters.init(data);
>> // N iterations, data is kept on GPU between kernel calls;
>> // alternatively use a convergence criterion
>> for (int i = 0; i < N; ++i)
>> {
>> 	// metricType = e.g. "manhattan", "euclidean"
>> 	distance = campaign.calcDists(data, clusters, metricType);
>> 	// one iteration of algorithmType = e.g. "kcenters", "kmeans", "birch"
>> 	clusters = campaign.iterate(data, clusters, distance, algorithmType);
>> }
>> // output clustering results
>> campaign.printResults(clusters, format);
>>
>>
>> would be great to have.  If you like the idea, maybe we should   
>> start thinking about how we could get there.  I am not sure this   
>> could make it into our '0.5' version that Russ mentioned, but we   
>> could talk about this.
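
As an illustration of the proposed split between a distance step and a clustering step, here is a CPU-only sketch, assuming k-means with squared Euclidean distance. The Matrix type and the functions calcDists, iterate, and cluster are hypothetical stand-ins that mirror the names in the outline above; they are not an existing Campaign API, and a GPU version would replace the loops with kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix of floats.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> values;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), values(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return values[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return values[r * cols + c]; }
};

// Distance step: squared Euclidean distance from every point to every center.
Matrix calcDists(const Matrix& data, const Matrix& centers) {
    assert(data.cols == centers.cols);
    Matrix dist(data.rows, centers.rows);
    for (std::size_t i = 0; i < data.rows; ++i) {
        for (std::size_t j = 0; j < centers.rows; ++j) {
            float sum = 0.0f;
            for (std::size_t d = 0; d < data.cols; ++d) {
                const float diff = data.at(i, d) - centers.at(j, d);
                sum += diff * diff;
            }
            dist.at(i, j) = sum;
        }
    }
    return dist;
}

// Clustering step: one k-means update. Point-to-cluster assignment reads
// only the precomputed distance matrix, so the metric can be swapped freely.
Matrix iterate(const Matrix& data, const Matrix& centers, const Matrix& dist) {
    Matrix next(centers.rows, centers.cols);
    std::vector<std::size_t> counts(centers.rows, 0);
    for (std::size_t i = 0; i < data.rows; ++i) {
        std::size_t best = 0;
        for (std::size_t j = 1; j < centers.rows; ++j) {
            if (dist.at(i, j) < dist.at(i, best)) best = j;
        }
        ++counts[best];
        for (std::size_t d = 0; d < data.cols; ++d) {
            next.at(best, d) += data.at(i, d);
        }
    }
    for (std::size_t j = 0; j < centers.rows; ++j) {
        for (std::size_t d = 0; d < centers.cols; ++d) {
            // Keep the old center if no point was assigned to this cluster.
            next.at(j, d) = counts[j] > 0
                ? next.at(j, d) / static_cast<float>(counts[j])
                : centers.at(j, d);
        }
    }
    return next;
}

// Driver loop: a fixed number of iterations, each with a separate distance
// call followed by a clustering call.
Matrix cluster(const Matrix& data, Matrix centers, int numIterations) {
    for (int it = 0; it < numIterations; ++it) {
        const Matrix dist = calcDists(data, centers);
        centers = iterate(data, centers, dist);
    }
    return centers;
}
```

Because iterate never touches the metric, a Manhattan variant would only need a different calcDists; the cost is that the full n x k distance matrix has to be stored in each iteration.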
>>
>> It makes sense to have something out ASAP.  It will be fun to
>> increase the speed of our clustering code in subsequent iterations,
>> but we should start getting people to use it.  I'll try to deposit
>> the modules that I have at the end of the week.
>>
>> Bill, I am looking forward to hearing about your profiling work
>> during our next meeting.  Your findings will surely help me write
>> more efficient code right from the outset.
>>
>> When should we have our next meeting?  Given that it has been a
>> while since the last one, I suggest not having it more than three
>> weeks from now.  How does February 19 sound to you?
>>
>> Cheers,
>> Kai
>>
>>
>>
>>
>> On Jan 26, 2010, at 10:21 AM, Marc Sosnick wrote:
>>
>>> Kai:
>>>
>>> It was great seeing you last night.  Thanks for helping me
>>> round out the presentation at the meeting.  Sorry we didn't have
>>> more time to talk about our next steps during dinner, but it was   
>>> quite convivial!
>>>
>>> As we discussed, I had ideas as to what my next steps should be,   
>>> and I just want to get your and Bill's agreement before I start.    
>>> These are in priority order:
>>>
>>> 1) Now that we have a smoke test against which to test, take the
>>> current code and refactor each clustering method into a proper C++
>>> class, with a .cpp and .cu file.  This would also include
>>> scrubbing the current code of comments and optimizing code (not   
>>> including optimizing memory handling) as if we were presenting it   
>>> to the outside world.  This would significantly help us work   
>>> toward our first release as per Russ' comments last night.
>>> 2) Take any new clustering algorithms you  have and put them into   
>>> the format that we've created up to now and as in (1).
>>> 3) Optimize memory handling and data structures.  This would be   
>>> done in tandem with Bill's profiling work.
>>>
>>> Let me know about those algorithms you have.  Don't worry about   
>>> putting anything in the repository, we can always reorganize the   
>>> repository as we see fit, so just go ahead.  Probably the best way
>>> would just be to create a subdirectory off trunk/dev, put your
>>> work in there, and do an svn add directory_name from the parent
>>> directory of directory_name.
>>>
>>> Again, many thanks!
>>>
>>> Marc
>>> _______________________________________________
>>> Campaign-news mailing list
>>> Campaign-news at simtk.org
>>> https://simtk.org/mailman/listinfo/campaign-news
>>
>> -----------------------------------------------------
>> Kai Kohlhoff, PhD
>> Stanford University
>> School of Medicine, Bioengineering
>> Stanford, CA 94305-5448, USA
>> T: +1 (650) 724 1575
>> E: kohlhoff at stanford.edu