OpenChess

Posted: **Fri Aug 28, 2015 6:04 pm**

Hi,

I am currently rewriting my engine to use bitboards and to attempt to use parallel search. As mentioned in a previous post, I was trying to improve the speeds of my move generator a while ago. I made some changes and I'm happy with my Perft speeds now. I also decided to switch to magic bitboards.

I have never written multi-threaded code before (I'm using C++), so it's a bit of a steep learning curve. So my question may sound a little confused, or perhaps not even make sense. I am presently trying to teach my self about thread pooling and stuff, but in the meantime I can't quite get my head around how the algorithm works.

My understanding is that it is better to only create as many threads as there are cores, as creating threads creates a lot of additional overhead, so one should implement a thread pool/scheduler type arrangement to create x number of worker threads, and stack jobs onto a queue for the worker threads to get when they become idle. Is that correct?

Then, for arguments sake say you have two worker threads, and at a certain ply you put all the siblings of the first move onto the queue, and the worker threads set going on the first two siblings. Then at ply+1, the first moves of those two nodes are searched; say they don't produce a cut-off, so the siblings are all put onto the queue. I can't see how they can complete, because the worker threads are waiting for the results of ply+1, but there are no other worker threads free to calculate the siblings nodes at ply+1 that it is waiting for. Hope that made sense.

Also, in anticipation that I wouldn't be able to pass my game-state class around by reference anymore since different threads will be changing it simultaneously, I have made sure that I can copy it. However, that also seems like an awful lot of overhead -- is simply copying the game-state for each new parallel search the correct approach?

I'd like to just get something that works first before really trying to optimize it and such.

Thanks in advance for any advice.

Posted: **Fri Aug 28, 2015 8:56 pm**

ppyvabw wrote:Hi,

I am currently rewriting my engine to use bitboards and to attempt to use parallel search. As mentioned in a previous post, I was trying to improve the speeds of my move generator a while ago. I made some changes and I'm happy with my Perft speeds now. I also decided to switch to magic bitboards.

I have never written multi-threaded code before (I'm using C++), so it's a bit of a steep learning curve. So my question may sound a little confused, or perhaps not even make sense. I am presently trying to teach my self about thread pooling and stuff, but in the meantime I can't quite get my head around how the algorithm works.

My understanding is that it is better to only create as many threads as there are cores, as creating threads creates a lot of additional overhead, so one should implement a thread pool/scheduler type arrangement to create x number of worker threads, and stack jobs onto a queue for the worker threads to get when they become idle. Is that correct?

Then, for arguments sake say you have two worker threads, and at a certain ply you put all the siblings of the first move onto the queue, and the worker threads set going on the first two siblings. Then at ply+1, the first moves of those two nodes are searched; say they don't produce a cut-off, so the siblings are all put onto the queue. I can't see how they can complete, because the worker threads are waiting for the results of ply+1, but there are no other worker threads free to calculate the siblings nodes at ply+1 that it is waiting for. Hope that made sense.

Also, in anticipation that I wouldn't be able to pass my game-state class around by reference anymore since different threads will be changing it simultaneously, I have made sure that I can copy it. However, that also seems like an awful lot of overhead -- is simply copying the game-state for each new parallel search the correct approach?

I'd like to just get something that works first before really trying to optimize it and such.

You have to spend some time thinking about the issue you raise regarding deadlock. It is easily solvable, but it is doubtful anyone can explain it to you until you really understand what is going on. Then it will become obvious and easy to deal with.

Thanks in advance for any advice.

Obviously each thread has to have its own game-state. No way around that. All you can do is try to minimize the amount of data in this state to keep the cost of splitting the tree under control. Less is better.

In general, max threads = max real cores (not counting hyper-thread cores, just physical cores) is the best setting.

Posted: **Fri Aug 28, 2015 9:11 pm**

Thanks,

I still don't quite understand my other problem though. Presumably the scheduler must somehow know when a job in the queue is waiting for specific parallel searches to finish and put it to sleep to release worker threads until a value is returned? I'm very confused

Posted: **Fri Aug 28, 2015 9:38 pm**

I think my problem is that I am misunderstanding the algorithm.

For the parallel searches, should it call Jamboree again, or the serial algorithm -- i.e. negascout in my engine.

Posted: **Sun Aug 30, 2015 4:16 am**

ppyvabw wrote:I think my problem is that I am misunderstanding the algorithm.

For the parallel searches, should it call Jamboree again, or the serial algorithm -- i.e. negascout in my engine.

I am not sure what algorithm you are looking at. But if you have N threads, and N cores, where/when would a thread "wait" except when waiting for work to do?

Posted: **Tue Sep 01, 2015 12:30 am**

hyatt wrote:
ppyvabw wrote:I think my problem is that I am misunderstanding the algorithm.

For the parallel searches, should it call Jamboree again, or the serial algorithm -- i.e. negascout in my engine.
I am not sure what algorithm you are looking at. But if you have N threads, and N cores, where/when would a thread "wait" except when waiting for work to do?

Hi,

Part of your reply to my original post was inside your quote and I missed it until this morning: sorry for that. I think I understand the point you make about deadlock, having researched it a bit. I'm not certain that's what's at the root of my confusion though.

This is the algorithm I am trying to get my head around: https://chessprogramming.wikispaces.com/Jamboree.

I think (at least part) of my confusion stems from that particular snippet of code (the same algorithm is also in 'Massively parallel chess', Joerg, C. F. and Kuszmauf, B. C.). My understanding is that that Jamboree algorithm is a parellelized version of the Negascout algorithm. In the above link and in the paper, for the parallel searches it calls Jamboree again. I expect I am wrong, but it only makes sense to me if for the parallel searches it calls Negascout again; other wise it's just piling a load of parallel jobs onto the queue without resolving any -- as far as I can see with my limited knowledge. Isn't the point that the search tree should be something like the diagram here: https://chessprogramming.wikispaces.com/Parallel+Search, where the bits enclosed in dashed boxes are done in serial? I know that's PVS, but I guess the PV search and Negascout are roughly the same algorithm?

Posted: **Tue Sep 01, 2015 2:28 am**

Jamboree() is a replacement for alpha/beta, so yes it is called recursively. It does have an ugly synchronization point at the end of a "jamboree loop"...

the basic idea today is you search the first move, then if you have idle processors you call a parallel search to search the rest of the moves in parallel, then when that returns you finish and clean up that search ply. This can be done recursively as processors go idle, so you can split deeper and deeper in the tree.

OpenChess

Jamboree algorithm

Jamboree algorithm

Re: Jamboree algorithm

Re: Jamboree algorithm

Re: Jamboree algorithm

Re: Jamboree algorithm

Re: Jamboree algorithm

Re: Jamboree algorithm