
Tarik L • 6 years ago

AyyyMD.

Mike • 6 years ago

AMDead

Javi_Charly • 6 years ago

Your bank account is dead too.

Jeb! • 6 years ago

Don't worry guys, the RX 580 is still faster!

Mike • 6 years ago

Yes yes just gotta overclock it a bit, the gains are ayyymazing!

Marcelo Viana • 6 years ago

At least I can buy an RX 580, or even CrossFire two of them to beat a 1080 Ti, but I have to admit I don't have $140K to spend on NVIDIA's new card solution.

xfatalxzero • 6 years ago

Though just like SLI, CrossFire has a lot of issues unless you stick to the games that truly support it. If not, you're either stuck with issues or just using one of your cards.

Tom Wallen • 6 years ago

And a pair of 580s will cost more than a 1080 Ti thanks to the mining craze (Vega will probably suffer the same fate).

Plus, it'll pull 4x the power, put out 10x the heat, and deliver half the performance in games that don't support multi-GPU.

Shahnewaz • 6 years ago

815 mm^2 on a 12 nm process. That is a humongous GPU.

Husky™√ • 6 years ago

That's the reticle limit of TSMC.

MUltan • 6 years ago

For scale, full-frame 35 mm camera sensors are 864 mm^2 (but they're made with much lower-resolution lithography).

I thought multi-chip modules had gotten to the point where it was possible to divide such monster chips into several pieces, with up to thousands of bus lines going through through-silicon vias and then running just a millimeter or two across the carrier between chips. I guess not, since if you could restrict the pieces to less than a couple hundred mm^2, then yields could be something like an order of magnitude higher.
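(For a rough sense of the yield argument, a simple Poisson defect model Y = exp(-A * D0) with an assumed defect density of D0 = 0.1 defects/cm^2 (an illustrative number, not a TSMC figure) gives:

815 mm^2 monolithic die: Y = exp(-8.15 * 0.1) ≈ 0.44
four dies of ~204 mm^2 each: Y = exp(-2.04 * 0.1) ≈ 0.82 per die

Since small dies can be binned as known-good before assembly, the relevant comparison is per-die yield, and the advantage of smaller pieces grows rapidly as D0 increases.)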

TownCalendars • 5 years ago

Does the FFT speedup of around 1.7x also hold up for large FFTs, such as with 2^20 data points?

MUltan • 5 years ago

I think you meant to be replying to somebody else, but from what I have seen in GPGPU papers, generally the bigger the FFT, the bigger the speedup.

Ben Cumming • 6 years ago

The independent thread scheduling in a warp looks very interesting. With this feature, is it possible to have threads in different branches participate in warp intrinsics like __ballot or __shfl?

Olivier Giroux • 6 years ago

Yes. Note that you have to use the new "sync" versions of these built-in functions, which take an additional parameter to specify which threads participate in the operation.
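(For illustration, a minimal sketch of those sync variants; the kernel and names below are made up for this example, not taken from the article:

__global__ void branch_reduce(const float* in, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[tid];

    if (v > 0.0f) {                                     // the warp may diverge here
        unsigned mask = __activemask();                 // lanes converged in this branch
        int n = __popc(__ballot_sync(mask, 1));         // vote among those lanes only
        int leader = __ffs(mask) - 1;                   // lowest participating lane
        float lead_val = __shfl_sync(mask, v, leader);  // shuffle restricted to 'mask'
        if ((tid & 31) == leader)
            out[tid >> 5] = lead_val * n;               // one lane writes a per-warp result
    }
}

On Volta you can also call __syncwarp(mask) to explicitly re-converge a set of lanes before such an exchange.)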

Ahmed ElTantawy • 6 years ago

I assume __syncwarp() is also heavily inserted by the compiler whenever it is safe? ... but does this give up reconvergence of sub-warps under nested divergence, or does the compiler still enforce it in an implicit way (when safe)?

Olivier Giroux • 6 years ago

The compiler uses a different set of instructions for convergence optimizations. You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part.

Ahmed ElTantawy • 6 years ago

So it converges (roughly?) at the IPDOM unless synchronization is detected, in which case it converges at the safest reconvergence point the compiler can detect?

If so, then "You should expect the same convergence as Pascal (for code that both architectures can run) at no additional effort on your part" sounds like the compiler can never "falsely" detect synchronization, which does not sound realistic?

Marcelo Viana • 6 years ago

If the intention is to use 8 GV100s in a DGX-1, why 6 NVLinks? Shouldn't it be 7 NVLinks, or am I missing something?

MUltan • 6 years ago

"Each Tensor Core performs 64 floating point FMA mixed-precision
operations per clock (FP16 input multiply with full-precision
product and FP32 accumulate, as Figure 8 shows) and 8 Tensor Cores in an
SM perform a total of 1024 floating point operations per clock."

How many 4 x 4 matrix-matrix multiplications is that per clock?
I think it's 16 or 64 FMAs per matrix multiplication, so either 4 or 1 MMMs per clock, but the article doesn't say.
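(A back-of-the-envelope check, assuming a 4x4 multiply-accumulate D = A*B + C is counted as 4 x 4 x 4 = 64 multiply-adds:

64 FMAs per 4x4 MMM, at 64 FMAs per Tensor Core per clock, gives 1 MMM per Tensor Core per clock, or 8 per SM per clock.
8 Tensor Cores x 64 FMA x 2 FLOP/FMA = 1024 FLOP per SM per clock, which matches the figure quoted above.)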

The fp16 format is dismayingly approximate, with an infinity that starts just above 65,504, and a smallest representable value above 1 that is only a bit less than 1.001. Resolution between 0 and 1 is less cramped, though (~13-14 bits, I think), and is often all the application requires, especially when dealing with probabilities and data normalized to the [0, 1] or [-1, 1] range.

The Tensor Cores look like they might be useful for doing 4D Geometric Algebra (GA) / Clifford algebra calculations, which would be extremely cool, since GA is the best way to do the math representing physics, whether classical mechanics, EM, QM, SR or GR. There are too many advantages to list here, but I'll point out that Geomerics (the British company that brought real-time radiosity lighting to games, bought by ARM) was the work of some of the world's top GA physicists, particularly Cambridge's Chris Doran.

There are only 2 Clifford algebras that can be represented with real-valued 4 x 4 matrices, Cl(3,1) (signature (+++-)) and Cl(2,2) (signature (++--)). The other 4D signatures require 2 x 2 matrices of quaternions. The 2D + 2 Conformal Geometric Algebra (CGA) has a (+++-) Minkowski signature that can also be used for relativistic EM (though the (+---) "space-time algebra" is more common for that use).

The 2D + 2 CGA represents 2D lines, circles, and points as points in a 4D space. It's like extending first to homogeneous coordinates, as in conventional graphics: the extra dimension allows constructing subspaces (e.g. lines) that don't pass through the origin of the 2D plane. In CGA that extra homogeneous dimension is called "origin" for that reason. In addition, CGA adds another extra dimension called "infinity", which allows representing points, circles and lines (and planes and spheres in 3D + 2 CGA) as unified entities: a point has zero infinity component, a circle has some, and a line is a circle passing through the "point at infinity", i.e. a circle with infinite radius. Taking the outer product of any 3 points gives the circle passing through those points. An easy "dualization" converts the circle to a representation as a center point and a radius. There are other useful primitives in CGA as well, such as point-pairs (0D spheres). All sorts of geometric operations such as unions and intersections are much easier in CGA.

Obviously the 3D + 2 CGA is more useful for 3D graphics, but it is a 5D algebra that doesn't fit in the Tensor Cores (Cl(4,1) needs 4x4 complex matrices). It would be useful to find out if there are practical ways to make GA calculations on GPUs easier and faster, because that would make physics simulations in general much easier to program. GA gives a single, unified representation to areas that are now a vast collection of ad-hoc hacks that often don't work well together. Chris Doran would be the person to talk to about what would make GPUs better for GA and physics simulation in general.

Jack Smith • 6 years ago

These are not going to be competitive with the TPUs, which have 65,536 8-bit MACs, and 8 bits is ideal.

Martin Bernreuther • 6 years ago

Tesla M40 Peak FP64?

Table 1 above:
TFLOP/s: 2.1 (about 2100 GFLOPs)

https://images.nvidia.com/c...
page 11 Table 1:
GFLOPs: 210

Where does the factor 10 come from?

96 FP64 cores * 1114 MHz ≈ 107 G instructions/s;
at 2 FLOP/cycle (i.e. a single FMA) that would be about 210 GFLOP/s
also cmp. https://en.wikipedia.org/wi...

Mark Harris • 6 years ago

Martin, thanks for catching this typo! It is indeed 0.21 TFLOP/s.

KaiGai Kohei • 6 years ago

MPS (Multi-Process Service) has a few restrictions. One of the most mysterious ones is the lack of support for dynamic parallelism. Is it still prohibited on the Volta generation?

Mark Harris • 6 years ago

Not supported in CUDA 9 due to schedule constraints, but should be supportable on Volta MPS in the future.

Robert Miles • 6 years ago

Will the Volta architecture be extended to GPUs for graphics boards in addition to GPUs for data center accelerators?

Bulat Ziganshin • 6 years ago

It seems that each FP32/INT32 instruction scheduled occupies a sub-SM for two cycles, so on the next cycle the other type of instruction can be issued; that's pretty similar to LD/ST instructions on all NVIDIA GPUs, as well as scheduling on SM 1.x GPUs.

So the new architecture allows FP32 instructions to run at full rate, and uses the remaining 50% of issue slots to execute all other types of instructions: INT32 for index/loop calculations, load/store, branches, SFU, FP64, and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't require packing pairs of co-issued instructions into the same thread; each cycle can issue instructions from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both units to 100%.

Is my understanding correct?

Olivier Giroux • 6 years ago

That is correct.
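(A toy sketch of that instruction mix; the kernel below is made up for illustration, not from the article. Even warps run a long FP32 FMA chain while odd warps run a long INT32 chain, and the point above is that the scheduler can alternate issue between the two datapaths cycle by cycle without either thread having to interleave both instruction types itself:

__global__ void mixed_pipes(float* f, int* i, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if ((tid >> 5) & 1) {                   // odd warps: INT32-heavy
        int x = i[tid];
        for (int k = 0; k < 256; ++k)
            x = x * 3 + 7;                  // long dependent INT32 chain
        i[tid] = x;
    } else {                                // even warps: FP32-heavy
        float x = f[tid];
        for (int k = 0; k < 256; ++k)
            x = fmaf(x, 1.0001f, 0.5f);     // long dependent FP32 FMA chain
        f[tid] = x;
    }
}

With enough warps resident, both kinds of work can keep their respective units busy at the same time.)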

Mrugesh • 6 years ago

Will the Tensor Core intrinsics be able to work on arbitrary 4x4 submatrices (of any bigger matrix), or do they have to be contiguous in memory?
As in, can I just specify the coordinates of the A, B, C and D submatrices within a bigger matrix and the Tensor Core will work on those directly?
If yes, what stopped NVIDIA from providing a full FP32 4x4 matrix multiplication core?

Mark Harris • 6 years ago

The tensor core API will initially provide 3 warp-level operations: 1) load a "fragment" of matrix data from memory into registers, 2) perform a warp-cooperative matrix-matrix multiply on the input fragments in the registers of a warp, and 3) store a matrix "fragment" from registers into memory. The initial API will operate on 16x16 matrix fragments. With these operations, and within their limitations, you can indeed work on arbitrary submatrices of any bigger matrix. My talk "CUDA 9 and Beyond" at GTC discussed this API, and it will be published soon. Full details will be available when we release CUDA 9.
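(For a concrete sketch of that three-step flow, here is roughly what it looks like with the nvcuda::wmma names that CUDA 9's mma.h exposes; treat the exact names and the 16x16x16 shape as illustrative rather than final documentation:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A*B + C, FP16 inputs, FP32 accumulate.
// Requires compute capability 7.0 (compile with -arch=sm_70).
__global__ void wmma_tile(const half* a, const half* b, const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                         // 1) load fragments
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // 2) warp-cooperative multiply
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major); // 3) store the result
}

All 32 threads of the warp must execute these calls together, since each fragment is distributed across the warp's registers.)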

Song Han • 6 years ago

Is the 2.4x faster ResNet-50 Training using Tensor Core or not?

Mark Harris • 6 years ago

Yes, Tensor Cores were used. These are preliminary results (pre-release hardware and software).

John Dillon • 6 years ago

Can you point me to some information on your 48V solutions?

Mark Harris • 6 years ago

Can you clarify what you are asking for?

John Dillon • 6 years ago

Hi Mark, sorry, I should have been more clear.

Google is spearheading a new 48-volt architecture in data centers. They have proposed a 48V rack to the Open Compute Project http://www.datacenterknowle.... The architecture allows a 48V-to-1V conversion for a GPU in a single step, thereby skipping the classic 12V intermediate bus.

It is my understanding that Nvidia has a board with a Volta GPU that will take a 48V input and convert the voltage to approximately 1V. I was looking for any information on your solution that allows for a 48V input to your GPU.

Thank you!

John Dillon • 6 years ago

Hi Mark,

I want to check back with you and see if I provided enough information.

Thank you!

John

PiotrLenarczyk • 6 years ago

Cool stuff. The P100 was the world's first influential, commercially available processor with efficient double-precision computation (the Intel Xeon Phi price is a joke: the same as the P100 price). The majority of undemanding GPU usage will fit on a brand-new GT 730 4GB GDDR5, which costs approx. $50, in a personal computer (mainly accelerated via a pendrive live session) obtained for free with high probability. P.S. Has anybody seen the letter "E" in text documents? P.P.S. There will still be a lot of people arguing that their C# written in C++ running on AMD or Intel is still faster and cooler than a median GPU CUDA C programming example. P.P.P.S. It could efficiently compute FEM problems, in theory.

Arne Kreutzmann • 6 years ago

Does the FFT speedup of around 1.7x also hold up for large FFTs, such as with 2^20 data points?

yi • 6 years ago

Can FP64, FP32, INT and Tensor Cores compute at the same time?

Mark Harris • 6 years ago

Either FP kind and INT can co-issue. FP and INT can also co-issue with memory instructions (this was true on Pascal). Tensor Core instructions can only co-issue with instructions that have zero operands (e.g. non-ALU instructions such as branches, predicate boolean logic, and others).

yi • 6 years ago

And how many 4x4 matrix multiplies does a Tensor Core need to reach its peak performance?

Maciej Szadkowski • 5 years ago

What is the throttling temperature for V100 chips, and what performance drop should we expect above that threshold?

Javier • 5 years ago

OpenCL productivity is so low; will we get multithreaded C++ on the GPU?