BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20250626T234542Z
LOCATION:B312-B313A
DTSTART;TZID=America/New_York:20241120T140000
DTEND;TZID=America/New_York:20241120T143000
UID:submissions.supercomputing.org_SC24_sess370_pap369@linklings.com
SUMMARY:LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor 
 Cores
DESCRIPTION:Yiwei Zhang (University of the Chinese Academy of Sciences, Mi
 crosoft Research); Kun Li (Microsoft Corporation); Liang Yuan (CAS); Jiawe
 n Cheng (Tsinghua University, China); Yunquan Zhang (CAS); and Ting Cao an
 d Mao Yang (Microsoft Corporation)\n\nStencil computations play a pivotal 
 role in numerous scientific and industrial applications, yet their efficie
 nt execution on specialized hardware accelerators like Tensor Core Units (
 TCUs) remains a challenge. This paper introduces LoRAStencil, a novel ste
 ncil computing system designed to mitigate memory access redundancies on T
 CUs through low-rank adaptation. We first identify a nuanced form of this 
 redundancy, dimension residue, specific to TCUs. Then LoRAStencil leverage
 s orchestrated mathematical transformations to decompose stencil weight ma
 trices into smaller rank-1 matrices, facilitating efficient data gathering
  along residual dimensions. It comprises three key components: memory-effi
 cient Residual Dimension Gathering to facilitate more data reuse, compute-
 saving Pyramidal Matrix Adaptation to exploit the inherent low-rank charac
 teristics, and performance-boosting Butterfly Vector Swapping to circumven
 t all data shuffles. Comprehensive evaluations demonstrate that LoRAStenci
 l addresses dimension residues effectively and outperforms state-of-the-ar
 t approaches with up to a 2.16x speedup, offering promising advancements f
 or efficient tensorized stencil computation on TCUs through low-rank adapt
 ation.\n\nTag:
  Accelerators, Algorithms, Data Compression, Linear Algebra, Tensors\n\nRe
 gistration Category: Tech Program Reg Pass\n\nSession Chair: Rio Yokota (I
 nstitute of Science Tokyo)\n\n
END:VEVENT
END:VCALENDAR
