blockchain, programming, tech

Upgrading Ethereum Solidity Smart Contracts

A pretty good blog post with examples: “Upgrading Ethereum Solidity Smart Contracts”, by Francis Odisi

Original post: https://levelup.gitconnected.com/introduction-to-ethereum-smart-contract-upgradability-with-solidity-789cc497c56f

When developing software, we frequently need to release new versions to add functionality or fix bugs. Smart contract development is no different. However, updating a smart contract to a new version is usually not as simple as updating other software of similar complexity.

Most blockchains, especially public ones like Ethereum, are built around immutability, which in theory does not allow anyone to change the blockchain’s “past”. Immutability applies to every transaction in the blockchain, including the transactions used to deploy smart contracts and their associated code. In other words, once a smart contract’s code is deployed to the blockchain, it will “live” forever “AS IS”; no one can change it. If a bug is found or new functionality needs to be added, we cannot replace the code of a deployed contract.

If a smart contract is immutable, how are you able to upgrade it to newer versions? The answer lies in deploying a new smart contract to the blockchain.

But this approach raises a few challenges that need to be addressed. The most basic and common ones are:

  • All users of the smart contract need to reference the address of the new version
  • The first version of the contract should be disabled, forcing every user onto the new version
  • Usually, you need to make sure the data (state) from the old version is migrated or otherwise available to the new version. In the simplest scenario, this means copying the state from the old version into the new one

The sections below describe these challenges in more detail. To illustrate them, we’ll use the two versions of MySmartContract below as a reference:

Version 1

contract MySmartContract {
    uint32 public counter;

    constructor() public {
        counter = 0;
    }

    function incrementCounter() public {
        counter += 2; // This "bug" is intentional.
    }
}

Version 2

contract MySmartContract {
    uint32 public counter;

    constructor(uint32 _counter) public {
        counter = _counter;
    }

    function incrementCounter() public {
        counter++;
    }
}

Users to Reference the New Contract’s Address

When deployed to the blockchain, every instance of the smart contract is assigned a unique address. This address is used to reference the smart contract’s instance in order to invoke its methods and read/write data from/to the contract’s storage (state).

When you deploy an updated version of the contract to the blockchain, the new instance is deployed at a new address, different from the first contract’s address. This means that all users, other smart contracts and/or dApps (decentralized applications) that interact with the smart contract need to be updated to use the address of the new version. Spoiler: there are ways to avoid this issue, which you’ll see at the end of this section.

So, let’s consider the following scenario:

You created MySmartContract using the code of Version 1 above. It is deployed to the blockchain at address A1 (this is not a real Ethereum address – used only for illustration purposes). All users that want to interact with Version 1 need to use the address A1 to reference it.

Now, after a while, you notice the bug in the method incrementCounter: it increments the counter by 2 instead of by 1. A fix is implemented, resulting in Version 2 of MySmartContract. This new version is deployed to the blockchain at address D5. At this point, a user who wants to interact with Version 2 needs to use the address D5, not A1. This is why all users interacting with MySmartContract need to update their reference to the new address D5.

You probably agree that forcing users to update is not the best approach, considering that updating a smart contract’s version should be as transparent as possible to users using it.

There are different strategies that can be used to address this problem. Design patterns such as Registry and the different types of Proxies can make upgrades easier and more transparent to users. Another great option is to use the Ethereum Name Service (ENS) and register a user-friendly name that resolves to your contract’s address. With this option, users of the contract don’t need to know the contract’s address, only its user-friendly name. As a result, moving to a new address is transparent to your contract’s users. The specific strategy to adopt depends on the scenario in which the smart contract will be used.
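The Registry idea can be sketched roughly as follows. This is a minimal illustration rather than code from the original post: the names MySmartContractRegistry, currentVersion and setCurrentVersion are hypothetical, and the sketch assumes Solidity 0.5.x syntax to match the other listings here.

```solidity
pragma solidity ^0.5.0;

// Hypothetical registry: clients look up the live version here instead of
// hard-coding the address of MySmartContract.
contract MySmartContractRegistry {
    address private owner;
    // Address of the version that clients should currently use (A1, later D5).
    address public currentVersion;

    event VersionChanged(address indexed previousVersion, address indexed newVersion);

    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor(address _initialVersion) public {
        owner = msg.sender;
        currentVersion = _initialVersion;
    }

    // Points the registry at a newly deployed version. Owner only.
    function setCurrentVersion(address _newVersion) isOwner public {
        require(_newVersion != address(0), 'New version is the zero address.');
        emit VersionChanged(currentVersion, _newVersion);
        currentVersion = _newVersion;
    }
}
```

In this sketch the registry would be deployed with A1 as its initial version, and the owner would call setCurrentVersion(D5) after deploying Version 2. Only the registry’s address has to stay fixed, at the cost of one extra lookup per interaction and of trusting whoever controls the owner account.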

Disabling Old Versions of the Contract

We learned in the section above that either all users need to update to Version 2’s address (D5), or the contract should implement some mechanism that makes this process transparent to them. Even so, if you’re the owner of the contract, you probably want to ensure that all users use only the most up-to-date version, D5. If a user calls A1, inadvertently or not, you want Version 1 to be deprecated and unavailable.

In such scenarios, you could implement a technique to stop Version 1 of MySmartContract. This technique is a design pattern named Circuit Breaker. It’s also commonly referred to as Pausable Contracts or Emergency Stop.

A Circuit Breaker, in general terms, stops a smart contract’s functionality. It can also enable specific functionality that is available only while the contract is stopped. This pattern commonly implements some sort of access restriction, so only authorized actors (like an admin or owner) have permission to trigger the Circuit Breaker and stop the contract.

Some scenarios where this pattern can be used are:

  • Stopping a contract’s functionality when a bug is found
  • Stopping some of a contract’s functionality after a certain state is reached (frequently used together with a State Machine pattern)
  • Stopping a contract’s functionality during upgrades, so external actors cannot change the contract’s state while the upgrade is in progress
  • Stopping a deprecated version of a contract after a new version is deployed

Now let’s see how you could implement a Circuit Breaker to stop the incrementCounter function of MySmartContract, so counter doesn’t change during the migration. This modification would need to be in place in Version 1 when it is first deployed.

// Version 1 implementing a Circuit Breaker with access restriction to owner
contract MySmartContract {
    uint32 public counter;
    bool private stopped = false;
    address private owner;

    /**
    @dev Checks if the contract is not stopped; reverts if it is.
    */
    modifier isNotStopped {
        require(!stopped, 'Contract is stopped.');
        _;
    }

    /**
    @dev Enforces the caller to be the contract's owner.
    */
    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor() public {
        counter = 0;
        // Sets the contract's owner as the address that deployed the contract.
        owner = msg.sender;
    }

    /**
    @notice Increments the contract's counter if contract is active.
    @dev It will revert if the contract is stopped. See modifier "isNotStopped"
    */
    function incrementCounter() isNotStopped public {
        counter += 2; // This is an intentional bug.
    }

    /**
    @dev Stops / Unstops the contract.
    */
    function toggleContractStopped() isOwner public {
        stopped = !stopped;
    }
}

In the code above you can see that Version 1 of MySmartContract now implements a modifier isNotStopped. This modifier will revert the transaction if the contract is stopped. The function incrementCounter was changed to use the modifier isNotStopped, so it will only execute when the contract is NOT stopped.

With this implementation, right before the migration starts, the owner of the contract can invoke the function toggleContractStopped and stop the contract. Note that this function uses the modifier isOwner to restrict access to the contract’s owner.
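A Circuit Breaker can also enable functionality that is available only while the contract is stopped. A minimal sketch of that idea, assuming Solidity 0.5.x: the isStopped modifier, the successor variable and the setSuccessor function are hypothetical additions for illustration and are not part of the original listing.

```solidity
pragma solidity ^0.5.0;

contract MySmartContract {
    uint32 public counter;
    bool private stopped = false;
    address private owner;
    // Hypothetical addition: address of the version that replaces this one.
    address public successor;

    modifier isNotStopped {
        require(!stopped, 'Contract is stopped.');
        _;
    }

    // Hypothetical counterpart of isNotStopped: passes only while stopped.
    modifier isStopped {
        require(stopped, 'Contract is not stopped.');
        _;
    }

    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor() public {
        owner = msg.sender;
    }

    function incrementCounter() isNotStopped public {
        counter += 2; // Intentional bug, as in Version 1.
    }

    function toggleContractStopped() isOwner public {
        stopped = !stopped;
    }

    // Can only be called by the owner, and only after the contract is stopped.
    function setSuccessor(address _successor) isOwner isStopped public {
        successor = _successor;
    }
}
```

With something like this in place, a user who still reaches the stopped Version 1 at A1 could read successor to find D5. Whether that is worth the extra storage depends on how users discover the contract in the first place.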

To learn more about Circuit Breakers, make sure you check Consensys’ post about Circuit Breakers and OpenZeppelin’s reference implementation of Pausable contracts.

Contract’s Data (State) Migration

Most smart contracts need to keep some sort of state in their internal storage. The number of state variables required by each contract varies greatly depending on the use case. In our example, the original Version 1 of MySmartContract has a single state variable, counter.

Now consider that Version 1 of MySmartContract has been in use for a while. By the time you find the bug in the incrementCounter function, the value of counter is already 100. This scenario raises some questions:

  • What will you do with the state of MySmartContract Version 2?
  • Can you reset the counter to 0 (zero) in Version 2 or should you migrate the state from Version 1 to make sure counter is initialized with 100 in Version 2?

The answers to these questions depend on the use case. In this article’s example, which is a really simple scenario where counter serves no important purpose, resetting counter to 0 would cause no issues. But this is not the desired approach in most cases.

Let’s suppose you cannot reset the value to 0 and need to set counter to 100 in Version 2. In a contract as simple as MySmartContract this isn’t difficult. You could change the constructor of Version 2 to receive the initial value of counter as a parameter. At deployment, you would pass the value 100 to the constructor, and this would solve your problem.

After implementing this approach, the constructor of MySmartContract Version 2 would look like this:

constructor(uint32 _counter) public {
    counter = _counter;
}

If your use case is as simple as presented above (or similar), this is probably the way to go from a data migration perspective. The complexity of implementing other approaches wouldn’t be worth it. But, bear in mind that most production-ready smart contracts are not as simple as MySmartContract and frequently have a more complex state.
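As a variation on passing the value by hand, the new version could read counter directly from the old contract at deployment time. The sketch below assumes Solidity 0.5.x; the names ICounterSource and MySmartContractV2 are hypothetical and chosen only for this example.

```solidity
pragma solidity ^0.5.0;

// Hypothetical interface matching the public getter that Solidity generates
// for "uint32 public counter" in Version 1.
interface ICounterSource {
    function counter() external view returns (uint32);
}

contract MySmartContractV2 {
    uint32 public counter;

    // _previousVersion is the address of the deployed Version 1 (A1 in the text).
    constructor(address _previousVersion) public {
        // Copies the old state once, at deployment time.
        counter = ICounterSource(_previousVersion).counter();
    }

    function incrementCounter() public {
        counter++;
    }
}
```

This removes the risk of mistyping the value, but it only makes sense if Version 1 can no longer change between the read and the switch-over, for example because it was stopped first.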

Now consider a contract that uses multiple structs, mappings, and arrays. If you need to copy data between contract versions with such complex storage, you will probably face one or more of the challenges below:

  • Many transactions to be processed on the blockchain, which may take a considerable amount of time, depending on the data set
  • Additional code to handle reading data from “Version 1” and writing it to “Version 2” (unless done manually)
  • Real money spent on gas. Remember that you pay gas to process transactions on the blockchain. According to the Ethereum Yellow Paper (Appendix G, Fee Schedule), the SSTORE operation, the opcode used to write data to Ethereum storage, costs 20000 gas units “when the storage value is set to non-zero from zero” and 5000 gas units “when storage value’s zeroness remains unchanged”
  • Freezing Version 1’s state with some mechanism (like a Circuit Breaker) to make sure no more data is appended to Version 1 during the migration
  • Access restriction mechanisms to prevent external parties (not involved in the migration) from invoking functions of Version 2 during the migration. This is required to make sure Version 1’s data can be copied to Version 2 without being compromised or corrupted there
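To get a feel for the gas figure quoted above, here is a rough back-of-the-envelope estimate. The number of storage slots (10,000) and the gas price (20 gwei) are hypothetical values picked for illustration, and the estimate counts only the 20000-gas SSTORE cost for fresh slots, ignoring base transaction costs, calldata and any later changes to the fee schedule.

```latex
\begin{align*}
G_{\text{migrate}} &\approx N_{\text{slots}} \times 20000\ \text{gas} \\
                   &= 10^{4} \times 2 \times 10^{4} = 2 \times 10^{8}\ \text{gas} \\
\text{cost} &\approx G_{\text{migrate}} \times p_{\text{gas}}
             = 2 \times 10^{8} \times 20\ \text{gwei}
             = 4 \times 10^{9}\ \text{gwei} = 4\ \text{ETH}
\end{align*}
```

Under these assumptions the storage writes alone cost about 4 ETH, and they have to be split across many transactions because a single block can hold only a limited amount of gas.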

In contracts with more complex state, the work required to perform an upgrade is quite significant and can incur considerable gas costs to copy data over the blockchain. Using Libraries and Proxies can help you develop smart contracts that are easier to upgrade. With this approach, the data would be kept in a contract that stores the state but bears no logic (state contract). The second contract or library implements the logic, but bears no state (logic contract). So when a bug is found in the logic, you only need to upgrade the logic contract, without worrying about migrating the state stored in the state contract (see Note below).

Note: This approach generally uses Delegatecall. The state contract invokes the functions in the logic contract using delegatecall. The logic contract then executes its logic in the context of state contract, which means that “storage, current address and balance still refer to the calling contract, only the code is taken from the called address.” (from Solidity docs referenced above).
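The split between a state contract and a logic contract can be sketched as follows. This is a deliberately small illustration under stated assumptions, not a production proxy: the names CounterState, CounterLogicV2 and upgradeTo are hypothetical, Solidity 0.5.x syntax is assumed, and the sketch relies on counter being the first declared variable in both contracts so that delegatecall reads and writes the same storage position.

```solidity
pragma solidity ^0.5.0;

// Logic contract: supplies code only. When reached through delegatecall,
// its own storage is not touched.
contract CounterLogicV2 {
    uint32 public counter; // Must mirror the position of counter in CounterState.

    function incrementCounter() public {
        counter++;
    }
}

// State contract: holds the data and forwards the call to the current logic.
contract CounterState {
    uint32 public counter; // Declared first, same as in the logic contract.
    address private owner;
    address public logic;  // Address of the logic contract currently in use.

    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor(address _logic) public {
        owner = msg.sender;
        logic = _logic;
    }

    // Swaps the logic contract; the stored counter is left as it is.
    function upgradeTo(address _newLogic) isOwner public {
        logic = _newLogic;
    }

    // Runs the logic contract's code against this contract's storage.
    function incrementCounter() public {
        (bool success, ) = logic.delegatecall(
            abi.encodeWithSignature("incrementCounter()")
        );
        require(success, 'Delegatecall to logic contract failed.');
    }
}
```

Users keep calling CounterState at the same address. Fixing the bug means deploying a new logic contract and calling upgradeTo, with no state migration. The price is that storage layouts must stay compatible across logic versions, which is easy to get wrong in larger contracts.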

Making MySmartContract Easier to Upgrade

Below you can see what Version 1 and Version 2 would look like with the changes described in this article. It’s worth repeating that the strategies used for MySmartContract are acceptable only because its state variables and logic are so simple.

First, let’s see Version 1 changes:

Version 1 — Without Upgradable Mechanisms

contract MySmartContract {
    uint32 public counter;

    constructor() public {
        counter = 0;
    }

    function incrementCounter() public {
        counter += 2; // This "bug" is intentional.
    }
}

In the code below, Version 1 implements a Circuit Breaker with an Access Restriction mechanism that allows the owner to stop the contract once it is deprecated.

Version 1 — With Deprecation Mechanism

contract MySmartContract {
    uint32 public counter;
    bool private stopped = false;
    address private owner;

    /**
    @dev Checks if the contract is not stopped; reverts if it is.
    */
    modifier isNotStopped {
        require(!stopped, 'Contract is stopped.');
        _;
    }

    /**
    @dev Enforces the caller to be the contract's owner.
    */
    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor() public {
        counter = 0;
        // Sets the contract's owner as the address that deployed the contract.
        owner = msg.sender;
    }

    /**
    @notice Increments the contract's counter if contract is active.
    @dev It will revert if the contract is stopped. See modifier "isNotStopped"
    */
    function incrementCounter() isNotStopped public {
        counter += 2; // This is an intentional bug.
    }

    /**
    @dev Stops / Unstops the contract.
    */
    function toggleContractStopped() isOwner public {
        stopped = !stopped;
    }
}

Now let’s see what Version 2 would look like:

Version 2 – Without Upgradable Mechanisms

contract MySmartContract {
    uint32 public counter;

    constructor(uint32 _counter) public {
        counter = _counter;
    }

    function incrementCounter() public {
        counter++;
    }
}

In the code below, Version 2 implements the same Circuit Breaker and Access Restriction mechanisms as Version 1. In addition, it implements a constructor that allows setting the initial value of counter during deployment. This mechanism can be used during an upgrade to copy data from an old version.

Version 2 — With Simple Upgradable Mechanism

contract MySmartContract {
    uint32 public counter;
    bool private stopped = false;
    address private owner;

    /**
    @dev Checks if the contract is not stopped; reverts if it is.
    */
    modifier isNotStopped {
        require(!stopped, 'Contract is stopped.');
        _;
    }

    /**
    @dev Enforces the caller to be the contract's owner.
    */
    modifier isOwner {
        require(msg.sender == owner, 'Sender is not owner.');
        _;
    }

    constructor(uint32 _counter) public {
        counter = _counter; // Allows setting counter's initial value on deployment.
        // Sets the contract's owner as the address that deployed the contract.
        owner = msg.sender;
    }

    /**
    @notice Increments the contract's counter if contract is active.
    @dev It will revert if the contract is stopped. See modifier "isNotStopped"
    */
    function incrementCounter() isNotStopped public {
        counter++; // Fixes bug introduced in version 1.
    }

    /**
    @dev Stops / Unstops the contract.
    */
    function toggleContractStopped() isOwner public {
        stopped = !stopped;
    }
}

Although the changes above implement some mechanisms that help with upgrading smart contracts, they do not solve the first challenge described at the beginning of this article, Users to Reference the New Contract’s Address. More advanced patterns like Proxies and Registry, or using ENS to register a user-friendly name for your contract, are required to spare users from having to update their reference to the new address of Version 2.

Conclusion

The principle of upgradable smart contracts is described in the Ethereum white paper’s DAO section that reads:

“Although code is theoretically immutable, one can easily get around this and have de-facto mutability by having chunks of the code in separate contracts, and having the address of which contracts to call stored in the modifiable storage.”

Although it is achievable, upgrading smart contracts can be quite challenging. The immutability of the blockchain adds complexity to smart contract upgrades because it forces you to carefully analyze the scenario in which the smart contract will be used, understand the available mechanisms, and then decide which of them fit your contract, so that a likely future upgrade goes smoothly.

Smart Contract upgradability is an active area of research. Related patterns, mechanisms and best practices are still under continuous discussion and development. Using Libraries and some Design Patterns like Circuit Breaker, Access Restriction, Proxies and Registry can help you to tackle some of the challenges. However, in more complex scenarios, these mechanisms alone may not be able to address all the issues, and you may need to consider more complex patterns like Eternal Storage, not mentioned in this article.

You can check the full source code, including related unit tests (not covered in this article for simplicity), in this GitHub repository.

tech

fastText Source Code Analysis

Introduction

fastText is a tool recently open-sourced by Facebook for computing word vectors and classifying text. Its theoretical basis is the following two papers:

Enriching Word Vectors with Subword Information

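This paper proposes replacing a plain word vector with the sum of the vectors of the word’s n-grams, which addresses plain word2vec’s inability to handle different morphological forms of the same word. fastText provides the maxn parameter to set the size of n for these n-grams.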

tech

DL AI Chip Market Overview

DL Market Overview

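Over the past year, deep learning platforms and hardware have appeared one after another, and power and area figures for all kinds of xPUs are flying around everywhere, which feels a bit chaotic. Here I summarize some of what I have seen and list the likely markets along the way. Before going further, I want to stress that deep learning has countless applications. I can only see the markets where it can be deployed in tens of millions of devices or more, so niche markets are not listed.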

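The two deep learning applications that are currently easiest to bring to production are image recognition and speech recognition. They show up in the following markets: personal devices (phones, tablets), surveillance, home, automotive, robots, and servers.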

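Start with phones and tablets. This market ships about 3 billion chips a year (including feature phones), worth 30 billion dollars in total excluding Apple. The main phone players are Apple (under 300 million units), Qualcomm (over 800 million), MediaTek (over 700 million), Samsung (under 100 million), HiSilicon (100 million) and Spreadtrum (over 600 million). Tablets total about 400 million units. On a 28 nm process at very high volume (over 100 million units), the engineering cost can be amortized very thin, and the average cost is about 8 cents per square millimeter. A low-end 4G chip (4 cores) is under roughly 50 mm², so it costs 4 dollars. A mid-range chip (8 cores) is typically around 100 mm² and costs 8 dollars. At 16 nm and more advanced nodes, the unit cost for the same transistor count rises to 1.5 times. Generally, the processor chip (including baseband) accounts for about 1/6 of a phone's bill of materials. A phone with a 90-dollar bill of materials usually uses a processor under 15 dollars, sometimes only 10. That 10-dollar chip contains the CPU, GPU, baseband and image signal processor, each a feat of high technology, yet it costs the same as a KFC family bucket, which is rather bleak. Production cost is only part of the picture, and people are also a major expense. A smartphone chip, covering hardware and software development, testing and production, takes no fewer than 300 people even when built entirely from mature IP. At 100,000 dollars per person per year and a two-year cycle to mass production, that is 60 million dollars. Add EDA tools, IP licensing and tape-out fees, and 100 million dollars is gone before the chip even exists.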
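The cost figures in this paragraph follow from simple multiplication, written out below. The one assumption added here is that the 100,000-dollar staff cost is per person per year, which is what makes the 60-million total work out.

```latex
\begin{align*}
C_{\text{die}} &= A \cdot c, \qquad c \approx 0.08\ \text{USD/mm}^2 \\
C_{\text{low-end}} &= 50\ \text{mm}^2 \cdot 0.08\ \text{USD/mm}^2 = 4\ \text{USD} \\
C_{\text{mid-range}} &= 100\ \text{mm}^2 \cdot 0.08\ \text{USD/mm}^2 = 8\ \text{USD} \\
C_{\text{staff}} &= 300 \cdot 100{,}000\ \text{USD/year} \cdot 2\ \text{years} = 60{,}000{,}000\ \text{USD}
\end{align*}
```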

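Back to the point. The most direct applications on phones are beauty cameras, AR and voice assistants. Translated into hardware instructions, these needs mean support for 8-bit integer dot products (INT8) and 16-bit floating-point arithmetic (FP16). How exactly should they be supported? I once saw a figure that I think illustrates this well: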

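Smartphones and tablets are Android's territory, and every independent chip vendor has to follow Google's lead. Google has defined Android NN as the upper-layer interface, which supports its TensorFlow as well as TensorFlow Lite, designed specifically for mobile devices. Underneath, depending on the scenario, the work can run on a CPU, GPU, DSP or a hardware accelerator. Their energy efficiency is shown in the figure below: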

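As can be seen, on TSMC's 16 nm process a big core achieves 10-100 Gops/W (INT8), a little core can reach 100 Gops/W to 1 Tops/W, a phone GPU is around 300 Gops/W, and going above 1 Tops/W requires an accelerator. Note that the front-end design philosophy of a little core is completely different from that of a big core, and its back-end implementation uses different physical cells. So although its frequency looks only 50% lower than a big core's, its energy efficiency on logic operations differs by more than 4 times, and by even more in vector computation.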

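In sustained use on a phone, total chip power must stay below 2.5 W, and no more than 1.5 W can go to deep learning tasks. At 1 Tops/W, that corresponds to 1.5 Tops (INT8) of processing capability. For photo recognition things are better, because it usually does not require long continuous processing, and the CPU can burst and then rest. Speech recognition has lower performance demands: 100 Gops handles typical applications, and a little core is enough. But in some continuous scenarios, such as AR environment recognition, 30-60 frames arrive every second, and unless context from neighboring frames is used to help with the decision, the CPU cannot keep up. That is where a GPU or accelerator comes in.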

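The figure above shows NVIDIA's neural network accelerator, DLA, which supports inference only. As mentioned earlier, phone applications also need only inference for recognition. Training can be done in advance on the server and the trained data downloaded to the phone, so recognition does not require a connection to the server.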

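The green DLA modules form something like a fixed pipeline, with a control module on top that can dynamically allocate compute units to fit different networks. Most accelerators I have seen are much the same, apart from several hundred kilobytes of SRAM to hold inputs and weights (a 4-layer INT8 network of 273×128, 128×128, 128×128 and 128×6 needs 70 KB of SRAM). Some accelerators add a SmartDMA engine that prefetches the needed data using simple calculations. According to some benchmark results I have seen, this prefetch module can raise compute unit utilization above 90%.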

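As for energy efficiency, the accelerators I have seen achieve 1.2 Tops/W (1 GHz on T16FFC) and 1 Tops/mm² with INT8 algorithms, and are approaching 1.5 Tops/W. In other words, 1.5 W buys a theoretical 2 Tops (INT8). How much is that? In what I currently work with, recognizing faces of 60×60 pixels and larger in a 1080p 60 FPS stream takes roughly 0.5 Tops, so 2 Tops is entirely sufficient. Of course, for complex scenes, more compute is always better.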
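The 2 Tops figure is a rounded product of the power budget and the efficiency quoted in this paragraph. Written out, with the stream count added here as an illustrative consequence of the 0.5 Tops per stream estimate:

```latex
\begin{align*}
T &= P \times \eta \\
T_{1.2} &= 1.5\ \text{W} \times 1.2\ \text{Tops/W} = 1.8\ \text{Tops} \\
T_{1.5} &= 1.5\ \text{W} \times 1.5\ \text{Tops/W} = 2.25\ \text{Tops} \\
N_{\text{streams}} &\approx \frac{2\ \text{Tops}}{0.5\ \text{Tops per 1080p60 stream}} = 4
\end{align*}
```

So "2 Tops" sits between today's 1.8 Tops and the 2.25 Tops expected at 1.5 Tops/W, which is roughly four times what the face recognition workload described here needs.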

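Why can a fixed pipeline be so energy efficient? It is common knowledge that an ASIC is far more efficient than a general-purpose processor. More specifically, DLA needs no instruction decode, no instruction (branch) prediction and no out-of-order execution, and its pipeline rarely stalls waiting for data. The figure below shows the dynamic power breakdown of a certain little core: the compute units account for only 1/3, while instruction and cache accesses account for half.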

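But on mobile, a neural network accelerator alone is far from enough. To achieve the effect in the figure below, for example, you first have to identify each fine body part precisely and then polish the result with various image algorithms. Today's mainstream image algorithms have nothing to do with deep learning, and I have not seen an accelerator on any embedded platform with good software support for them. Image algorithms are still mainly supported on PCs and DSPs, and even embedded GPUs do only a mediocre job.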

How can this be solved? I have seen two approaches:

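The first is a GPU with a built-in accelerator. The figure below shows an accelerator derived from VeriSilicon's Vivante. It combines a fixed-pipeline accelerator with a programmable module, the Vision core (similar to a shader unit in a GPU), has a configurable number of modules, and can support both vision and deep learning algorithms. Here, though, the traditional graphics units have been removed to save power and area, leaving only shared units such as the scheduler to handle scheduling for heterogeneous computing.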

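This kind of accelerator suits low-end phones, whose GPU and CPU are not powerful to begin with. Supporting a 1080p UI alone may exhaust the GPU, so an extra hardware module is needed for tasks with real performance demands.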

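On mid-range and high-end phones, the GPU and CPU have spare capacity when no game is running, so there is no need to remove the graphics functions. A deep learning accelerator can be added directly inside the GPU and scheduled by the GPU's scheduler for heterogeneous computing.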

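The figure above shows the texture unit of a certain GPU. Have you noticed that it closely resembles a neural network accelerator's pipeline? Both need weights, both need inputs, both need FP16 and integer arithmetic, and both use data compression. What differs is the density of compute units, along with pooling and activation. With minor changes the two can be made fully compatible, saving further area.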

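That said, as far as I know, 80% of image, video and vision applications on Android phones today are actually processed on the CPU. Google's Android NN also calls CPU assembly by default. Of course, the ISP built into phone chips and its post-processing, being tightly bound to the chip, can still make use of dedicated hardware. But the various accelerators, GPUs and DSPs still have a long way to go before they are truly integrated with applications.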

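There is one more application on end devices: AR. The iPhone 8 is rumored to implement it. If so, then after VR/AR in 2015, DL in 2016 and NB-IoT in 2017, 2018 will probably see this topic hyped all over again.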

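What technologies does AR actually use? As I understand it, a depth sensor first captures the scene's depth information, which is combined with the 2D scene from the camera to build a 3D object of the real world for certain targets (such as a table or a face). This requires image recognition to help identify objects, and object boundaries must also be determined. With the 3D coordinates of the real object, the virtual object to be rendered can be attached to it. The whole scene captured by the camera is then used as a texture on the background layer, and finally all layers are sent to the GPU or a hardware compositor to produce the final output. Along the way, the light source must be estimated and lighting applied to the virtual object. How much computation does each step take?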

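First is depth computation. There are currently three ways to obtain depth: stereo cameras, structured-light sensors and ToF. They derive depth from, respectively, differences between optical images, differences between the coded infrared pattern and its reflection, and the flight time of light pulses. The first requires a certain distance between the two cameras and places demands on indoor lighting, the second needs a lot of computation and works poorly outdoors, and the third has a higher lens cost. Apple is rumored to use structured light, mainly for indoor scenes, which avoids its weakness. A structured-light sensor costs 2-3 dollars, which is acceptable. As for compute, the most basic step is to compare the emitted pattern with the received pattern, both pseudo-randomly coded, compute the length difference, then use matrices to work back to the translation distance and thus the depth. This can be handled by a dedicated block. I have seen a single-chip solution where 720p 60 FPS processing requires more than 20 GFLOPS of FP32 compute. On a CPU that would be 8 cores. Of course, we can first recognize the target object, compute its outline with image algorithms, and also lower the depth map's precision (high precision is usually unnecessary), which greatly reduces the computation. The cost of recognition itself was given above. Outline extraction is classic image processing, and for a specific region the computation is tiny, so 1-2 cores can handle it.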

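Next, the real object's 3D coordinates are computed from the depth map and sent to the GPU. This is really the first stage of GPU rendering, called vertex processing. On mobile devices it typically accounts for only 10% of total GPU computation, and the pixel processing that follows is the bulk. Generating the virtual object's coordinates also happens here and is equally light.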

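Next is generating the background texture, including producing mipmaps and so on. This is also quick and takes little computation: put the raw camera image in memory and tell the GPU.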

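Slightly more troublesome is lighting the virtual object. The background texture needs no lighting calculation, since the lighting in the original image can be used as is. The virtual object, however, needs brightness and object orientation extracted from the background texture, and the light source direction must be computed. I have not seen a good algorithm for this, but there is a shortcut: create a light source shining down from above at some angle, which is passable if the demands on AR quality are low.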

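The rest of the rendering is similar to VR. ATW, front buffer rendering and the like can all be used, but it is fine without them, since this is not a 4K 120 FPS requirement. In short, if AR is not made too complex, its CPU and GPU demands are modest: an image recognition block plus 1-2 extra cores for everything else is enough.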

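Given the compute, how much bandwidth does a deep learning accelerator need? If the SRAM is large enough, 1 Tops needs under 5 GB/s. The accelerator can be attached to the CPU's accelerator port, ACP (an ARMv8.2 internal bus running at 1.8 GHz provides 9 GB/s). Data used only once can be marked non-shareable, while data that is exchanged with the CPU or used frequently can be marked Cacheable and Shareable, which both allocates space in the L3 cache and makes snooping more efficient, avoiding cache flushes.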

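However, all of the above assumes that the weights fit entirely in SRAM or cache. For 1 Tops of INT8 compute, the weights required amount to 512 GB. If they all sit in DDR, a phone's bandwidth of at most 30 GB/s is nowhere near enough. As for inputs, intermediate values and outputs, take my earlier example: a 4-layer INT8 network of 273×128, 128×128, 128×128 and 128×6 needs 70 KB of on-chip SRAM for its weights, about 70,000 of them. But the inputs, outputs and intermediate results together number only 535, which is small by comparison. The operation count here is 140,000 (counting a multiply and an add as 2 operations). The picture is similar at 1 Tops. Intermediate data stays in registers, and output data is insensitive to latency and depends only on bandwidth, which is sufficient. The real trouble is the weights, whose volume exceeds any acceptable bandwidth. Some deep learning algorithms I have seen have weights of tens of megabytes up to 200 MB, which cannot fit in SRAM no matter what. Even if only 10% has to be read in, that is still 50 GB/s of bandwidth. I think that for all the attractive benchmark scores and image recognition speeds being advertised now, compute unit utilization will drop sharply once the weights get too large. There are compression algorithms for sparse matrices, and some claim compression ratios of tens of times, but across the wide variety of applications the real-world result may end up not much better than a CPU.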
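The counts quoted for the 4-layer example can be checked directly. The working below assumes fully connected layers with no bias terms and one byte per INT8 weight, which is an interpretation of the layer sizes given in the text rather than something the author states.

```latex
\begin{align*}
N_{\text{weights}} &= 273 \cdot 128 + 128 \cdot 128 + 128 \cdot 128 + 128 \cdot 6 \\
                   &= 34944 + 16384 + 16384 + 768 = 68480 \\
N_{\text{values}}  &= 273 + 128 + 128 + 6 = 535 \\
N_{\text{ops}}     &= 2 \cdot N_{\text{weights}} = 136960
\end{align*}
```

That is 68,480 bytes of weights (the "70 KB" in the text), 535 input, intermediate and output values, and about 140,000 operations per inference. The weights outnumber the activations by more than a hundred to one, which is why the argument centers on weight bandwidth.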

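If the accelerator sits in the GPU, it still has to use the traditional ACE port. This raises bandwidth and keeps data exchange with the GPU cores internal, although interaction with the CPU is inevitably slower.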

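On Android devices, deep learning can run on the CPU, DSP or GPU, or on an accelerator, but whichever is used, it must closely follow Google. Google will use Vulkan Compute to replace OpenCL and Vulkan to replace OpenGL ES, so Android GPU developers may want to start getting familiar with them early.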

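Qualcomm has promoted the idea of training on phones and then networking phones together to form powerful compute. From my perspective this idea has many problems. Leaving real applications aside, who would open up their phone for others to train on? The power drain alone would be intolerable. And if I knew a phone was quietly uploading my image and voice templates to someone else, I would never buy it.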

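The second market is the home, comprising three segments: set-top boxes and home gateways (under 400 million units), digital TVs (under 300 million) and TV boxes (under 100 million). The whole market ships about 700 million chips, not counting the MCUs in appliances. Vendors are fragmented: MStar, HiSilicon, Broadcom, Marvell and Amlogic are all in it, along with countless small companies. Absent special requirements, a tablet chip paired with Wi-Fi will do. Mid-range and high-end products do demand picture quality, however. MTK's profit has now shifted from phones to TV chips, and it has distinctive display technology. Many set-top boxes connect over coaxial cable rather than Ethernet, which also calls for dedicated chips. Recently the smart speaker joined this market, and the big internet companies are staking out positions with the same zeal they once showed in chasing the phone as an entry point.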

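Home electronics has one more member: game consoles. Xbox and PlayStation each ship on the order of ten million units a year, and VR/AR and body recognition are already in use there.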

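For speech recognition, given a 100 Gops requirement and the 3 W power limit imposed by fanless design, CPUs, DSPs and accelerators are all options. The process would have to be 28 nm or older, though, since the volume is too low to justify 16 nm. The cheapest option is RISC-V plus DLA, which costs the least when there is no ecosystem lock-in. A standalone accelerator places few demands on the CPU, unlike a GPU that must support OpenCL and OpenGL ES: an 8-core G71 at 900 MHz needs roughly a 2 GHz A73 to drive it, and because of OpenGL ES constraints, little cores cannot share the load. For 100 Gops of speech processing, I estimate a processor running at a few hundred megahertz is enough.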

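On the image side, the main applications are still face recognition and recognition of the content being played, though this has not yet become a hard requirement. As noted earlier, 0.5 Tops handles simple scenes. At 4K resolution, the performance requirement is four times that of 1080p.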

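Next is the surveillance market. Image recognition in surveillance is the firmest demand for deep learning to date. The surveillance chip market itself is not large: over 100 million units and about 2 billion dollars in sales. The mainstream companies are Ambarella, Texas Instruments and HiSilicon, plus a few small companies, and some OEMs build their own chips.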

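The data flow of a traditional surveillance chip is the blue part of the figure above: data comes in from the sensor, passes through the image signal processor, goes to the video encoder, and finally leaves over the network. To recognize image content, you can take raw data directly from the sensor, or processed images from the ISP, and run recognition on them. Mid-range and high-end surveillance chips also contain a DSP for some post-processing and recognition work. Deep learning accelerators entering the picture actually conflict somewhat with the DSP. For some classic applications, such as license plate recognition, the DSP already does a good job. If image algorithms other than recognition are needed, the DSP must stay in the path and cannot be replaced. The DSP's software libraries for traditional algorithms are also far better. So the DSP cannot be removed, and adding another processing unit becomes a cost problem.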

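For some low-power scenarios I have seen people take a different route: drop the DSP entirely, give up storing and transmitting video and images, add an accelerator, and upload only feature information and data over NB-IoT. The whole chip's power can then be held below 500 mW. Combined with a sensor, the system wakes only when it detects an object passing and otherwise idles at a few milliwatts. For power, a 100 mm × 100 mm solar panel can deliver a few watts. This product's applications are still quite niche, though.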

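Another route to recognition is on the central-office side. With a graphics card, the GTX 1080 delivers 9 TFLOPS of FP32 at 180 W and 1.7 GHz on 16 nm, in 320 mm². A Mali G72MP32 provides 1 TFLOPS of FP32 on 16 nm at 850 MHz and 8 W, and scaling that to 9 TFLOPS gives 72 W and 666 mm². Of course, if the G72 were designed to run at 1.7 GHz, I believe it would draw no less than 180 W. In addition, desktop GPUs use immediate-mode rendering, with high bandwidth but little need for cache, so a mobile GPU is actually much larger in area. In return it needs far less bandwidth and correspondingly less power.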

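GPUs are for training, while video recognition only needs inference. Taking the NVIDIA Tesla P40's figure of 48 INT8 Tops, a fixed-pipeline accelerator needs only 48 mm² at 16 nm. 48 Tops corresponds to recognizing 96 streams of 1080p 60 fps video. Video decoders for 96 such streams take roughly 50 mm², and with SRAM and the rest, I estimate under 200 mm² in total. At a volume of ten million, the chip cost can be under 40 dollars (assuming decent yield, otherwise the stream count should be designed lower), whereas a Tesla P40 board sells for 500 dollars (including DDR chips), which is still a hefty margin. Quite a few small Chinese companies have now raised funding to build chips for this.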
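The stream count and die area above follow from two figures quoted earlier in the post: about 0.5 Tops per 1080p 60 fps stream and about 1 Tops/mm² for a fixed-pipeline INT8 accelerator at 16 nm. Written out, with the leftover area for SRAM and other blocks inferred here from the author's 200 mm² ceiling:

```latex
\begin{align*}
N_{\text{streams}} &= \frac{48\ \text{Tops}}{0.5\ \text{Tops per stream}} = 96 \\
A_{\text{accel}}   &= \frac{48\ \text{Tops}}{1\ \text{Tops/mm}^2} = 48\ \text{mm}^2 \\
A_{\text{accel}} + A_{\text{decode}} &\approx 48\ \text{mm}^2 + 50\ \text{mm}^2 = 98\ \text{mm}^2 \\
A_{\text{other}}   &< 200\ \text{mm}^2 - 98\ \text{mm}^2 = 102\ \text{mm}^2
\end{align*}
```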

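The fourth market is robots and drones. I have no data on robot volumes, and phone and tablet chips can also be used in this area. Drones ship about 2 million units a year worldwide, and vision processing chips should be of the same order. The recognition blocks in use are still mainly DSPs and CPUs, because a DSP can also run many image algorithms, much as in surveillance. This market has high demands on ISP and depth information. Both dual cameras and structured light can be used to compute depth, as covered above.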

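ISP and vision processing on drones require, beyond higher sharpness and real-time performance, one thing consumer electronics does not: fault tolerance. Drones position themselves by vision, so wrong data or an unresponsive module is unacceptable. The fix is simple. First, add ECC and built-in self-test to the on-chip memories. Second, instantiate two identical modules with staggered clock inputs to avoid clock-induced problems, then delay the outputs by the same number of cycles and synchronize them to one clock. If the two results disagree, handle the case specially so the data error does not spread.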

The fifth market is automotive. The whole automotive chip market is nearly 30 billion dollars, with many players:

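In automotive electronics, the deep learning application is ADAS. Within ADAS, speech and vision are technically not much different from the previous markets. The fault-tolerance requirement simply becomes more systematic, in the form of Functional Safety, and the entire hardware and software system must be certified before it can sell easily into the factory-installed market. Functional Safety goes a step beyond the ECC, BIST and lock-step measures above: it requires detailed test code and documentation for the whole chip and system software, with analysis of the error handling mechanisms in every kind of scenario, and even the compiler must be certified. The certification has four levels, ASIL-A through ASIL-D, and the highest requires a system error rate below 1%. I am not familiar with this certification, but many Chinese phone and tablet chips are used in aftermarket ADAS to provide voice alerts, with shipments above a million.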

Finally, here is ARM's ADAS reference design block diagram.

Probably no one will design an ADAS chip exactly this way, but a few points are worth borrowing:

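On the right is the safety island, containing dual Cortex-R52 cores in lock step. Its purpose is to ensure that the whole system can be reset, or an exception interrupt handled, even if every module on the left fails. The blue and green CryptoCell modules in the middle protect the data of the running system against malicious theft. I have covered TrustZone design fully in an earlier article, so I will not expand on it here.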

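The markets above are essentially all inference demand, and most are upgrades to existing products. Only ADAS, smart speakers and server-side video recognition and detection are new markets. Of these, smart speakers have reached the ten-million level, and the other two are still expanding.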

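Next comes server-side training hardware. For a mobile GPU that can be used for training, each compute core is 1.5 mm² (TSMC 16 nm), with an energy efficiency of 300 Gops/W at 1 GHz. I have no other system-level performance data. Although this market is hot and NVIDIA's stock is expensive as a result, I understand that worldwide sales of GPUs for deep learning training are under 100 million dollars a year. For anyone hoping for a share, the outlook may not be as good as imagined.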

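The Kirin 970 was released recently and, as expected, it includes Cambricon. Its 2 Tops of FP16 performance did surprise me. Working backward, I estimate this may take 6 mm² at 16 nm, about the size of an A73MP4+A53MP4 cluster (excluding L2 cache). Kirin chips place great emphasis on area cost, so spending this much area on a high-end feature shows HiSilicon's determination to carve out its own path in high-end phones, which is commendable. And since Cambricon is a general-purpose processor that runs instructions, it can be used beyond deep learning in many other cases, such as ISP post-processing and computing structured-light depth information, possibly with better energy efficiency than a DSP.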
