Category Archives: Architecture

Library dependency vs. Remote API dependency in SOA

Pros of Library Dependency:

  • Simplicity of implementation. In Java, all you need is Maven. There is no need for an RPC serialiser or RPC middleware.
  • Simplicity of deployment. You need fewer applications than with the RPC approach. Each application requires investment in automated deployment, monitoring, hardware resources and, in some cases, load balancing. Fewer applications mean less SiteOps burden.
  • High availability. The "service provider" can never be down on its own, since it is embedded in the consumer.
  • Invocation performance. No time is spent on network communication.

Cons of Library Dependency:

  • Performance as far as database connection pooling is concerned. Say a "service provider" talks to a database with 10 connections in its pool. 10 consumers embedding this service library will hold 100 connections to the database in total. The load on the db server can be huge. There is no such problem if the service provider is a standalone application, since only the provider application talks to its database.
  • High coupling through private API calls. If no limit is enforced, a service consumer can call a provider’s private API, that is, methods that consumers are not supposed to call. You can introduce limits simply by declaring rules, but that is not 100% safe. People may break them.
  • Indirect dependency version conflicts. C relies on S1, which relies on COM_version_1.1, and C also relies on S2, which relies on COM_version_1.2. C then has to decide which version of COM it should use. In Java, you may have to do a lot of Maven arbitration work. And believe me, it can get very messy.
  • Implicit dependency. Normally there is no service governance, so I don’t know who relies on me. If I am going to upgrade my interface, I don’t know exactly which consumer systems are affected. Maybe I can somehow find all the consumers who rely on me directly, but it may still be hard to know who relies on me INDIRECTLY! In the RPC approach, you know who calls you if there is good governance. And you don’t need to know who calls you indirectly, because that is not your responsibility. Your direct consumers will find out and use their own discretion.
  • Most annoying: a change in a provider’s implementation forces its consumer systems to upgrade and redeploy. Let’s say you fixed a bug in the provider’s implementation. The interface has not changed, but the consumers still need to upgrade their dependency on you! How annoying! This normally leads to a very long release process, because many systems need to be upgraded instead of just one.

Layering in Java Webapps – My Final Version

What a good layering solution should do

It must handle the following problems:

1. Dividing a system into layers based on modularization principles, such as single-direction dependency, limited entry of components and so on. 

2. Being compatible with modern dependency management tools such as Maven 

3. Allowing for evolving into a distributed system in the future 

Here I present my final version of layering, after many years of developing Java webapps.


Layers

[Diagram: layering]

Components’ Responsibilities

Component: Biz Layer – Neutral Services
Responsibility: Most reusable core business logic (CRUD, Biz Modelling etc.)
Naming Conventions: Bean: Xxx; Facade: XxxService; Repository: XxxRepos/XxxDAO

Component: Application Layer – Front-End Managers
Responsibility: The complete logic of use cases for front end users (customers)
Naming Conventions: Bean: FoXxx; Facade: FoXxxManager

Component: Application Layer – Back-Office Managers
Responsibility: The complete logic of use cases for back office users (admin, staff etc.)
Naming Conventions: Bean: BoXxx; Facade: BoXxxManager

Component: Application Layer – Partner System Service Providers
Responsibility: Remote services for partner systems which belong to the same company
Naming Conventions: Bean: SsXxx (Ss = Some System); Facade: SsRpc

Component: Web Layer – Front-End Controllers
Responsibility: Web MVC controllers to serve front end users with browsers. Presentation only. No biz at all
Naming Conventions: Bean: N/A; Facade: FoXxxController

Component: Web Layer – Front-End Web Services
Responsibility: Web Services to serve front end users’ rich clients (desktop/phone app). Presentation only. No biz at all
Naming Conventions: Bean: N/A; Facade: FoXxxResource/FoXxxSoap

Component: Web Layer – Back-Office Controllers
Responsibility: Web MVC controllers to serve back office users with browsers. Presentation only. No biz at all
Naming Conventions: Bean: N/A; Facade: BoXxxController

(Continued)

Component: Biz Layer – Neutral Services
Bean: Fine-grained entity. Data oriented
DTO: Beans as DTOs
Dependency on Layer Below: N/A

Component: Application Layer – Front-End Managers
Bean: Very coarse-grained. Crossing multiple biz entities. User oriented
DTO: Request/Response pairs as DTOs. Response has error props
Dependency on Layer Below:
    Order order = convertSomehow(foNewOrderRequest);
    authService.permissionCheck(currentUserId…);
    orderService.saveNewOrder(order);
    itemService.doAnotherThing(…);
    FoOrder foOrder = combineSomehow(order, item, permission);
    return FoResponse.success(foOrder);

Component: Application Layer – Back-Office Managers
Bean: Very coarse-grained. Crossing multiple biz entities. User oriented
DTO: Similar to the above
Dependency on Layer Below: Similar to the above

Component: Application Layer – Partner System Service Providers
Bean: Fairly fine-grained. Partner system oriented
DTO: Beans as DTOs
Dependency on Layer Below:
    Order order = convertSomehow(ssOrder);
    itemService.doAnotherThing();
    SsOrder ssOrder = addOrRemoveProp(order);
    return ssOrder;

Component: Web Layer – Front-End Controllers
Bean: N/A
DTO: N/A
Dependency on Layer Below:
    FoNewOrderRequest foNewOrderRequest = webFormToObj(httpRequest);
    FoResponse foResponse = foOrderManager.doSth(currentUserId, foNewOrderRequest);
    httpResponse.putSomehow("data", foResponse.getData());
    httpResponse.putSomehow("error", foResponse.getError());

Component: Web Layer – Front-End Web Services
Bean: N/A
DTO: N/A
Dependency on Layer Below:
    FoNewOrderRequest foNewOrderRequest = jsonToObj(jsonRequest);
    FoResponse foResponse = foOrderManager.doSth(currentUserId, foNewOrderRequest);
    return toRestResponse(foResponse);

Component: Web Layer – Back-Office Controllers
Bean: N/A
DTO: N/A
Dependency on Layer Below: Similar to "Front-End Controllers"

Be Pragmatic (Anti-Patterns)

1. Managers are allowed to call repositories (DAOs) directly. It is annoying to have to write a method in XxxService that merely wraps its same-named counterpart in XxxRepos.

2. BoXxx/BoRequest/BoResponse should be allowed to depend on and extend FoXxx/FoRequest/FoResponse, and back-office controllers should be allowed to call front-end managers. This is because back-office users are also users. If you don’t allow this "slight violation", you may end up writing tons of duplicate code in the Bo managers.

3. Web-layer annotations such as @XmlElement should be allowed on application-layer beans and DTOs. Otherwise, you will have to create a lot of duplicate beans/DTOs in the web layer.


Maven Projects

[Diagram: ideal-maven-projects]

(Note: the webapp must have a "runtime" dependency on impl_fo and impl_bo, which is not shown in the diagram.)

This is an ideal version of the separation. The problem is that it has too many Maven projects.

A pragmatic version


Support for Distributed System Design

1. The maven project "intf-pso" is used as the RPC client stub for partner systems

2. The maven project "intf-fo" can be used as the RPC client stub for web service clients if they are also written in Java

3. The web layer can become an independent system immediately, without any compilation errors, since it relies only on app-layer interfaces at compile time. Just let intf-fo and intf-bo be its client stubs. Going further, each of the web layer’s components can become an independent system.

If the web layer is transferred to another team

If the web layer is going to be transferred to another team in your organisation, the app layer should go with it. Otherwise, a small change will involve two teams’ work, which is unbearable.

That team will then be responsible for all the user-specific use cases, and its system will be treated as one of your partner systems.

[Diagram: web-another-team]

The original FO and BO managers should not rely on the biz services any more.  Instead, you must create pso interfaces to wrap them and provide services to the original managers.  


A demo webapp

For a demo webapp with most of the components described, see this project on GitHub.

How to signify the end of a self-defined message in TCP programming?

TCP’s data transfer is stream-based. If the two sides don’t agree on how to detect the end of a self-defined message, the receiver won’t know where one message ends and the next begins.

A simple way is to use a special character sequence as the ending flag, such as two consecutive newlines.

The problem is that the message body itself can contain the ending flag.

A commonly used approach is to specify the byte length of the message. You define a header and a body, and in the header you state the length of the whole message. It is more complicated for message receivers, though.
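
The length-based approach can be sketched in Java roughly as follows. This is only an illustration under a few assumptions that the post does not specify: the header is a 4-byte big-endian integer, it carries the length of the body only, and the body is UTF-8 text. The class name, method names and the size limit are made up for the example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedFraming {
    // Upper bound on the body size; an arbitrary value chosen for this sketch.
    static final int MAX_BODY_BYTES = 1 << 20;

    private LengthPrefixedFraming() {}

    // Writes a 4-byte length header followed by the body bytes.
    static void writeMessage(OutputStream out, String message) throws IOException {
        byte[] body = message.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(body.length);
        data.write(body);
        data.flush();
    }

    // Reads the header first, then exactly as many body bytes as it announces.
    static String readMessage(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0 || length > MAX_BODY_BYTES) {
            throw new IOException("Invalid message length: " + length);
        }
        byte[] body = new byte[length];
        // readFully blocks until the whole body has arrived, however TCP split it.
        data.readFully(body);
        return new String(body, StandardCharsets.UTF_8);
    }
}
```

Because the receiver never scans the body for a terminator, the body may contain two newlines, or any other byte sequence, without confusing the boundary detection.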

Tips about writing a scraper


Workflow Model:  Download All Pages Before Parsing Them

Website scraping is batch work. Basically, there are two workflow models. One is: download page 1 => parse page 1 and save its records => download page 2 => parse page 2 => … . The other is: download all the pages => parse and save all the pages. You had better go with the "download all, then parse and save" approach.

Why? Think of failure situations. You may fail in the middle of a batch because of a parsing error (mainly caused by a not-so-robust parsing program). If a parsing error happens during "download one, parse one", you may have to spend a while investigating and correcting your program, during which AWS EC2 (if you are using it) will not stop charging you, and the site you are scraping may notice your "attack" and bring up an anti-scraping mechanism. What’s worse, when you rerun your program, you may have to re-download the pages you have already downloaded, unless your program recorded where it last stopped and knows how to restart from there. It’s doable, but tricky, and normally not worth it. Finally, one retry doesn’t necessarily work. Your program may hit bugs again and again; it normally takes more than two revisions to get a version that works. That brings more frustration.

On the other hand, with the "download all first" approach, a parsing error does not lead to the problems mentioned above. You have all your pages. With all the material in hand, you are under less pressure.

Time management is another reason to choose "download all first". You don’t want to restart downloading, since it is time-consuming, whereas you can redo the parsing because it can be done in a few minutes. To sum up, first deal with the things that are not totally under your control, then do the remaining work with fewer worries.

Save All Files in a Fixed Path and Use a Single File Path API

Page downloading involves retrying. You don’t want to re-download pages that you already downloaded during previous attempts. One way to achieve this is to test whether their corresponding files already exist. That’s why you must save the files under a fixed path on every attempt.

You may also want to use a single file path API that all the modules of your program call to decide where the files are or should be, so that you don’t need to pass paths around as module parameters. This not only simplifies your code but also enforces the "fixed path" scheme.
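
A single file path API could look something like the sketch below. The root directory, the file naming pattern and the names `PagePaths`, `pageFile` and `alreadyDownloaded` are assumptions made for illustration; the point is only that every module asks the same class where a page lives.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

final class PagePaths {
    // Fixed root directory, identical on every run (hypothetical location).
    static final Path ROOT = Paths.get("scrape-data", "pages");

    private PagePaths() {}

    // The one place that decides where the file for a given page is stored.
    static Path pageFile(String category, int pageIndex) {
        String fileName = String.format(Locale.ROOT, "page-%05d.html", pageIndex);
        return ROOT.resolve(category).resolve(fileName);
    }

    // A downloader can skip a page whose file exists from a previous attempt.
    static boolean alreadyDownloaded(String category, int pageIndex) {
        return Files.isRegularFile(pageFile(category, pageIndex));
    }
}
```

The downloader would call `alreadyDownloaded` before fetching, and the parser would call `pageFile` to read the same file back, so neither needs a path parameter.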

Log Errors and Important Statistics

You must log errors so that you can find out whether you have all the data, how failures happened, and which pages need to be re-downloaded.

You should also record key statistics, such as how many records a landing page says there will be, so that you can validate your final results against that number. These statistics also provide the foundation for time measurement.

Make your Downloading Faster and More Robust

To make downloading faster, you can adopt a thread-pool-based design to download pages in parallel. You must also reuse your HTTP connections, since establishing a connection is quite time-consuming. If you are using Java, try PoolingHttpClientConnectionManager from Apache HttpClient.
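
As a rough sketch of the thread-pool idea, the example below uses only the JDK’s `ExecutorService`. The actual HTTP call is deliberately left out and passed in as a function (`downloadOne`, a hypothetical name), which is where a shared, connection-reusing HTTP client would be used. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelDownloader {
    private ParallelDownloader() {}

    // Downloads all URLs on a fixed-size pool; results keep the order of the input list.
    static List<String> downloadAll(List<String> urls,
                                    Function<String, String> downloadOne,
                                    int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> downloadOne.apply(url));
            }
            List<String> pages = new ArrayList<>();
            // invokeAll waits for every task and returns futures in task order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                pages.add(future.get());
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch a single failed task makes `future.get()` throw; a real scraper would more likely log the failure and continue with the remaining pages.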

Let your download worker retry 2-5 times when it fails to download a page, to increase the chance that you get all the data in one batch. You can let it sleep for 100 ms before retrying, so that the website can "take a breath" before serving you again. You must work out what counts as a failure and which failures are "retriable". Here is my list:

1. An HTTP error with code >= 500 is a retriable failure
2. An HTTP 200 saying something like "cannot work for now" is a retriable failure
3. An HTTP 200 with too little data is a retriable failure
4. A network issue such as a timeout or an IO exception is a retriable failure
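
The list above could be turned into code along these lines. This is a sketch, not a definitive implementation: `PageFetcher` and `FetchResult` are hypothetical stand-ins for whatever HTTP client you use, and the "busy" marker text and the minimum body length are assumed values that would have to be tuned per website. Anything not on the list (for example a 404) is treated as not retriable, which is also an assumption.

```java
import java.io.IOException;

final class RetryingDownloader {
    // Hypothetical abstraction over the real HTTP client.
    interface PageFetcher {
        FetchResult fetch(String url) throws IOException;
    }

    static final class FetchResult {
        final int status;
        final String body;

        FetchResult(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Assumed values for rules 2 and 3; tune them for the site being scraped.
    static final String BUSY_MARKER = "cannot work for now";
    static final int MIN_BODY_LENGTH = 200;

    private RetryingDownloader() {}

    static boolean isRetriable(FetchResult result) {
        if (result.status >= 500) {
            return true; // rule 1
        }
        if (result.status == 200) {
            // rules 2 and 3
            return result.body.contains(BUSY_MARKER) || result.body.length() < MIN_BODY_LENGTH;
        }
        return false;
    }

    // Tries up to maxAttempts times, sleeping between attempts.
    // Returns the last result, which may still be a retriable failure the caller should log.
    static FetchResult download(PageFetcher fetcher, String url, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        FetchResult last = null;
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                last = fetcher.fetch(url);
                lastError = null;
                if (!isRetriable(last)) {
                    return last;
                }
            } catch (IOException e) {
                lastError = e; // rule 4: timeouts and other IO problems
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);
            }
        }
        if (lastError != null) {
            throw lastError;
        }
        return last;
    }
}
```

A worker would call something like `RetryingDownloader.download(fetcher, url, 3, 100)` and log the page as failed if the call throws or if the returned result is still retriable.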

Deal with the Website’s Performance Issue

A typical performance problem of the target website is that it may fail or refuse to work when you query a large data set. For example, if you search for something within the "shoes" category, you may get the results quickly; but when you search for the same thing within the "clothes" category, it may take quite a while or even fail.

Another problem is related to paging. You may find that the program runs well while it scrapes the first few pages, and starts to fail when the page index reaches 100. This happens a lot with Lucene/Solr-based websites.

To deal with these two problems, you can split your target category into smaller ones and query each of them separately. Small categories normally have far fewer records and pages.

Exploit the Cloud if Necessary
 

A scraping program must run on a powerful computer, unless you are only targeting a small dataset. It needs a network location close to your target website. It must have multiple CPUs if the program is multi-threaded. It should also have plenty of memory. Finally, it must have large storage, otherwise your disk could soon be full.

Your own computer may not satisfy all these requirements. In that case, try the cloud. I often use AWS EC2 to scrape US websites. I can get all I need from EC2, and its cost is low and flexible.

Be Honest with Your Sponsor

If you are doing the scraping for yourself, you can ignore this part.  Read on if you are a freelancer or if you are doing it for your client or your boss.

Websites may not be accurate themselves. The landing page can say that it has 10k records when it actually has only 9.5k. A site may tell you that you will get 1000 records in total when you query 20 records per page, yet when you choose 100 records per page to make it run faster, you end up with only 900 records in total.

Find out the inaccuracies of the website and let your sponsor know. Let them know that it is not your fault that some records will be missing.

Sometimes you just can’t download all the records because of the site’s poor performance, technical difficulties, or time/budget limits. Let your sponsor know about this too. Ask them what percentage of missing records they can accept, and agree on an acceptable deal.

Be honest with your sponsor. It is better that you find the problems than that they do.

When the scraping is done, give your sponsor a validation report. Tell them how many records are missing for each category (you can work this out by analysing the final results and the logs) and provide the links for them to check. Let them feel that your work is under their control.

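Use Local Caches with Caution in a Clustered Environment

Use local caches with caution in a clustered environment.

User 1 sees 100 records on machine A, while user 2 sees only 90 records on machine B.

You may say that your business allows the two sides to see different data. True, it does not matter if two different users see different results.

But if the same user sees inconsistent results, the user experience is terrible, bad enough to make people swear. For example: user 1 submits a form on machine A to delete 100 records. After the server finishes processing, it tells the browser to redirect (Redirect after Submission), and the load balancer routes the redirected request to machine B. The local cache on machine B has not changed, so it still shows the record count from before the deletion.

Seeing this result, user 1 has only one thought: the deletion did not work.

So, when using a local cache in a clustered environment, you must make sure that consecutive requests from the same user reach the same machine.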

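Comparison-Based Paging vs. Page-Number Paging

In comparison-based paging, the input is a value to compare against. In page-number paging, the input is a page number.

User experience
For features whose data keeps growing, page-number paging has a user-experience drawback: when you turn to the next page, you may see records that you have just seen.

Take Tieba (a forum) as an example. When you open the first page, you see records 06, 05 and 04; when you turn to the second page, you should see 03, 02 and 01. However, before you turn the page, another user inserts record 07. When you then ask for the second page, the system considers the first page to be 07, 06, 05, so it returns 04, 03, 02 as the second page, and you have just seen record 04.

The higher the TPS, the worse this gets.

Comparison-based paging does not have this problem. The first page shows 06, 05, 04; when you turn to the second page, the client tells the system "I want the three records smaller than 04". Whether or not record 07 has been added in the meantime, the instruction the system receives is "the three records smaller than 04", so it always returns 03, 02, 01.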
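
The forum scenario above can be reproduced with a small in-memory sketch. The class and method names are invented for illustration, and a real system would express both queries in SQL or a search engine rather than over a Java list; for the first page of the comparison-based variant, this sketch assumes the caller passes `Integer.MAX_VALUE` as the compared value.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class PagingDemo {
    private PagingDemo() {}

    // Page-number paging: skip the newest (page - 1) * size records, then take one page.
    static List<Integer> byPageNumber(List<Integer> ids, int page, int size) {
        return ids.stream()
                .sorted(Comparator.reverseOrder())
                .skip((long) (page - 1) * size)
                .limit(size)
                .collect(Collectors.toList());
    }

    // Comparison-based paging: take the newest records that are smaller than the last one seen.
    static List<Integer> byComparison(List<Integer> ids, int lastSeenId, int size) {
        return ids.stream()
                .filter(id -> id < lastSeenId)
                .sorted(Comparator.reverseOrder())
                .limit(size)
                .collect(Collectors.toList());
    }
}
```

With records 1 to 6, `byPageNumber(ids, 1, 3)` gives 6, 5, 4. After record 7 is added, `byPageNumber(ids, 2, 3)` gives 4, 3, 2, repeating record 4, while `byComparison(ids, 4, 3)` still gives 3, 2, 1.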

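Performance
If you want to optimise the performance of a paged list view, a common strategy is to cache the first few pages. Caching raises one question: what are the cache keys?

For page-number paging, the cache key is simply the page number; to cache the first N pages, you only need to cache the data for N keys.

For comparison-based paging, what is the cache key? It could be any value in your system, and as data is added and removed, a value that was the end of a page a moment ago may no longer be. How many keys would you need to cache? You would have to cache every value. Apart from the first page, which can be cached because its compared value is fixed at 0, the other pages cannot practically be cached.

User footprint tracking
Comparison-based paging is bad for tracking user footprints. It is hard to tell from access records (such as the access log) how many pages most users scroll through; page-number paging has no such problem.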

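Things to Watch for When Developing A/B Testing Features

Some things to watch for:

1. Consistency of what users see: Zhang San always sees version A and Li Si always sees version B. Otherwise users will be confused, and may even feel toyed with. Some edge cases for consistency also need to be considered:

    a. An anonymous user who performs several actions on the same machine should see the same version every time.

    b. An anonymous user who has seen version A and then registers should still see version A.

2. The system should have a back door for switching freely between A and B, to make testing convenient during the test phase.

3. Use a framework to keep intrusion into business code to a minimum, so that an A/B test can be ended easily. The framework should focus on one problem: wrapping an interface that returns the version the current request should map to. Business code should not make this decision itself. The implementation of this interface can be generic logic built into the framework, or it can be supplied by the user for a specific business need.

4. After a bucket test ends, clean up what it left behind in the business code. Otherwise, as tests accumulate, these leftovers make the code less and less readable.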
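
One possible shape for the interface described in point 3 is sketched below. All names are hypothetical. The sketch assumes that each visitor carries a stable identifier (for example one stored in a cookie and kept after registration), which is one way to satisfy points 1a and 1b, and it includes a simple override map as the "back door" from point 2.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The only thing business code needs to call.
interface VariantResolver {
    String variantFor(String visitorId);
}

// A generic built-in implementation: deterministic bucketing by hash, plus overrides.
final class HashingVariantResolver implements VariantResolver {
    // Back door: visitors forced into a specific version for testing.
    private final Map<String, String> forced = new ConcurrentHashMap<>();

    void force(String visitorId, String variant) {
        forced.put(visitorId, variant);
    }

    @Override
    public String variantFor(String visitorId) {
        String override = forced.get(visitorId);
        if (override != null) {
            return override;
        }
        // The same visitorId always hashes to the same bucket.
        return Math.floorMod(visitorId.hashCode(), 2) == 0 ? "A" : "B";
    }
}
```

Business code would only ask `resolver.variantFor(visitorId)` and branch on the answer; ending the test then means removing those call sites rather than untangling bucketing logic.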

Partly based on:
http://www.slideshare.net/patio11/ab-testing-framework-design-3296257

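Tags and Badges That Affect Business Logic Should Be Created by Developers

In a community system, who should create the various tags and badges, and who should apply tags and award badges?

Applying tags and awarding badges should of course be done by operations staff.

Who creates them? If a tag is only displayed and does not alter any if/else logic, operations staff can create it in the back-office UI.

Otherwise, developers should create it, because the code needs to use it. If it has not been created yet, how could the code use it?

In most cases, it can simply be hard-coded as an enum class, with no database table needed at all.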
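
As a small illustration of the enum approach, the sketch below hard-codes a few badges and uses one of them in a business rule. The badge names and the pinning rule are invented for the example.

```java
import java.util.Set;

// Badges that business logic depends on live in code, not in a database table.
enum Badge {
    MODERATOR,
    VERIFIED_SELLER,
    EARLY_ADOPTER
}

final class PostPolicy {
    private PostPolicy() {}

    // The if/else logic can reference the badge safely because it is a compile-time constant.
    static boolean canPinPosts(Set<Badge> badges) {
        return badges.contains(Badge.MODERATOR);
    }
}
```

Operations staff would still decide which users receive `MODERATOR`; only the definition of the badge is owned by developers.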

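Validate Statistics Before Persisting Them: They Must Not Fall Outside the Value Range

Because of concurrency, program bugs or other reasons, your statistics may be more or less inaccurate.

Inaccuracy may be barely acceptable, but going outside the valid range is embarrassing.

If your page shows that someone has posted -1 posts, you become a laughing stock. If it shows 0, that is still passable even if imprecise.

So, before persisting any statistic, normalise a value that falls outside the valid range to the boundary value (for example, set it to 0 if it is less than 0).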
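
A minimal sketch of this normalisation step follows. The class and method names are illustrative, and the valid range for a post count is assumed to be zero and above.

```java
final class StatSanitizer {
    private StatSanitizer() {}

    // Pulls a value back to the nearest boundary of [min, max].
    static long clamp(long value, long min, long max) {
        return Math.max(min, Math.min(max, value));
    }

    // A post count can never be negative, whatever the counters say.
    static long sanitizePostCount(long rawCount) {
        return clamp(rawCount, 0L, Long.MAX_VALUE);
    }
}
```

Calling `sanitizePostCount(-1)` yields 0, so a drifted counter is stored as an imprecise but plausible value rather than an impossible one.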