Monthly Archives: March 2015

Don't casually merge features with different positioning into a single product

If two kinds of features target different users, or the products they belong to are positioned differently, there is no need to build them into one product even when they are similar or related. Build them as two separate products instead, so that each can be polished to its best.

Bundling them does create a "one-stop shop," which benefits users. But sometimes we put too much faith in that idea and end up with something that feels neither fish nor fowl: would you go to Walmart to buy a guitar?

People focus on completely different things when conceiving a product than when building it

When building a product, you care a great deal about time management and execution efficiency; the fewer things to do, the better. In a word, you pursue convergence.

When conceiving a product, you don't care that much about time, because thinking itself is cheap: thinking a bit more costs very little. Besides, while thinking you naturally draw connections and come up with new ideas that make the product richer. Ideation pursues a "divergent" style.

A good software engineer should encourage the product manager to let his imagination fly, or fly along with him, but during implementation he should strictly control project scope and execution discipline. A good product manager should likewise understand the importance of efficient execution, and provide a precise product requirements document (PRD) once the scope of the current release is fixed.

Product decisions are not science

Product decisions are not science. When an argument breaks out, both sides seem to have a point and neither can convince the other; this is very different from technical debates.

One solution is dictatorship: let the most senior person decide, since senior people's judgment is usually more reliable. But in a large organization, senior people are very sensitive about their own standing (losing their title, their turf, or being pushed out), so their decisions often carry personal-interest calculations, and they make short-sighted moves for the sake of KPIs.

What to do, then? One approach is to adopt, like Apple, a value system in which user experience trumps everything; another is to keep the company small and let the boss decide himself, because he has no one to report to.

A product can have non-core features, but they should be played down

I once received a resume from a fresh graduate. Alongside his development skills, he devoted a large amount of space to listing all kinds of social activities, which badly overshadowed the main point. I closed it after a quick glance.

I also once received a resume from an experienced candidate. We were hiring a senior Java engineer, yet the resume claimed fluency in both C++ and J2EE and gave them equal space. It left an impression of "jack of all trades, master of none," and I closed it right away.

Yes, I was being subjective, neither rational nor scientific, but that is human nature. When you market yourself, the non-core material often does not add points; instead it colors the overall perception of you, making people doubt your core competence, or creating a sense of clutter that blurs your positioning. Had those resumes left such material out, or only mentioned it briefly, they might have earned an interview.

Internet product design works the same way. If WeChat put shopping and money transfer on its home screen, what kind of product would WeChat be? People are picky and emotional; they won't listen to your lengthy reasoning, they just follow their gut.

Non-core features are not necessarily bad; they can enrich the product. So put them in a corner, or treat them as a separate product and make the current product merely its entry point.

"Brand" is not worth that much for internet products

Traditional marketing theory treats "brand" as a key competitive advantage. Between two similar products, the one with the better brand takes more of the market, even if its quality is a bit worse. So ambitious companies spend heavily on building and maintaining their brands; the brand can matter even more than the product itself. KFC's burgers, for instance, are really quite mediocre.

In my own analysis, brands matter for three reasons:

1. Information asymmetry puts consumers at a disadvantage. Consumers don't believe they can professionally judge a product's quality, let alone its raw materials, production process, or potential after-sale problems. So they buy the better brand to be on the safe side.

2. High switching costs make consumers afraid to "try" something else. If what you bought turns out to be bad and you switch to another product, the money already spent is wasted; even if returns are possible, the shipping and time costs are high. Given that risk, buying the brand has the lowest overall cost.

3. A brand's price feeds self-image. Buying Apple or wearing Nike certainly carries more face than buying Xiaomi or wearing Baleno.

Internet products have none of these problems:

1. Information is symmetric. Whether a website or an app is good to use, users find out simply by using it. They still don't know what technology or architecture sits behind it, but that doesn't matter at all. So even a big-name site can't put on airs: if it isn't good to use, users just leave.

2. Switching cost is zero. Everything is free anyway, so users leave whenever they like, and MySpace collapsed just like that.

3. Which website you choose does nothing for your image. Shopping on Amazon or on JD.com changes nothing about whether you come across as a loser or as tall, rich, and handsome.

Of course, a brand is still useful: it has an inertia effect that keeps users on the product for a while. But in the internet industry that inertia lasts a very short time; you must keep improving product quality and user experience to sustain it.

Tips about writing a scraper


Workflow Model:  Download All Pages Before Parsing Them

Website scraping is batch work. Basically there are two workflow models for this job. One is: download page 1 => parse page 1 and save its records => download page 2 => parse page 2 => … . The other is: download all the pages => parse and save all the pages. You'd better take the "download all, then parse and save" approach.

Why is that? Think of the failure cases. A batch process may fail in the middle because of a parsing error (mostly caused by your not-so-robust parsing program). If you go with "download one, parse one" and a parsing error happens, you may have to spend a while investigating and fixing your program, during which AWS EC2 (if you are using it) keeps charging you, and the site you are scraping may notice your "attack" and bring up an anti-scraping mechanism. What's worse, when you retry your program you may have to re-download the pages you have already downloaded, unless your program recorded where it last stopped and knows how to resume from there. That is doable, but tricky, and normally not worth it. Finally, one retry doesn't necessarily work; your program may have bugs again and again, and it usually takes more than two revisions to get a solid version. That brings even more frustration.

On the other hand, with the "download all first" approach a parsing error does not lead to any of the problems above. You've got all your pages; all the material is in your hands, so there is much less pressure.

Time management is another reason to choose "download all first". You don't want to restart the downloading, since it is time-consuming, whereas re-running the parsing takes only a few minutes. To sum up: first deal with the things that are not fully under your control, then do the remaining work with fewer worries.
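
For clarity, here is a minimal sketch of the two-phase structure; the two phase methods are placeholders for your own downloading and parsing code:

```java
// A minimal sketch of the "download all, then parse and save all" workflow.
public class ScrapeJob {

    public static void main(String[] args) throws Exception {
        downloadAllPages();  // Phase 1: network-bound, not fully under your control
        parseAndSaveAll();   // Phase 2: local and cheap to re-run if a parsing bug shows up
    }

    static void downloadAllPages() throws Exception {
        // fetch every page and write it to a fixed local path (see the next section)
    }

    static void parseAndSaveAll() throws Exception {
        // read the local files, parse them, and save the records
    }
}
```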

Save All Files in a Fixed Path and Use a Single File Path API

Page downloading involves retrying, and you don't want to re-download the pages you already fetched in previous attempts. One way to achieve this is to check whether the corresponding files already exist, which is why you must save the files to a fixed path on every run.

You may also want a single file-path API that every module of your program uses to decide where the files are (or should be), so that you don't need to pass paths around as module parameters. This not only simplifies your code but also enforces the "fixed path" scheme.
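
As an illustration, here is a small sketch of such an API; the base directory and the naming scheme are assumptions you would adapt to your own layout:

```java
// A single file-path API: every module asks this class where a page lives on disk,
// and the downloader uses it to skip pages that were already fetched in a previous run.
import java.io.File;

public class PagePaths {

    private static final String BASE_DIR = "/data/scrape/batch-01"; // fixed per batch (assumption)

    /** Where the HTML of a given category/page is (or should be) stored. */
    public static File fileFor(String category, int pageIndex) {
        return new File(BASE_DIR, category + "/page-" + pageIndex + ".html");
    }

    /** True if the page was already downloaded in an earlier try. */
    public static boolean alreadyDownloaded(String category, int pageIndex) {
        File f = fileFor(category, pageIndex);
        return f.exists() && f.length() > 0;
    }
}
```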

Log Errors and Important Statistics

You must log errors so you can find out whether you have got all the data, how the failures happened, and which pages need to be re-downloaded.

You should also record key statistics, such as how many records a landing page claims there will be, so that you can validate your final results against that number. These logs also provide the foundation for time measurement.
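
A possible shape for this logging, using java.util.logging to stay dependency-free; the message formats are only a suggestion, chosen so the lines are easy to grep later:

```java
// Sketch of logging download failures and the record counts announced by landing pages.
import java.util.logging.Logger;

public class ScrapeLog {

    private static final Logger LOG = Logger.getLogger(ScrapeLog.class.getName());

    /** Record how many results the landing page claims, for validating the final output. */
    public static void logExpectedCount(String category, int expectedRecords) {
        LOG.info("EXPECTED category=" + category + " records=" + expectedRecords);
    }

    /** Record a failed page so it can be grepped out of the log and re-downloaded. */
    public static void logDownloadFailure(String url, Exception cause) {
        LOG.warning("FAILED url=" + url + " reason=" + cause);
    }
}
```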

Make your Downloading Faster and More Robust

To make downloading faster, you can adopt a thread-pool-based design and download pages in parallel. You should also reuse your HTTP connections, since establishing a connection is quite time-consuming. If you are using Java, try Apache HttpClient's PoolingHttpClientConnectionManager.
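
Here is a rough sketch of that setup with Apache HttpClient 4.x; the pool size and the save step are assumptions to adapt:

```java
// A pooled, multi-threaded downloader sketch: one shared HttpClient backed by
// PoolingHttpClientConnectionManager, and a fixed-size thread pool of workers.
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledDownloader {

    private static final int POOL_SIZE = 10; // tune to what the site tolerates (assumption)

    public static void downloadAll(List<String> pageUrls) throws InterruptedException {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(POOL_SIZE);
        cm.setDefaultMaxPerRoute(POOL_SIZE); // all requests go to the same site

        CloseableHttpClient client = HttpClients.custom().setConnectionManager(cm).build();
        ExecutorService workers = Executors.newFixedThreadPool(POOL_SIZE);

        for (String url : pageUrls) {
            workers.submit(() -> {
                try (CloseableHttpResponse resp = client.execute(new HttpGet(url))) {
                    String body = EntityUtils.toString(resp.getEntity(), StandardCharsets.UTF_8);
                    // save `body` to its fixed path here (see the file-path API above)
                } catch (Exception e) {
                    // log it and leave the page for a retry
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(2, TimeUnit.HOURS);
    }
}
```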

Let your downloading worker retry 2-5 times when it fails to download a page, to increase the chance of getting all the data in one batch. You can let it sleep 100 ms before retrying, so that the website can "take a breath" before serving you again. You must also decide what counts as a failure and which failures are "retriable". Here is my list, followed by a retry sketch:

1. An HTTP error with a status code >= 500 is a retriable failure
2. An HTTP 200 response saying something like "cannot work for now" is a retriable failure
3. An HTTP 200 response with too little data is a retriable failure
4. A network issue such as a timeout or an IO exception is a retriable failure
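
Here is how the retry loop might look; downloadOnce(), the length threshold, and the "cannot work for now" marker are assumptions to replace with your site's specifics:

```java
// Retry-with-short-sleep around a single page download. Retriable failures show up
// either as IOExceptions (timeouts, HTTP >= 500 surfaced as errors) or as an HTTP 200
// body that is too short or carries an error message.
import java.io.IOException;

public class RetryingDownloader {

    private static final int MAX_TRIES = 4;          // 2-5 tries is usually enough
    private static final int MIN_BODY_LENGTH = 500;  // "too little data" threshold (assumption)

    public String downloadWithRetry(String url) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
            try {
                String body = downloadOnce(url);      // hypothetical single-shot download
                if (!isRetriableBody(body)) {
                    return body;                      // looks like a good page
                }
            } catch (IOException e) {
                // timeout, connection reset, or a 5xx mapped to an exception: retriable
            }
            Thread.sleep(100);                        // let the site take a breath
        }
        return null;                                  // give up; log it for a later batch
    }

    private boolean isRetriableBody(String body) {
        return body == null
                || body.length() < MIN_BODY_LENGTH
                || body.contains("cannot work for now"); // site-specific marker (assumption)
    }

    // Placeholder for the real HTTP call, e.g. via the pooled client above.
    private String downloadOnce(String url) throws IOException {
        throw new UnsupportedOperationException("wire this to your HTTP client");
    }
}
```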

Deal with the Website’s Performance Issue

A typical performance problem of the target website is that it may fail, or refuse to work, when you query over a large data set. For example, if you search something within the "shoes" category you may get the results quickly, but when you search the same thing within the "clothes" category it may take quite a while or even fail.

Another problem is related to paging. You may find that the program runs well while it scrapes the first few pages and starts to fail once the page index reaches 100 or so. This happens a lot with Lucene/Solr-based websites.

To deal with both problems, you can split your target category into smaller ones and run the query on each, as sketched below. Small categories normally have far fewer records and far fewer pages.
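
A sketch of the idea, with hypothetical sub-category names:

```java
// Split one large category into smaller sub-categories and scrape each one,
// so no single query pushes the site past its performance limits.
import java.util.Arrays;
import java.util.List;

public class CategorySplitter {

    public static void scrapeClothes() throws Exception {
        // Instead of one huge "clothes" query, iterate over its sub-categories (assumed names).
        List<String> subCategories = Arrays.asList("t-shirts", "jackets", "dresses", "jeans");
        for (String sub : subCategories) {
            scrapeCategory(sub); // each sub-category has fewer records and fewer pages
        }
    }

    static void scrapeCategory(String category) throws Exception {
        // run the normal paged download for this smaller category
    }
}
```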

Exploit the Cloud if Necessary
 

A scraping program must run on a powerful machine, unless you are only targeting a small dataset. It needs network access close to your target website. It must have multiple CPUs if the program is multi-threaded, it should have plenty of memory, and it must have large storage, otherwise your disk could soon be full.

Your own computer may not satisfy all these requirements. In that case, try the cloud. I often use AWS EC2 to scrape US websites; I can get everything I need from EC2, and its cost is low and flexible.

Be Honest with Your Sponsor

If you are doing the scraping for yourself, you can ignore this part.  Read on if you are a freelancer or if you are doing it for your client or your boss.

Websites may not be accurate themselves. A landing page can say it has 10k records when it actually has only 9.5k. One may tell you that you'll get 1000 records in total when you query 20 records per page, yet when you choose 100 records per page to make the job run faster, you end up with only 900 records.

Find out the inaccuracies of the website and let your sponsor know. Let them know that it is not your fault that some records will be missing.

Sometimes you simply can't download all the records, due to the site's poor performance, technical difficulties, or time/budget limitations. Let your sponsor know that too. Ask what percentage of record loss they can tolerate, and agree on an acceptable deal.

Be honest with your sponsor. It's better that you find the problems than that they do.

When the scraping is done, provide your sponsor with a validation report. Tell them how many records are missing in each category (you can work this out by analyzing the final results and the logs) and provide links so they can check for themselves. Let them feel that the job is under their control.
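
One simple way to produce such a report is to compare, per category, the expected count you logged from the landing pages with the number of records you actually saved. A rough sketch, where both maps are assumed inputs built from your logs and results:

```java
// Print a per-category validation report: expected vs. actually scraped record counts.
import java.util.Map;

public class ValidationReport {

    public static void print(Map<String, Integer> expected, Map<String, Integer> actual) {
        System.out.println("category, expected, actual, missing");
        for (Map.Entry<String, Integer> e : expected.entrySet()) {
            int got = actual.getOrDefault(e.getKey(), 0);
            System.out.println(e.getKey() + ", " + e.getValue() + ", "
                    + got + ", " + (e.getValue() - got));
        }
    }
}
```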