Show HN: Natural Language Processing Demystified (Part One)

nlpdemystified.org

166 points by mothcamp 2 years ago

Hi HN:

I published part one of my free NLP course. The course is intended to help anyone who knows Python and a bit of math go from the very basics all the way to today's mainstream models and frameworks.

I strive to balance theory and practice and so every module consists of detailed explanations and slides along with a Colab notebook (in most modules) putting the theory into practice.

In part one, we cover text preprocessing, how to turn text into numbers, and multiple ways to classify and search text using "classical" approaches. And along the way, we'll pick up useful bits on how to use tools such as spaCy and scikit-learn.

No registration required: https://www.nlpdemystified.org/
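As a taste of the "turn text into numbers" step the course covers, here is a minimal bag-of-words sketch in plain Python (not taken from the course itself; just the standard counting idea behind classical vectorizers such as scikit-learn's CountVectorizer):

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to a vector of term counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts.get(term, 0) for term in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat", "the cat ate the fish"])
print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vecs[1])  # [1, 1, 1, 0, 2]
```

Once documents are vectors like these, "classical" classification and search reduce to linear algebra over the counts.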

jll29 2 years ago

NLP researcher here. It's great to see so many offerings of courses and tutorials, and NLP has made a lot of progress, in terms of both its science and its re-usable software artifacts (libraries & notebooks, standalone tools).

But what saddens me is that too many people try to dive into NLP without trying to understand language & linguistics first. For example, you can run a part-of-speech (POS) tagger in three lines of Python, but you will still not know much about what parts of speech are, which languages have which ones, or what function they serve in linguistic theory or practical applications.

What are the advantages of using the C7 tagset over the C5 or PENN tagsets?

Why is AT sometimes called DET?

etc.

I recommend people spend a bit of time reading an(y) introduction-to-linguistics textbook before diving into NLP; the second investment will then be worth so much more.
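To make the tagset point concrete, here is a toy sketch of projecting different corpus-specific tags onto a shared coarse tagset, which is roughly why an older article tag like AT can surface as DET. The tag inventories are heavily abbreviated and the mappings are illustrative only (real tagsets have dozens of tags):

```python
# Brown-style tagsets have a dedicated article tag (AT); Penn-style tagsets
# fold articles into determiners (DT); coarse "universal" tagsets call both DET.
BROWN_TO_UNIVERSAL = {"AT": "DET", "NN": "NOUN", "VB": "VERB"}
PENN_TO_UNIVERSAL = {"DT": "DET", "NN": "NOUN", "VB": "VERB"}

def normalize(tagged, mapping):
    """Project corpus-specific tags onto a coarse shared tagset."""
    return [(tok, mapping[tag]) for tok, tag in tagged]

brown_style = [("the", "AT"), ("cat", "NN")]
penn_style = [("the", "DT"), ("cat", "NN")]

# Two tagsets, one underlying word class:
print(normalize(brown_style, BROWN_TO_UNIVERSAL))  # [('the', 'DET'), ('cat', 'NOUN')]
print(normalize(penn_style, PENN_TO_UNIVERSAL))    # [('the', 'DET'), ('cat', 'NOUN')]
```

Knowing what each tagset distinguishes (and collapses) is exactly the kind of background the three-line tagger call hides.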

  • mywaifuismeta 2 years ago

    I'm generally not a fan of these kinds of high-level tutorials that tell you "use X library to get Y result" - it's just not good for learning, and any content that tries to sell you on learning ML/NLP/etc. in a few weeks is just that. I understand people want to make money by targeting a large audience, but it makes me sad to see the vast, vast majority of practitioners blindly applying libraries without any understanding of ML (or NLP).

    I don't think you necessarily need a linguistics background for NLP, but I think you need either a strong linguistics OR ML background so that you know what's going on under the hood and can make connections. Anyone can call into Huggingface, you don't need a course for that.

    • scarface74 2 years ago

      Everything eventually gets boiled down to libraries. The purpose of technology is to get things done. I could likewise say it makes me sad that today's developers use high-level languages without ever knowing assembly, and a chip designer could be saddened that assembly programmers never had to learn how processors are created.

      • mirker 2 years ago

        It’s fine when the library is a tight abstraction. Unfortunately, ML libraries are leaky.

        Example: take a classification model and change the output dimensions without understanding the model.

        • nmstoker 2 years ago

          Yes, the challenge people then face is that if they lack intuition for the subject, they can't spot obvious issues.

          We've all seen that ML practitioners don't necessarily need domain skills to solve a problem (i.e. I don't need to speak Vietnamese to make a "passable" ML translator), but it's not long before the lack of knowledge starts to show up as embarrassing shortfalls - being too arm's-length about any topic is a recipe for disaster!

    • Der_Einzige 2 years ago

      Doing non-trivial things (more than .train or .generate) with Hugging Face models definitely requires tutorials or other resources; not sure what you're on about at all.

  • amitport 2 years ago

    NLP is a vast field nowadays; you can solve a research problem with a novel transformer architecture (for example) without knowing anything about linguistics. There is plenty of room to go around (the same goes for vision - you don't really need a classical vision background as much as you used to).

    (also an NLP researcher. Knows nothing about linguistics)

  • screye 2 years ago

    It makes sense to completely disregard language when looking at modern NLP solutions. In some sense, 'hand-engineering' anything is looked down upon.

    Transformers and scaling laws have made it such that the only thing that truly matters is your ability to build a model that can scale computationally and parametrically. The second is figuring out how to make more data usable within such a hungry model's encoding.

    Look at the authors of the last 20 seminal papers in NLP, and almost none of them have a strong background in linguistics. Vision went through a similar period of forced obsolescence during the 2012-2016 AlexNet -> VGG -> Inception -> ResNet transition.

    It is unfortunate. But, time is limited and most researchers can only spare enough time to learn a few new things. Unfortunately for linguistics, it does not rank that high.

  • adamsmith143 2 years ago

    I'm not at all sympathetic to this viewpoint. The Deep Learning revolution has shown us time and time again that Deep Learning experts universally outperform subject-matter experts on modelling performance. I am almost 100% certain that the teams building big Transformers, which are now by far the best NLP models (OpenAI, Meta, Google Brain, DeepMind, etc.), are made up not of linguistics experts but of Deep Learning experts.

    • amitport 2 years ago

      These groups are not mutually exclusive.

      • adamsmith143 2 years ago

        Maybe not but I'd guess that in this context the marginal gain for learning more about Linguistics is going to be dwarfed by learning more about Deep Learning.

      • sam_lowry_ 2 years ago

        In practice, they are, AFAIK.

  • philophyse 2 years ago

    In your opinion, would George Yule's The Study of Language be a good introduction to linguistics? Or is there any other book that you would recommend to someone who has little knowledge of the field, but a lot of interest?

    • photonemitter 2 years ago

      Jumping in on this; I’ve found jurafsky/martin a good place to start. Covers a lot of ground and is a pretty good read as well.

      https://web.stanford.edu/~jurafsky/slp3/

      • ninjin 2 years ago

        As a somewhat established researcher in the field, I second Jurafsky and Martin. It is peerless and what I recommend to anyone joining my team if they think their background NLP knowledge is a bit on the weak side.

  • true_religion 2 years ago

    I don’t know if anyone wants to dive into NLP as much as they just want to solve their problem at hand.

    You are right that the lack of fundamental knowledge is problematic, especially since tools allow you to produce a greater quantity of solutions and therefore also a greater quantity of mistakes.

    However, at least the problem is still being solved.

    For example, a few months ago I wanted to organize my media collection by tagging files with artist names. I had a list of artist names, but it wasn't comprehensive, so I wired together a bunch of Python NLP libraries to automatically pull proper nouns out of filenames, recognize English names, and then annotate the files.

    I know almost nothing about parts of speech or anything else, so I made mistakes. About 10% of the results were errors in the first run, but after tuning it was down to about 1% which was good enough to run over the entire media library.

    If not for the tools, I would have never been able to finish that chore in a single day. To me, it was worth it despite my amateur mistakes.

    I view the library just like any other tool: a screwdriver, a hammer, a wrench. I'm not a plumber, a carpenter, or an NLP researcher, but I still want to use tools to fix my leaky faucets, remount my leaning cabinet doors, and organize my media collection as weekend projects.
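    A rough sketch of that kind of weekend pipeline is below. The artist list and the matching logic are hypothetical stand-ins; the commenter's actual setup used NLP libraries for proper-noun recognition rather than this simple regex heuristic:

```python
import re

KNOWN_ARTISTS = {"Miles Davis", "Nina Simone"}  # illustrative, not real data

def candidate_names(filename):
    """Extract runs of capitalized words from a filename as name candidates."""
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", filename)   # drop the extension
    words = re.split(r"[\s_\-]+", stem)
    runs, current = [], []
    for w in words:
        if re.fullmatch(r"[A-Z][a-z]+", w):
            current.append(w)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

def tag_artist(filename):
    """Match contiguous sub-spans of each capitalized run against the artist list."""
    found = []
    for words in candidate_names(filename):
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                cand = " ".join(words[i:j])
                if cand in KNOWN_ARTISTS:
                    found.append(cand)
    return found

print(tag_artist("Miles_Davis-So_What.mp3"))  # ['Miles Davis']
```

A heuristic like this makes exactly the ~10% of mistakes described above (e.g. song titles that look like names), which is where tuning or a real NER model comes in.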

  • LunaSea 2 years ago

    Is this still true in an era where most NLP problems use language models as a solution?

    • gattilorenz 2 years ago

      I think so. First of all, knowing some linguistics will teach you terms and concepts (e.g. parse tree, phrase, morpheme, phoneme, etc) that will both help you find relevant literature and avoid reinventing terms for stuff that is widely known (so others will more readily find your work).

      Language models are currently the best solution for many problems, but it's hard to predict how we will move forward from here. Maybe the inclusion of linguistic information, or linguistics-inspired knowledge, or whatever, will be the key to better results, or to saving training time/resources. With no linguistics background, I imagine it's hard to get ideas going in that direction (and to test whether it's actually a good direction).

      • mothcamp 2 years ago

        I agree. I think having linguistics knowledge can help, especially in applied situations. Linguistics knowledge can help create fallback systems when an ML system fails, help build rules to amplify or dampen the confidence of a response from an ML system, or aid in the engineering of the system (all that comes before or after the ML black box).

        Sort of like an algorithmic trader knowing market microstructure intimately (versus only pure statistics).
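        The fallback pattern described above can be sketched as a thin wrapper. Everything here is a hypothetical placeholder - the stand-in model, the negation cue list, and the threshold - but it shows the shape of "rules around the black box":

```python
NEGATION_CUES = {"not", "never", "no"}  # toy linguistic rule

def model_predict(text):
    """Stand-in for an ML sentiment model returning (label, confidence).
    A real system would call a trained classifier here."""
    return ("positive", 0.55)

def classify_with_fallback(text, threshold=0.7):
    label, confidence = model_predict(text)
    if confidence >= threshold:
        return label  # trust the model when it is confident
    # Low confidence: fall back to a simple hand-written negation rule.
    if NEGATION_CUES & set(text.lower().split()):
        return "negative"
    return label

print(classify_with_fallback("this is not good"))  # 'negative'
```

The same wrapper shape works for dampening confidence (returning "unsure" instead of a label) or for routing hard cases to a human.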

    • k8si 2 years ago

      Language models as a solution to what problems?

      Yes, you can easily use AutoModel.from_pretrained('bert-base-uncased') to convert some text into a vector of floats. What then?

      What are the properties of downstream (aka actually useful) datasets that might make few-shot transfer difficult or easy? How much data do your users need to provide to get a useful classifier/tagger/etc. for their problem domain?

      Why do seemingly minor perturbations like typos or concatenating a few numbers result in major differences in representations, and how do you detect/test/mitigate this to ensure model behavior doesn't result in weird downstream system behavior?

      How do you train a dialog system to map 'I'm good, thanks' to 'no'? How do you train a sentiment classifier to learn from contextual/pragmatic cues rather than purely lexical ones (example: 'I hate to say it, but this product solves all my problems.' - positive or negative sentiment?)

      How bad is the user experience of your Arabic-speaking customers compared to that of your English-speaking customers, and what can you do to measure this and fix it?

      My linguistics background really helps me think through a lot of these 'applied' NLP problems. Knowing how to make matmuls fast on GPUs and knowing exactly how multihead self-attention works is definitely useful too, but that's only one piece of building systems with NLP components.

      • riku_iki 2 years ago

        > My linguistics background really helps me think through a lot of these 'applied' NLP problems.

        There are many benchmarks where LMs absolutely outperform mechanical linguistics-based solutions.

        Do you have success stories where a solution significantly outperforms in the opposite direction?

        • k8si 2 years ago

          There's no competition between linguistics and ML/NLP, they have completely different goals as fields.

          I meant that my linguistics background helps me understand & solve problems: studying linguistic field work has helped me design crowd labeling jobs, knowing about morphology helps me understand why BPE tokenizers work so well (and when they might not), knowing about syntax/dominant word order makes me think that multilingual Bert should probably do something more intelligent with positional embeddings, methods from psycholinguistics are useful for understanding entropy/surprisal wrt LM next-word probabilities... just a few examples but the list could go on.

  • xtiansimon 2 years ago

    “I recommend people spend a bit of time to read an(y) introduction to linguistics textbook…”

    Linguistics is a broad area of study. Can you be more specific, such as grammar and syntax?

  • ad404b8a372f2b9 2 years ago

    "Every time I fire a linguist, the performance of the speech recognizer goes up"

    - Frederick Jelinek

  • vb234 2 years ago

    Could you recommend a good introduction to NLP book?

  • meristem 2 years ago

    Do you have specific book suggestions?

  • PainfullyNormal 2 years ago

    > I recommend people spend a bit of time to read an(y) introduction to linguistics textbook

    Do you have a favorite you can recommend?

mothcamp 2 years ago


  • irln 2 years ago

    The interface is great. Did you create the front-end/back-end from scratch?

    • mothcamp 2 years ago

      Thank you. Yep. It's all statically-generated pages using Next.js with a single Next.js API route for the subscription. All hosted on Netlify.

jasfi 2 years ago

I'm working on extracting facts from sentences, see https://lxagi.com.

Which are the toughest NLP problems you know of that aren't being solved satisfactorily?

  • Der_Einzige 2 years ago

    Queryable, word level, extractive summarization with grammatical correctness. AKA: what a human does when they are "highlighting" a document.

    Think extractive QA, but where the answer size is configurable, the answer can potentially be multiple spans, and the spans need not be contiguous.

    If you've got a solution, I'd love to see it - you could even beat the baselines for the only dataset that exists for it: https://paperswithcode.com/sota/extractive-document-summariz...
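    A crude baseline for this task might look like the sketch below: rank sentences by query-term overlap and return the top spans in document order. This operates at the sentence level rather than the word level and makes no attempt at the grammatical-correctness requirement, so it only illustrates the "queryable, possibly non-contiguous spans" shape of the problem:

```python
import re

def highlight(document, query, max_spans=2):
    """Return up to max_spans sentences, ranked by query-term overlap,
    in original document order (so selected spans need not be contiguous)."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    q_terms = set(query.lower().split())
    scored = []
    for i, sent in enumerate(sentences):
        overlap = len(q_terms & set(re.findall(r"\w+", sent.lower())))
        if overlap:
            scored.append((overlap, i, sent))
    top = sorted(scored, reverse=True)[:max_spans]
    return [sent for _, i, sent in sorted(top, key=lambda t: t[1])]

doc = ("Transformers changed NLP. The cafe serves coffee. "
       "Attention layers let transformers model long-range context.")
print(highlight(doc, "transformers attention"))  # skips the unrelated cafe sentence
```

Getting from this to word-level, grammatical, human-like highlighting is exactly the unsolved gap described above.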

    • jasfi 2 years ago

      Thanks, I'll add that to the list of possible use cases, although it will take additional time. The solution won't be ready anytime soon, so please sign up for the announcement list on the website if you're interested.

    • jasfi 2 years ago

      Can I ask more about your interest in NLP? How can I contact you?

  • riku_iki 2 years ago

    Actually, the problem you are working on doesn't look satisfactorily solved yet :-)

    • jasfi 2 years ago

      Thanks, good to know!

  • airstrike 2 years ago

    Getting an invalid HTTPS certificate

    • jasfi 2 years ago

      It works for me. Which browser are you using? Can you see the certificate?