Awk: What is wrong with CJK characters? #Korean

Question:

<strong>Given a .txt file</strong> with space-separated words such as:

But where is Esope the holly Bastard But where is 생 지 옥 이 군 지 옥 이 지 옥 지 我 是 你 的 爸 爸 ! 爸 爸 ! ! ! 你 不 會 的 !

And <strong>the Awk pipeline</strong>:

cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2" "$1}'

I get the <strong>following output</strong> in my console, which is <strong>invalid for Korean words</strong> (but valid for English and Chinese space-separated words):

생 16
Bastard 1
But 2
Esope 1
holly 1
is 2
the 1
where 2
不 1
你 2
我 1
是 1
會 1
爸 4
的 2

<strong>How can I get it to work for Korean words?</strong> Note: I actually have 300,000 lines and nearly 2 million words.

<hr />

EDIT: The answer I used:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt

Answer1:

A single awk script can handle this easily and will be far more efficient than your current pipeline:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file
옥 3
Bastard 1
! 5
爸 4
군 1
지 4
But 2
會 1
你 2
the 1
是 1
不 1
이 2
Esope 1
的 2
holly 1
where 2
생 1
我 1
is 2
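The one-liner above uses a regular expression as the record separator (`RS=" |\n"`). gawk and mawk accept that, but POSIX does not define the behavior of a multi-character `RS`. A minimal sketch of a variant that keeps the default record separator and loops over the fields of each line instead, assuming the words are separated by whitespace; `count_words` is a hypothetical name introduced only for illustration:

```shell
# count_words is a hypothetical helper name, not part of the original answer.
# It reads text on stdin and prints "word count" pairs in unspecified order.
count_words() {
    awk '{ for (i = 1; i <= NF; i++) seen[$i]++ }
         END { for (w in seen) print w, seen[w] }'
}

# Small sample taken from the question's input.
printf '%s\n' '지 옥 이 지 옥 지' '爸 爸 ! 爸' | count_words
```

As with the original command, the order of the output lines depends on the awk implementation, so pipe the result through `sort` if a stable order is needed.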

To store the results in another file, use redirection:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile
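As for why the original `sort | uniq -c` pipeline merged the Korean words, a likely cause (not confirmed in this thread) is locale collation. In some locales `sort` and `uniq` compare strings using the locale's collation rules, and distinct Hangul syllables can then compare as equal, so `uniq -c` folds them into a single entry. A sketch that forces byte-wise comparison with `LC_ALL=C`, assuming the input is UTF-8 text; `file.txt` is a placeholder file name:

```shell
# Assumption: the miscount comes from locale-aware collation in sort/uniq.
# LC_ALL=C makes both tools compare raw bytes, so distinct UTF-8 words stay distinct.
printf '생 지 옥 이 군 지 옥 이 지 옥 지\n' > file.txt   # placeholder sample file

tr ' ' '\n' < file.txt | LC_ALL=C sort | LC_ALL=C uniq -c | awk '{print $2" "$1}'
```

The awk-only approach avoids the issue in a different way: array keys are compared as exact strings, so no collation is involved in the counting step.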
