lucidiot's cybrecluster

chinese date format

Format

Characters

Character | Meaning
〇 | 0
零 | 0
一 | 1
二 | 2
三 | 3
四 | 4
五 | 5
六 | 6
七 | 7
八 | 8
九 | 9
十 | 10
二十 | 20
三十 | 30
年 | Year
月 | Month
日 | Day

jq implementation

I implemented a Chinese date parser using jq for itsb, as seen here.

The parsing methods are in two parts: parse_chinese_number, a parser that is only guaranteed to work from 0 to 99, and parse_chinese_date, which splits the date components and sends them to parse_chinese_number.

This parser only works with jq≥1.6, as jq 1.5 and earlier had some Unicode issues that caused most string manipulations in this parser to break.

Number parsing

In Chinese dates, years are always written digit by digit, that is “two zero two zero” for 2020 rather than “two thousand twenty”. This makes the parsing much simpler, as I do not need to know how thousands or hundreds are expressed in Chinese; I do, however, still need tens to handle months and days.

I first started by writing a parser that only handles single digits; I made an object mapping Chinese characters to their string digits, and just translated each character then re-concatenated.

($input // "")     # "二零二零"
| split("")        # ["二", "零", "二", "零"]
| map($charmap[.]) # ["2", "0", "2", "0"]
| join("")         # "2020"
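A minimal, self-contained sketch of this digit-by-digit step. The names charmap and parse_digits are illustrative and not necessarily the ones used in itsb, and mapping both 〇 and 零 to zero is an assumption:

```jq
# Hypothetical digit table: Chinese digit character -> ASCII digit string.
def charmap:
  {"〇": "0", "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
   "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"};

# Translate a string of Chinese digits into a string of ASCII digits.
def parse_digits:
  (. // "")          # treat a null input as an empty string
  | split("")        # "二零二零" -> ["二", "零", "二", "零"]
  | map(charmap[.])  # ["2", "0", "2", "0"]
  | join("");        # "2020"

# Usage: "二零二零" | parse_digits   gives "2020"
```

The result is still a string here; converting it to a number is a separate step.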

I then added 十 (10) to the map as an empty string, because it can simply be ignored when translating digit by digit. This only works in some cases:

Number | Parsed as | Expected | Actual
二十八 | 二八 | 28 | 28
十八 | 八 | 18 | 8
二十 | 二 | 20 | 2
十 | "" | 10 | Type error

I chose to ignore any case where more than one 十 appears, as my goal was only to parse numbers in the 1-31 range; 十十十一 is longer than 三一 or 三十一, so I can expect it not to be used.
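The rows of the table above can be reproduced with a small sketch in which 十 maps to an empty string. The names naive_charmap and parse_naive are hypothetical, and the result is kept as a string so that the empty case stays visible instead of failing in a numeric conversion:

```jq
# Hypothetical digit table with 十 mapped to an empty string.
def naive_charmap:
  {"〇": "0", "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
   "五": "5", "六": "6", "七": "7", "八": "8", "九": "9", "十": ""};

# Translate character by character; 十 simply disappears.
def parse_naive:
  split("") | map(naive_charmap[.]) | join("");

# "二十八" | parse_naive   gives "28"
# "十八"   | parse_naive   gives "8"   (expected 18)
# "二十"   | parse_naive   gives "2"   (expected 20)
# "十"     | parse_naive   gives ""    (expected 10)
```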

The remaining edge cases only occur when 十 is at the start or the end of the string, so I handled them as three special cases: 十 on its own, 十 at the start of the string, and 十 at the end of the string.

To avoid adding complexity to my feed parsing scripts, I also changed the map to map($charmap[.] // .), which passes unknown characters through unchanged. Combined with some checks made by the date parsing function, this makes it possible to parse both traditional and simplified formats without making many changes.
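One way the pieces described in this section could fit together, as a sketch rather than the actual itsb code. It assumes that a lone 十 means 10, that a leading 十 stands for a leading 1, and that a trailing 十 stands for a trailing 0; the function bodies and the charmap table are illustrative:

```jq
# Hypothetical digit table; 十 in the middle of a number is dropped.
def charmap:
  {"〇": "0", "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
   "五": "5", "六": "6", "七": "7", "八": "8", "九": "9", "十": ""};

def parse_chinese_number:
  ( if . == "十" then "一零"                            # 十 alone: 10
    elif startswith("十") then "一" + ltrimstr("十")    # 十八 -> 一八
    elif endswith("十") then rtrimstr("十") + "零"      # 二十 -> 二零
    else .                                              # 二十八: 十 is dropped below
    end )
  | split("")
  | map(charmap[.] // .)   # unknown characters are left as they are
  | join("")
  | tonumber;              # assumes only digits remain at this point

# "十八" | parse_chinese_number   gives 18
# "二十" | parse_chinese_number   gives 20
# "十"   | parse_chinese_number   gives 10
```

Because unknown characters are passed through, tonumber still fails on input that is not a number, so the caller is expected to filter its input first.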

Date parsing

The date components are split using a regex, to avoid sending too much garbage to parse_chinese_number in the event of a badly formatted date. Once the numbers are parsed, we get an object in this format:

{"year": 1234, "month": 12, "day": 3}
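A sketch of the splitting step alone, assuming dates shaped like 二零二零年二月八日. The regex and the names digits and split_chinese_date are illustrative, not taken from itsb; each captured string would then go through parse_chinese_number to give the numeric object shown above:

```jq
# Hypothetical character class covering the digits handled by the number parser.
def digits: "[〇零一二三四五六七八九十]+";

# Split a date into its three components using named capture groups.
# A string that does not match produces no output at all.
def split_chinese_date:
  capture("(?<year>\(digits))年(?<month>\(digits))月(?<day>\(digits))日");

# "二零二零年二月八日" | split_chinese_date
# gives {"year": "二零二零", "month": "二", "day": "八"}
```

Restricting the groups to known digit characters keeps stray text around the date away from the number parser.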

In some cases, investigation agencies use years from the Chinese calendar, where year 0 is year 1911 of the Gregorian calendar. I therefore added a check that adds 1911 years when the year is below 1900; this causes the parser to work properly only for years between 1900 and 2811 inclusive for dates using this calendar.
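The year check can be sketched as follows, operating on the parsed object. The name fix_year is hypothetical; the threshold and the offset are the ones given above:

```jq
# Shift years below 1900 by 1911; leave Gregorian years untouched.
def fix_year:
  if .year < 1900 then .year += 1911 else . end;

# {"year": 109, "month": 2, "day": 8} | fix_year
# gives {"year": 2020, "month": 2, "day": 8}
```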

I then use a rather simple method to get a Unix timestamp: "\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601. I could have used a more normal method such as strptime("%Y-%m-%d") | mktime, or built the same array that strptime returns, such as [1234, 11, 3, 0, 0, 0, 0, 0] | mktime, but that requires some particular handling as months are zero-based in this format.
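Both routes can be compared in a short sketch. The function names are hypothetical, and the first one relies on the platform's strptime accepting unpadded months and days, which is the case with glibc but is an assumption elsewhere:

```jq
# Build an ISO 8601 string from the parsed components and convert it.
def to_timestamp:
  "\(.year)-\(.month)-\(.day)T00:00:00Z" | fromdateiso8601;

# Alternative through mktime, using a broken-down time array like the one
# strptime returns: the month is zero-based, hence the - 1.
def to_timestamp_mktime:
  [.year, .month - 1, .day, 0, 0, 0, 0, 0] | mktime;

# {"year": 2020, "month": 2, "day": 8} | to_timestamp          gives 1581120000
# {"year": 2020, "month": 2, "day": 8} | to_timestamp_mktime   gives 1581120000
```

The string route avoids the month adjustment entirely, which is the appeal of the simpler method.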

Acknowledgements

Thanks to ~m455 for the Chinese dates crash course on IRC!


Licensed under Creative Commons Attribution 4.0 International. Generated on 2024-04-02T12:19:24+02:00 using pandoc 2.9.2.1.