76703

SQL: Find longest common string between rows

Question:

I have a table T1:

Col ------- 1 THE APPLE THE APPLE THE APPLE 123 THE APPLE 12/16 BEST THE APPLE

I want T2:

Result -------- THE APPLE

I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.

Answer1:

This question requires a fair degree of complication to solve and it's running time will drastically increase with increasing string lengths and number of records. However, given that your table T1 is rather small, you might just manage with the below PL/pgSQL function.

<h2>Algorithm</h2> <ol><li>Find the shortest value in T1(col). This is the longest possible match among all records. This is the candidate string.</li> <li>See if the candidate is present in all other rows of T1. If so, return the present candidate to the result set.</li> <li>Move the candidate over one position in the shortest value and back to step 2 until the candidate reaches the end of the shortest string.</li> <li>If matching candidates have been found, return from the function. Otherwise, shorten the candidate by 1 and start over from the beginning of the shortest string and move to step 2. If no more candidates can be extracted from the shortest string, then return NULL.</li> </ol><h2>Code</h2>

The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match col to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.

The string comparison is case-sensitive; If you want to make it case-insensitive, change LIKE to ILIKE. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a SELECT DISTINCT * to get rid of those duplicates.

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$ DECLARE shortest varchar; -- The shortest string in T1(col) so the longest possible match candidate varchar; -- Candidate string to test sz_sh integer; -- Length of "shortest" l integer := 1; -- Starting position of "candidate" in "shortest" sz integer; -- Length of "candidate" fail boolean; -- Has "candidate" been found in T1(col)? found_one boolean := false; -- Flag if we found at least one match BEGIN -- Find the shortest string and its size, don't worry about multiples, need just 1 SELECT col, char_length(col) INTO shortest, sz_sh FROM T1 ORDER BY char_length(col) ASC NULLS LAST LIMIT 1; -- Get all the candidates from the shortest string and test them from longest to single char candidate := shortest; sz := sz_sh; LOOP -- Check rows in T1 if they contain the candidate string. -- Short-circuit as soon as a record does not match the candidate <<check_T1>> BEGIN FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP EXIT check_T1 WHEN fail; END LOOP; -- Block was not exited, so the candidate is present in all rows: we have a match RETURN NEXT candidate; found_one := true; END; -- Produce the next candidate IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest" -- Exit if we already have at least one matching candidate EXIT WHEN found_one; -- .. otherwise shorthen the candidate sz := sz - 1; l := 1; ELSE -- Exit with empty result when all candidates have been examined EXIT WHEN l = sz_sh; -- .. otherwise move one position over to get the next candidate l := l + 1; END IF; candidate := substring(shortest from l for sz); END LOOP; RETURN; END; $$ LANGUAGE plpgsql IMMUTABLE;

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

<h2>Simple testing</h2>

Create some test data:

INSERT INTO T1 SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25); INSERT INTO T1 SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).

Answer2:

If you're okay with getting the most commonly appearing word among all rows (the most common word that is separated by a space), you could use:

select word, count(distinct rn) as num_rows from( select unnest(string_to_array(col, ' ')) as word, row_number() over(order by col) as rn from tbl ) x group by word order by num_rows desc

<strong>Fiddle:</strong> <a href="http://sqlfiddle.com/#!15/bc803/9/0" rel="nofollow">http://sqlfiddle.com/#!15/bc803/9/0</a>

Note that this finds the word apple among 4 rows, not 5. This is because APPLE123 is one word, whereas APPLE 123 would be two words, one of which is APPLE, and would count, but it doesn't.

Recommend

  • HTML5 video is playing in particular site but not in my domian ie
  • Scaling in flex - exclude some children?
  • Make this simple for/if block more pythonic
  • Minimum API for implementing Android Deep Linking
  • How to use commands only when a current command is triggered?
  • quick sort improvement using median of three robert sedwick
  • c++, evaluating expressions with multiple `&&`s and no operator of lower precedence
  • Python's equivalent of Java's Set.add()?
  • MySQL like query runs extremly slow for 5000 records table
  • How to reference a widget inside a specific tab of a QTabWidget?
  • Why is new Number(8) not exactly equal to 8?
  • What is the difference between Socket.Send and Stream.Write? (in relation to tcp ip connections)
  • Concise regex extract function in XSLT 2.0
  • Draw half infinite lines?
  • several dataProvider per one Test in TestNG
  • What causes the runtime difference in this trivial fortran code?
  • Find group of records that match multiple values
  • How to specify input and output paths from cmd.exe for a PowerShell script?
  • DIV instruction jumping to random location?
  • Cannot upload to OneDrive using the new SDK
  • How do I signal completion of my dataflow?
  • Convert Type Decimal to Hex (string) in .NET 3.5
  • Diff between two dataframes in pandas
  • What does 'Language neutral' mean with regard to MAKELANGID?
  • Moving Android View and preventing onDraw to be called over and over again
  • xtable package: Skipping some rows in the output
  • Parsing a CSV string while ignoring commas inside the individual columns
  • How to avoid particles glitching together in an elastic particle collision simulator?
  • Avoid links criss cross / overlap in d3.js using force layout
  • Recording logins for password protected directories
  • Sails.js/waterline: Executing waterline queries in toJSON function of a model?
  • Splitting given String into two variables - php
  • Check if a string to interpolate provides expected placeholders
  • Redux, normalised entities and lodash merge
  • When should I choose bucket sort over other sorting algorithms?
  • Do create extension work in single-user mode in postgres?
  • Acquiring multiple attributes from .xml file in c#
  • How to CLICK on IE download dialog box i.e.(Open, Save, Save As…)
  • How can I remove ASP.NET Designer.cs files?
  • java string with new operator and a literal